
Building a diagram recognition application with computer vision approach


DOCUMENT INFORMATION

Title: Building A Diagram Recognition Application With Computer Vision Approach
Authors: Huynh Tan Thanh, Nguyen Quang Sang
Supervisors: Dr. Nguyen Duc Dung, Dr. Tran Tuan Anh
University: Ho Chi Minh University of Technology
Major: Computer Science and Engineering
Document type: Graduation Thesis
Year: 2021
City: Ho Chi Minh City
Pages: 120
Size: 2.58 MB



VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

HO CHI MINH UNIVERSITY OF TECHNOLOGY

COMPUTER SCIENCE AND ENGINEERING FACULTY

——————– * ———————

GRADUATION THESIS

Building A Diagram Recognition Application

with Computer Vision Approach

Committee: Computer Science 1

Reviewer: Dr. Tran Tuan Anh

—–o0o—–

Nguyen Quang Sang 1752465


30"A亥w"8隠"nw壱p" p<"

ÐZ¤{"f詠pi"泳pi"f映pi"pj壱p"fk羽p"n逢w"8欝"u穎"f映pi"e ej"vk院r e壱p"e栄c"vj鵜"ik e"o {"v pjÑ

(Building A Diagram Recognition Application with Computer Vision Approach)

40"Pjk羽o"x映"*{‒w"e亥w"x隠"p瓜k"fwpi"x "u嘘"nk羽w"dcp"8亥w+<

- Investigate approaches in computer vision for diagram recognition problem

- Design the framework and the processing pipeline for the diagram recognition system

- Collect data and perform labeling tasks on the data

- Implement the recognition model, which uses both deep learning approaches and traditional computer vision algorithms in the pipeline

- Implement the mobile application

- Evaluate the application and the performance of the proposed system


HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY | SOCIALIST REPUBLIC OF VIETNAM

FACULTY OF COMPUTER SCIENCE AND ENGINEERING | Independence - Freedom - Happiness

40"A隠"v k<" Z¤{"f詠pi"泳pi"f映pi"pj壱p"fk羽p"n逢w"8欝"u穎"f映pi"e ej"vk院r"e壱p"e栄c"vj鵜"ik e"o {"v pj

(Building A Diagram Recognition Application with Computer Vision Approach)

- The students proposed a solution for the diagram recognition problem which utilizes the advantages of computer vision techniques and machine learning approaches. In addition, the students successfully built a mobile application that allows users to interact with the system more easily. The application was built with useful features and an easy-to-use interface.

- The students also carried out extensive evaluation and proposed some improvements to the recognition algorithm.

90"Pj英pi"vjk院w"u„v"ej pj"e栄c"NXVP<

- Some algorithms used in the project are not very advanced and may not be able to handle some difficult cases of the problem.

- The evaluation results are promising but still need further improvement, especially when investigating the various cases in recognition.

:0"A隠"pij鵜<"A逢嬰e"d違q"x羽"฀ D鰻"uwpi"vj‒o"8吋"d違q"x羽" Mj»pi"8逢嬰e"d違q"x羽"


40"A隠"v k< Building A Diagram Recognition Application with Computer Vision Approach

- The thesis presents a system that can convert handwritten flowcharts into digital documents.

- The system is built fairly full-featured and has a good application demo.

- The thesis covers quite a large amount of work, including recognizing shapes, handwriting, and arrows, and building a demo app.

- The thesis has experiments and is fairly fully cited. It also presents the algorithms and models in considerable detail.

90"Pj英pi"vjk院w"u„v"ej pj"e栄c"NXVP<

- This application requires many techniques combined, leading to a lot of work in many technical areas. This is also one of the weaknesses of the thesis, since research on some of these topics has not yet been strongly developed, for example handwriting entry. The team could focus on developing a few key techniques instead of all of them; for the rest, they could use existing results.

- The evaluation parameters are not detailed and user-oriented; for example, is the assessment of the arrows fair for all arrow types?

- The data used to train the model is not clearly presented. The application should explore usability more, adapting to the user, instead of only focusing on general accuracy.

- The models should be analyzed in more detail rather than simply used.

8. Recommendation: Approved for defense ☐   Revise and supplement before defense ☐   Not approved for defense ☐

c. What is your next research priority?

320"A pj"ik "ejwpi"*d茨pi"ej英<"ik臼k."mj "VD+< Gi臼k Ak吋o"<"""""8.7 /10

M#"v‒p"*ijk"t "j丑"v‒p+


We hereby declare that this is our own research project, conducted under the guidance of Dr. Nguyen Duc Dung. The research content and results are truthful and have never been published before. The data used for the analysis and comments were collected by us from many different sources and are clearly stated in the references.

Additionally, the reviews and figures of other authors and organizations that we use have their citations and origins clearly stated in the report.

If any fraud is detected, we take full responsibility for the content of our graduation thesis. Ho Chi Minh City University of Technology is not responsible for any copyright infringement caused by us during the implementation process.

Nguyen Quang Sang
Huynh Tan Thanh


Graphical language has been and always will be one of the most effective tools for demonstrating ideas to others. Besides text and images, a flowchart plays a vital role in giving people a clearer view of a plan or a process using simple symbols and notations. Nowadays, many meetings still follow the traditional way of using boards and paper to draw diagrams expressing the participants' thoughts on the topics discussed. A problem occurs when saving these drawings as references for future purposes, since we cannot edit a diagram taken from a picture. These drawings need to be redrawn with some tool to be suitable for professional documents. The redrawing tool can be a computer or a particular device like an electronic drawing board with a digital pen, which costs a lot and is not the most convenient tool to use.

Therefore, a new approach is needed to convert hand-drawn chart pictures into digital ones. Such an approach can help us avoid redrawing tasks, simplify the sharing process between users, and export diagrams into other forms like picture files (png, jpg), document files (pdf), or standard diagram editing files (drawio). The application must be able to run on popular platforms and be accessible to everyone.


Contents

2.1.1 Object recognition
2.1.2 Diagram tools
2.1.3 Diagram recognition applications on mobile devices
2.2 Diagram recognition
2.3 Handwriting Text recognition
2.3.1 Preprocessing Phase
2.3.2 Recognition Phase

3 Background
3.1 Faster R-CNN
3.1.1 Backbone CNN
3.1.2 Regional Proposal Network
3.1.3 Non-Maximum Suppression
3.1.4 Region of Interest Pooling (RoI Pooling)
3.2 Mask R-CNN
3.2.1 Object Mask (Binary Mask)
3.2.2 Feature Pyramid Network
3.2.3 Region of Interest Align (RoI Align)
3.3 Handwriting Text Recognition
3.3.1 Long Short Term Memory (LSTM)
3.3.2 Gated Recurrent Unit (GRU)
3.3.3 Bidirectional RNN (BRNN)
3.3.4 Connectionist Temporal Classification (CTC)

4 Proposed model
4.1 Diagram Recognition Approach
4.1.1 Preparing diagram dataset
4.1.2 Recognition model
4.1.2.1 Feature map generator
4.1.2.2 Proposal generator
4.1.2.3 Instance generator
4.1.3 Diagram building
4.1.4 Symbol-Arrow relationship
4.1.5 The relationship of text
4.2 Handwriting Text Recognition Approach
4.3 Digital diagram output format

5 System design
5.1 Requirements
5.1.1 Functional requirement
5.1.2 Nonfunctional requirement
5.1.3 Hardware requirement
5.2 System Architecture
5.3 Framework
5.3.1 Flutter
5.3.2 Nodejs
5.4 Database Design
5.4.1 Diagram File Design
5.5 Feature design
5.5.1 Usecase Design
5.5.2 Login/Register Screen
5.5.3 Diagram List
5.5.4 Create diagram
5.5.4.1 Diagram Scanning
5.5.4.2 Create from blank
5.5.5 Diagram Editing
5.5.6 Exporting
5.5.6.1 Converting to drawio files
5.5.7 Member Management
5.5.8 Version and History

6 Experiments
6.1 Initial experiments
6.1.1 Preprocessing
6.1.2 Recognition
6.2 Experiments on the recognition pipeline
6.2.1 Perform training and evaluation on HTR model
6.2.2 Perform training and evaluation on diagram recognition model
6.2.3 Perform experiments on the combination of diagram recognition model and HTR model
6.3 Display diagram on device
6.3.1 Interactive Viewer and Matrix4
6.3.2 Rendering diagram recognition on device

7 Conclusion and Future Work
7.1 Conclusion
7.2 Challenges
7.3 Future work

B.3 Editing Screen
B.3.1 Vertex List
B.3.2 Zoom View
B.3.3 Edit Option
B.4 Diagram history

C Testing
C.1 Login and register
C.2 Home screen options
C.3 Editing


List of Tables

4.1 Statistics of DIDI images [26]
6.1 Number of symbols in dataset
6.2 Measure Arrow Average Precision
6.3 Evaluation summary of the two models
A.1 Usecase List
A.2 Usecase: Login
A.3 Usecase: Sign up
A.4 Usecase: Create new diagram
A.5 Usecase: Scan diagram with camera
A.6 Usecase: Scan from image
A.7 Usecase: Preview Diagram
A.8 Usecase: Export file
A.9 Usecase: Modify diagram
A.10 Usecase: Delete Diagram
A.11 Set permission
A.12 View version history
A.13 Comment
A.14 Usecase: Logout


List of Figures

3.5 Mask R-CNN architecture [51]
3.6 Binary mask sample in diagram recognition
3.7 Feature Pyramid Network [53]
3.8 Region of Interest Align [54]
3.9 Long Short Term Memory [56]
3.10 Gated Recurrent Unit [57]
3.11 Bidirectional RNN [58]
3.12 Horizontal position of characters [60]
3.13 Character-score Matrix [60]; the black lines present the paths to get character "a" ("aa", "a-" and "-a"), while the dashed line presents the blank character ("--")
4.1 DIDI sample
4.2 An original sample of the FC dataset (left) and the preprocessed result (right)
4.3 Our drawn image (left) and the preprocessed result (right)
4.4 Our pipeline
4.5 Feature Pyramid Network with ResNet [63]
4.6 A prediction of our model
4.7 Example of Euclidean Distance not working
4.8 Preprocessed samples
4.9 Line segmentation sample
4.10 Diagram recognized image
4.11 Model JSON output
5.1 System Architecture Design
5.2 Database Design
5.3 Diagram JSON file design
5.4 Usecase Design
5.5 Login/Register Screen
5.6 Home Screen
5.7 Scanning sequence
5.8 Scanning phase
5.9 Flowchart symbols
5.10 Edit Screen
6.1 Testing pictures
6.2 Experiment results of figure 6.1a: (a) the warped image of figure 6.1a after applying perspective transformation and grayscale conversion; (b) the binary image converted from (a)
6.3 Experiment results of figure 6.1b: (a) figure 6.1b after resizing and grayscale conversion; (b) the binary image converted from (a)
6.4 Inference results from model training with DIDI dataset images only
6.5 Inference results from model training with new dataset
6.6 Loss and validation loss over epochs of the HTR model
6.7 Loss over iterations of the diagram recognition model
6.8 Inference results at above 0.6 score
6.9 (a) Normal text box and (b) padded text box
6.10 Small boxes in sub function
6.11 Inference with problem drawings
6.12 Example of matrix4
6.13 Interactive space using identity matrix
6.14 Scaling in X and Y
6.15 Scaling in Z
6.16 Moving the space
6.17 Diagram display result
6.18 Rendering drawn diagram pictures on a mobile device
B.1 Login Screen
B.2 Register Screen
B.3 Diagram List
B.4 Turn off save a copy
B.5 Scanning
B.6 Export Sheet
B.7 Management Sheet
B.9 Add vertex
B.11 Zoom options


Chapter 1

Introduction

Diagrams have quickly risen to become one of the most efficient communication methods, replacing text for demonstrating certain types of information such as algorithms, business process models, and production structures. Ideas visualized by diagrams are clearer than any words can make them, which helps viewers easily comprehend the key ideas, how things work, and so on. Additionally, people tend to process information visually and can remember graphical information more readily than anything they read. The powerful effect of diagrams can be seen in a common event we often attend: presentations. It is a nightmare for the audience if a presentation uses only words and numbers to convey the knowledge. The inability to absorb the raw knowledge in a limited time means most of it is lost, and the failure of the presentation is inevitable. On the other hand, with informative diagrams or pictures, a presentation becomes more catchy and comprehensible, helping audiences understand the illustrated ideas faster compared with text.

Nowadays, due to the benefits of diagrams, various services have been created to serve the purpose of creating diagrams, covering diverse diagram types and a wide range of supported platforms such as web, desktop, and mobile. Among the most popular are the Lucidchart website, the draw.io website, DrawExpress Diagram Lite for Android, etc. However, an idea or plan is rarely created in these applications at the beginning. It is usually sketched on paper or a whiteboard in meetings. These initial conceptions are crucial for building bigger, more complicated designs on a greater scale. So, in order to make them global, or simply to share them with everyone in a group, they need to be digitized. Organizations and companies usually have people redraw these ideas on the computer and export them to editable files, yet this raises several potential problems. First of all, it is a waste of time and resources, since the design on paper needs to be replicated exactly to preserve the original proposal. In addition, there are still limited options for saving these sketches and transforming them into a digital form for storing and referencing in the future. Not to mention that sometimes, due to technical problems, these jobs cannot be done by the creator, the one who understands the design best, which can lead to information loss or misconceptions.

Realizing the need for a product that saves these raw drawings, we decided to build an application that is able to convert hand-drawn ideas into digital form. This app should also offer other diagram-based services, such as sharing diagrams across people in a group and editing and modifying the digital form of the sketched diagram. For this product, in this proposal, we need to carry out some surveys in the area of object detection and of the existing diagram-related applications, for the services our project can offer to users.


In the field of object detection, many approaches have been developed for various types of objects. These approaches fall into two categories: two-stage and one-stage. Two-stage detectors, such as Faster R-CNN and Mask R-CNN, have a Region Proposal Network (RPN) that generates the regions of interest (RoI) in the first stage, then uses these regions for object classification and bounding-box regression. These approaches reach the highest accuracy, but they are slow. On the other hand, one-stage approaches like YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector) do not have a stage that generates box proposals; they classify the objects and estimate their size with bounding boxes in one go. These approaches are faster than two-stage ones, but their accuracy is lower. For diagram recognition, there are many small symbols, such as arrows, which need a high-accuracy approach to detect rather than a fast one. Among the two-stage methods, we found some attempts at the detection of diagrams, especially flowcharts. These studies divided flowchart recognition into two main groups: online and offline recognition. In online recognition, the diagram is drawn as a sequence of strokes using an ink input device such as a digital pen. The approach for this group often takes the form of an RNN, and many studies have focused on this field [1, 2, 3, 4, 5, 6, 7, 8, 9]. In offline recognition, the target is a handwritten diagram in an image captured by a camera. There were fewer attempts [10, 11] in this group until 2019-2020, when a paper introduced a new approach called Arrow R-CNN [12]. Arrow R-CNN is designed for offline diagram recognition, whose targets include the detection of text, non-arrow symbols like process and decision, and the arrow symbols that express the relations between the non-arrow ones. The paper built its system with Faster R-CNN as the base. By using a Feature Pyramid Network (FPN), it also handles a limitation of Faster R-CNN, namely datasets containing objects with a large scale variance. For the arrows, the paper uses keypoint prediction, often used to detect human pose in another field of object detection, to deal with diagram arrows. However, the approach only detects the regions of text in a diagram, so we also need a handwritten character recognition approach to digitize the text in the diagram. These approaches are described more clearly in Chapter 2, and we dive deeply into the details of their layers for better understanding in Chapter 3.

1.2 Project Goals

This project's main target is to build a diagram recognition system consisting of two parts: a computer vision model that can convert a hand-drawn flowchart from a source image into a digital form, and an application service that utilizes this model to serve real-life projects. The service should include these features and restrictions:

• Develop a system that includes a server to process and run the recognition models, and an Android application using the Flutter framework for users to scan and modify diagrams.

• Users need an internet connection and a system account in order to access any function of the service. The app is best used in portrait mode.

• Within the scope of this thesis, the system only supports creating and editing flowcharts. The mobile application can be used to take a picture of a diagram, scan it, and create a digital version. Users can modify the diagram with the provided tools.

• The system only supports converting and editing small charts (an A4 page with medium-size text is recommended, drawn with a ball pen on fresh white paper), and the flowchart may contain a maximum of 20 symbols, not including arrows. The flowchart must not have any


each role must have specific permissions and is only allowed to perform certain actions.

• Each time a new change is saved, it is uploaded as a new version. Users can then review previous versions and discuss within the app.

• The camera of the device should be in good condition, and the surface on which the diagram is drawn should be clean, flat, and distinguishable from the background.

• All of the permissions required by the app should be granted for the best experience.

• The device used to run this application should have adequate performance and suitable hardware (discussed in detail in Chapter 5).

The report is organized as follows. Chapter 2 briefly surveys applications that can detect objects, and related work in diagram detection in general and flowchart detection in particular. Chapter 3 provides sufficient knowledge to implement the project and understand the related work. Chapter 4 presents our proposed model. Chapter 5 shows our proposed system design, including how the application works and the implementation of the application and server. Chapter 6 lists our experiments and the implementation of the system. Finally, Chapter 7 discusses our challenges and potential future work for the thesis.


to detect object bounding boxes, and a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) to recognize text in images. Another example is Microsoft Math Solver [14], released by Microsoft in December 2019. This application is designed as a service that uses optical character recognition (OCR) approaches to read an image of a handwritten math problem from students, solving the problem of typing complex formulas or expressions. The application's main services focus on using OCR and natural language processing techniques to categorize letters, math symbols, and characters. There are also other applications that can detect certain types of objects, such as Aipoly Vision [15], which recognizes objects and colors to help blind, visually impaired, and color-blind people, and Vivino [16], which provides information related to wine from an image source.

2.1.2 Diagram tools

Two of the most popular tools used to create diagrams and diagram-related products are Lucidchart [17] and draw.io [18]. These applications provide a multi-platform service to create many forms of digital charts for users to express their ideas as freely as possible. Users can also share their work with ease across many accounts, platforms, or even devices. Ultimately, real-time collaboration can connect these users to break the limits of distance and time. Users are not limited to one or two kinds of content they can access. These tools give a


As for specific devices like mobile phones, smartwatches, or smart bands, their popularity is so enormous that a company that can bring its services onto these little inseparable technologies has a huge advantage in approaching customers, and using the service can soon become part of a user's daily routine, which can bring great benefit to the company. This is also true for diagram services on mobile like Lekh Diagram [19] and DrawExpress [20]. Imagine having a whiteboard in your pocket that is ready every time a new idea pops up in your mind. However, the worst deal-breakers for these services are the accuracy and the limited ability to interact with the device. Most phone users use their fingers to touch, slide, and scroll on the screen to communicate with their devices. These actions may work well on a bigger screen or with a pen, where one can interact freely without being bounded by the screen area or the precision of the pen tip. However, on a much smaller screen and with a much bigger fingertip, any simple task like drawing or choosing a slightly small object can become impossible and bring a poor user experience. Not to mention, many people are used to writing on paper with a pen instead of on the phone. All of these drawbacks explain why there are not many applications designed for professional design work on mobile platforms.

So why are the services mentioned earlier still working well? Most of them allow users to simplify the amount of work they need to visualize their ideas. The most straightforward actions on mobile devices (touch, slide, rotate, drag and drop) are now being used to enhance interaction and free users from being cramped by traditional computer interaction. Drag and drop has become the most popular action in design applications, instead of selecting a location and adjusting. Furthermore, Lekh Diagram and DrawExpress also allow users to sketch a shape and beautify it: artificial intelligence recognizes the form and tells the system to change it into the correct one in real time. A user can draw an imperfect triangle connected to a messy square and still get the precise result they want, thanks to the real-time recognition feature.

However, most of the existing systems and services right now focus only on real-time recognition. That means users still need to input the diagram themselves even though they have already drawn it somewhere else. In addition, some people may find that working on actual paper or a whiteboard makes their work much more convenient. It would be a bad experience to do the same job again just to digitalize one's work.

We realized that many people prefer to create their work on paper using pens and traditional tools, yet still need to modernize the workflow by digitalizing the result, putting it online, and sharing it with others. Our system is designed to capture a hand-drawn diagram with the device's camera and put it onto the device, which users can carry along, edit, share, and discuss with their team.

2.2 Diagram recognition

As we mentioned in Chapter 1, there are two types of diagram recognition: Online Diagram Recognition and Offline Diagram Recognition. In the past, more attention was paid to Online Diagram Recognition, which handles diagrams handwritten with digital ink devices.


Firstly, Valois et al. (ICDAR 2001)[3] proposed a solution for online recognition of sketched electrical diagrams. The proposed system tried to decompose the ink strokes into primitive components (lines or arcs). Then, the system checks whether it can merge these primitives and their neighbors into a higher-level component. Each set of relations predefined for the primitives is recognized by matching the confidence factor using probabilistic normalization functions. Its downsides are the system's simplicity and its low accuracy, which make it unsuitable for real-life situations. Feng et al. (j.patcog 2009)[4] proposed a more modern technique for recognizing electrical circuits: symbol hypotheses are generated and classified using a Hidden Markov Model (HMM) and traced with 2D dynamic programming. However, when dealing with a large diagram or a huge number of hypotheses, it becomes slow; thus, it is also not an approach we can use in our project. ChemInk by Ouyang and Davis (IUI 2011)[21] is a system for detecting chemical formula sketches, categorizing strokes into elements and the bonds between them. The final joint inference is performed using conditional random fields (CRF), which combine features from a three-layer hierarchy: ink points, segments, and candidate symbols. Qi et al. (CVPR 2005)[22] apply a similar approach to recognize diagram structure with Bayesian CRF - ARD. These methods outperform traditional techniques, but in the final recognition step they join features pairwise, making them harder to adapt in the future. In addition, these approaches focused only on the symbols; they did not mention text in the diagram, while many words and letters appear in diagrams in real-life situations.

After Awal et al. (SPIE 2011) released the Online Handwritten Flowchart Dataset (OHFCD)[23], many researchers concentrated on a new target, the flowchart, using this dataset. The next approaches improved on earlier ones in that they also handled text, proposing methods to classify text and non-text symbols. Lemaitre et al. (2013)[5] proposed DMOS (Description and MOdification of the Segmentation) for online flowchart recognition. The work of Wang et al. (ICFHR 2016)[6] used a max-margin Markov Random Field to perform segmentation and recognition. In Wang et al. (IJDAR 2017)[7], they extend their work by adding a grammatical description that combines the labeled isolated strokes while ensuring global consistency of the recognition. Bresler et al. (ICDAR 2013)[8] proposed a pipeline model, where they separate strokes and text using a text/non-text classifier; then they detect symbol candidates with a max-sum model over groups of temporally and spatially close strokes. The authors also proposed an offline extension that uses a preprocessing model to reconstruct the strokes from the flowchart [9]. While online flowchart recognition detects candidates based on strokes, offline flowchart recognition recognizes the targets from an image source. Bresler also made some attempts at offline flowchart recognition, providing a preprocessing stage to reconstruct online strokes from offline data [10]. However, that preprocessing step wastes time, because we can recognize the whole diagram structure independently of strokes. As online recognition seems to attract more researchers, there have not been many studies on offline detection. Julca-Aguilar and Hirata proposed a method using Faster R-CNN to detect candidates and evaluated its accuracy on OHFCD in [24]. Using this approach, they need to convert the online data to offline, so we can also consider it an offline approach. The model can detect components in the diagram, including arrows, but it cannot detect the arrowhead. Not until late 2019 and early 2020 was there a new attempt at offline flowchart recognition. The paper introduces a new model called Arrow R-CNN [12], which improves on Faster R-CNN. Faster R-CNN has a limitation when it works with datasets whose objects have a large scale variance. To handle this problem, the author added a Feature Pyramid Network to the backbone of the model. With this approach, the backbone generates a pyramid of feature maps at different scales; the image feature pyramid is a multi-scale feature representation in which all levels are semantically strong.


The approach using Weighted Euclidean Distance is essential to take into account the direction of the keypoint vector of each arrow. However, the method of connecting arrows with arrows, which the Arrow R-CNN author used to handle the problem caused by arrow intersections, is not usable, because it increases the number of arrows, making the model slow when the algorithm is applied to a large number of arrows. Therefore, we decided to reuse some approaches from his work while trying to enrich the knowledge of the model.
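The excerpt does not show the exact weighting used; the following is only a minimal sketch of the general idea, assuming each arrow endpoint comes with a direction vector and that deviation across the arrow's direction should cost more than distance along it. The function name and weights are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def nearest_symbol(endpoint, direction, symbol_centers, w_along=1.0, w_across=2.0):
    """Attach an arrow endpoint to the closest symbol center using a
    weighted Euclidean distance that penalizes deviation from the
    arrow's direction more than distance along it (assumed weighting)."""
    d = np.asarray(direction, float)
    d = d / (np.linalg.norm(d) + 1e-9)        # unit vector along the arrow
    n = np.array([-d[1], d[0]])               # unit normal (across the arrow)
    best, best_cost = None, float("inf")
    for idx, c in enumerate(symbol_centers):
        v = np.asarray(c, float) - np.asarray(endpoint, float)
        cost = w_along * (v @ d) ** 2 + w_across * (v @ n) ** 2
        if cost < best_cost:
            best, best_cost = idx, cost
    return best

# an arrowhead at (5, 5) pointing right, and three candidate symbol centers
print(nearest_symbol((5, 5), (1, 0), [(9, 5), (5, 9), (1, 1)]))  # -> 0
```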

There are still few resources for training and evaluating an offline diagram recognition model. However, there is a large amount of data in the online diagram recognition field, as we mentioned above. We can convert these datasets for our purpose, so we need to research these online datasets. Firstly, the OHFCD contains five symbols: terminal, input/output, connector, arrow, and decision. However, we had problems accessing this data source at first, so we chose the Digital Ink Diagram dataset (DIDI) [26] instead. This dataset has fewer symbol types than the OHFCD: it does not have the connector symbol, and it has an octagon symbol, which is not related to flowcharts. Despite these downsides, we can access it easily, and it contains a large number of images. Because these are online diagram datasets, we have to relabel them for our model using the tool labelme [27].

2.3 Handwriting Text recognition

Handwriting Text Recognition (HTR) is a domain in image processing and pattern recognition which has become attractive and complicated nowadays due to its applications. HTR systems' primary purpose is to transcribe handwritten characters and words into digital form so that people can electronically amend, store, and search them more efficiently and correctly. Handwritten text can appear in many document sources such as invoices, information forms, historical manuscripts, etc. We generally store these sources as images for processing, which we can group as offline information. At the same time, HTR can also transcribe the online information type, which contains the strokes of text written using electronic devices. We concentrate on the offline recognition method, which involves dealing with text image data. The procedure consists of two main phases: image pre-processing and handwriting character recognition.

2.3.1 Preprocessing Phase

Firstly, pre-processing is a necessary procedure applied to images before they become inputs of the HTR model, and it has a critical impact on the performance of the model's output. This process is a chain of actions that improves the image's quality and normalizes pictures appropriately for feature extraction. The pre-processing pipeline often consists of: scanned input image, noise removal, binarization, edge detection, dilation and filling, and the processed image for feature extraction [28]. However, we change the pipeline into noise removal and normalization followed by image processing for feature extraction, because the binarization step can also be part of noise removal.


Next is image normalization; in this part we focus on the text in the image. Two popular techniques for this step are skew correction and slant correction. In a handwritten document, the words are not always horizontally aligned, which makes skew correction come in handy. The process can apply the alignment modification at the document or line level. The other technique is slant correction, or deslanting, which transforms the cursive style of written text into an upright form. The technique involves a shear transformation and pixel shifting by a distance d or −d. There have been many attempts to remove the slant in text images. First is the algorithm provided by Bozinovic and Srihari, which needs two tuning parameters, Max Run (MR) and Strip Height (SH). The technique removes the rows containing a continuous stroke longer than the parameter MR. After that, the process discards horizontal strips formed from the remaining lines if their height is lower than the value of SH. This part of the algorithm is a disadvantage, because the parameter tuning requires many experiments to select proper values for specific image data. Another deslanting approach, proposed by Guillevic and Suen [30], is based on the slanted histogram. The approach finds the average slant by looking for the most significant positive value of the first derivative over all the computed slanted histograms. This approach is simple, and it performs relatively well if we can make use of parallelism in calculating the slanted histograms for each rotation angle. However, the fact that the approach observes only the most significant stroke as the representation of the whole image is a downside of the algorithm.

We find the algorithm in the work of A. Vinciarelli and J. Luettin [31] more interesting. The method follows the hypothesis that the image is deslanted when it contains the most significant number of columns with a continuous stroke. The approach produces a sheared version of the image for each angle α. It determines which columns contain a continuous stroke by counting the foreground pixels in each column (the vertical density). Then, the process selects the sheared version that maximizes the sum of the squared vertical densities of the columns with continuous strokes. By squaring the vertical density, a column containing a longer stroke contributes more than one with a shorter stroke in the selection process. This algorithm observes the characteristics of the image through all the columns containing continuous strokes, which is an upside compared to the algorithm of Guillevic and Suen (which uses only a single stroke as the image representation). The computation in the method is also straightforward, because the two steps only involve counting the pixels in each column. Although the researchers mentioned that calculating a shear transformation for each angle is heavy, we find it acceptable with the help of parallel computation.


2.3.2 Recognition Phase

The recognition phase consists of two main components: the feature extraction, which extracts features from the input image, and the classification, which receives the feature sequence to match it with a character sequence. Hidden Markov Models (HMM) are one of the most popular solutions for this problem. However, the method predicts the future state depending only on the current state, while past states have no impact on the prediction, so HMMs cannot utilize the information in a long character sequence. In the last few years, a more modern approach, the Convolutional Recurrent Neural Network (CRNN), has played an essential role in HTR work. The CRNN consists of a backbone CNN and a sequence-decoder RNN. The RNN is commonly a Long Short-Term Memory (LSTM); another type of RNN is the Gated Recurrent Unit. There are many improvements to these RNNs. For example, the Multidimensional LSTM (MLSTM) helps the RNN architecture work with multidimensional data; however, this method is complex and has a high computation cost, so a simpler technique, the Bidirectional LSTM (BLSTM), was proposed. The Bidirectional RNN offers nearly the same performance as the Multidimensional one. Some models, such as CNN-BLSTM by Puigcerver et al. (ICDAR 2017)[32] and Gated-CNN-BLSTM by Bluche et al. (ICDAR 2017)[33], give good performance in [34]. However, optical models often contain millions of trainable parameters to reach high performance, which makes them heavy for further applications, while the Gated CNN method, having fewer parameters, loses some performance. Because of the high computation cost of the diagram model, we chose the approach for the HTR system by Neto et al. (SIBGRAPI 2020) [35].

The paper proposes a lightweight model with thousands of trainable parameters and improvements compared to the two models mentioned; its architecture is shown in Figure 2.1. The model follows the CRNN architecture, with a Gated CNN backbone and a BGRU decoder. The Gated CNN approach is from Dauphin et al. [36]: instead of feeding the whole input into a sigmoid function and then multiplying the input by the sigmoid's output, it performs a pointwise multiplication between two halves of the input features, h1 with sigmoid activation and h2 without activation. For the recurrent block, the model uses BGRU instead of BLSTM. These two layers derive from the GRU and the LSTM, both created to solve the problem of vanishing and exploding gradients. The two RNNs nevertheless have their own structures and gating mechanisms, which distinguish their characteristics. The architecture of the GRU is not as complex as the LSTM's, and the GRU has fewer parameters. Additionally, the GRU performs forgetting and memory selection with one gate (the update gate), while the LSTM requires more gates for the same task. These are the reasons the GRU is lighter and faster than the LSTM. However, in scenarios where performance matters more than speed, the LSTM can surpass the GRU, especially on short sequences and large datasets, while the two perform nearly the same on long sequences [37]. Therefore, the choice between GRU and LSTM depends on the problem. In our case, the GRU is the better choice to keep the OCR system lightweight and fast when we connect it to our diagram model, which requires heavier computation.

The datasets used for evaluating the chosen HTR model are the Bentham database, the IAM database, RIMES, Washington, and Saint Gall [38, 39, 40, 41, 42]. They cover several languages.


Figure 2.1: Flor-HTR architecture [35]

Bentham, IAM, and Washington are in English, RIMES in French, and Saint Gall in Latin. Our target is English character recognition, which leads us to Bentham, IAM, and Washington. However, the Washington dataset has only 656 images, while the Bentham database has 11,470 images and the IAM database 8,922 images (these images contain a single text line only). Therefore, Bentham and IAM are enough for our project. Additionally, these two datasets include text with a free cursive style, and the capture conditions of Bentham are worse than IAM's, which is why we chose these two datasets. Following the HTR paper, we apply the deslanting algorithm mentioned above to both datasets to soften the cursive style of the characters, and use Illumination Compensation [43] on the Bentham dataset to balance brightness and light contrast due to its condition.


(a) Bentham dataset sample

(b) IAM dataset sample

Figure 2.2: Dataset samples


Chapter 3

Background

In this chapter, we provide the basic knowledge of the techniques we can use in our project. This background will help us summarize the main information and characteristics of each technique and model related to the surveyed work in Chapter 2.

3.1 Faster R-CNN

3.1.1 Backbone CNN

The backbone CNN is the first part of the Faster R-CNN model, which images go through first. It plays the role of a feature extractor: it encodes the image input and presents it as a feature map. The better the CNN is, the higher the accuracy the model reaches. We choose ResNet101 for the backbone, as in the paper [12]. ResNet101 is a CNN architecture from the Residual Network family [44], which has been popular recently. Most ResNet structures are almost the same; they are distinguished by the number of blocks in their bodies. Therefore, we can use ResNet50 in Figure 3.1 to describe ResNet101, since ResNet101 is very large. These blocks come in two types: the convolution block and the identity block. The two have nearly the same structure, containing two paths: the main path and the shortcut path. The former contains several convolution layers:

1. Conv layer, kernel size 1×1, with Batch Normalization (BatchNorm) and Rectified Linear Unit (ReLU)

2. Conv layer, kernel size k×k, with BatchNorm and ReLU (in Figure 3.1, k = 3, giving a 3×3 kernel)

3. Conv layer, kernel size 1×1, with BatchNorm

The shortcut path differs between the two block types: in the convolution block it has one 1×1 conv layer with BatchNorm, while the identity block keeps the original input. The results of the two paths are then added together and passed through the ReLU activation function.
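A sketch of such a convolution block in PyTorch, following the three-layer main path and the 1×1 shortcut described above; the channel sizes in the example are typical ResNet values, not mandated by the text.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """A ResNet 'convolution block': 1x1 -> kxk -> 1x1 main path with
    BatchNorm/ReLU, plus a 1x1 conv + BatchNorm on the shortcut path.
    (An identity block is the same but with shortcut = nn.Identity().)"""
    def __init__(self, in_ch, mid_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        # the results of the two paths are added, then pass through ReLU
        return torch.relu(self.main(x) + self.shortcut(x))

block = ConvBlock(64, 64, 256)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```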


Figure 3.1: ResNet50 architecture [45]

In general, when an image goes through the ResNet, it is resized to 224×224 with three color channels (RGB). The input then enters a 7×7 conv layer with BatchNorm and ReLU, followed by a 3×3 MaxPool. The output of the MaxPool layer continues through a combination of convolution and identity blocks. Finally, the model applies global average pooling and feeds the result to the softmax function. For each type of ResNet, the number in its name is the number of layers; for ResNet50 we calculate 1 + 4×3 + 12×3 + 1 = 50, because it contains 4 convolution blocks and 12 identity blocks, each with 3 layers (the extra layer in the shortcut path is not counted). Similarly, ResNet101 has 4 convolution blocks and 29 identity blocks, which makes 101 layers.

3.1.2 Regional Proposal Network

The Region Proposal Network (RPN) is a sub-model whose input is the feature map from the backbone CNN. The RPN's objective is to generate box proposals indicating whether a region contains an object. In addition, the model refines the boxes in its regressor to move them closer to the foreground objects.

We can separate the training process of the RPN into three stages. In the first stage, the model receives the feature maps and the ground-truth boxes. The model uses a sliding window to scan the feature maps. For each position of the window, a set of m×n anchors is generated, all sharing their center point with the sliding window but having m different aspect ratios and n different scales (Figure 3.2 shows m = n = 3). Each anchor A has four parameters ⟨x_a, y_a, w_a, h_a⟩: the center's coordinates (x, y), the width, and the height.

In the next stage, which involves labeling the anchors, the model calculates the Intersection over Union (IoU) between the anchors and the ground-truth boxes. IoU is the overlap ratio of an anchor and a ground-truth box:

IoU = area(Anchor ∩ GroundTruthBox) / area(Anchor ∪ GroundTruthBox)

Typically, the thresholds for background and foreground are 0.3 and 0.7, respectively; the model ignores all anchors labeled −1.
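A small helper showing the IoU computation and the usual threshold-based labeling; the (x1, y1, x2, y2) box format is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# label an anchor with the usual thresholds
score = iou((0, 0, 10, 10), (5, 5, 15, 15))  # 25 / 175 ~= 0.143 -> background (< 0.3)
```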


Figure 3.2: Anchor Box in RPN [46]

After that, the model creates a vector of labels from the anchor boxes; the relationship of an anchor with the ground-truth boxes is represented in the labels of this vector. The final phase of the process is proposal generation. For each anchor, we need to find its bounding box and class (foreground or not). The regression task produces the four attributes of the box, while the classification task predicts the label vector of the box. Next, the model calculates the loss function from these five outputs. Thus, the loss function of the RPN consists of a box regression loss and a label classification loss. For the regression loss, a smooth L1 loss between the predicted and ground-truth box offsets is used.

For the classification loss, we can use the Binary Cross-Entropy function with the predicted label and the anchor label, as in the formula below:

L_rpn_cls = BCE(label_predict, label_groundtruth)    (3.2)

From the two steps above, the RPN's total loss is computed as the sum of the regression and classification loss values over every anchor box, divided by a normalization term.

3.1.3 Non-Maximum Suppression

After we have the box proposals from the RPN model, the next task is to filter out the excessive boxes. There are generally many bounding proposals around each object, but we only need one box to represent an object. Non-Maximum Suppression (NMS) is a powerful technique for this task. Faster R-CNN uses this technique in both phases, training and prediction. This filtering helps the model reduce the number of unnecessary boxes processed and improves the computation cost, especially in the prediction stage.


Figure 3.3: Result of a Non-Maximum Suppression application[47]

Figure 3.3 presents an example of the output of an NMS application. There are three boxes in the input image, all bounding a single object, the car. After applying the NMS technique, only one bounding box for the car remains, the one with the highest score. Reducing the number of boxes also reduces the model's computation cost in its subsequent tasks. Algorithm 1 briefly illustrates the objective of the NMS technique. Its inputs are three parameters B, S, and t_nms: the list of proposal boxes, the boxes' scores, and the IoU threshold value, respectively. The algorithm returns the filtered list B_nms of proposal boxes and S, the original score list, which after filtering contains only the scores of the boxes in B_nms.

Algorithm 1: Non-Maximum Suppression [47]

Input: B = [b1, b2, ..., bn], S = [s1, s2, ..., sn], t_nms
Output: B_nms, S
B, B_nms: list of initial boxes, filtered list of boxes
S: list containing scores
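The body of the listing did not survive extraction; the following Python equivalent of greedy NMS matches the stated inputs and outputs. It reuses the iou() helper from the sketch in the previous section.

```python
def nms(boxes, scores, t_nms):
    """Greedy Non-Maximum Suppression: repeatedly keep the highest-scoring
    box and drop every remaining box whose IoU with it exceeds t_nms."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= t_nms]
    return [boxes[i] for i in keep], [scores[i] for i in keep]

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, [0.9, 0.8, 0.7], t_nms=0.5)[1])  # [0.9, 0.7]: the overlapping box is dropped
```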


Figure 3.4: Region of Interest Pooling [48]. The upper figure shows the first quantization performed on a floating-number RoI, which loses the data in the blue part; the lower figure shows the second quantization, performed when extracting a 3×3 feature map from a 4×6 one, which also loses the data in the white part.

3.1.4 Region of Interest Pooling (RoI Pooling)

Region of Interest Pooling (RoI Pooling) is a traditional approach for extracting a small feature map (e.g., 3×3) from each RoI, introduced in the Fast R-CNN paper. In RoI Pooling, quantization is performed: the technique converts a floating-number RoI into a quantized RoI with discrete (integer) sizes and coordinates. Firstly, the generated proposals are based on the original input image size, so we need to rescale them to the feature map size, which is where quantization is applied. Figure 3.4 shows an example of this task with an input image of size 512×512×3 and a feature map of size 16×16×512. The technique rescales a 145×200 proposal to the feature map scale by dividing the height and width by 32 (512/16) and rounding the float results down to integer values. This loses the data in the blue part of the first picture in Figure 3.4. After getting the RoI on the feature map, we need to divide it into small bins according to the fixed input size of the layer following RoI Pooling. In the figure, the target is to pool the RoI into a size of 3×3; however, only the width of 6 is divisible by 3, while the height of 4 is not. Therefore, the pooling method applies quantization again, losing the data in the white square in the second image of Figure 3.4. To address this, researchers found a new, improved method called Region of Interest Align (RoI Align), which handles this disadvantage of the RoI Pooling technique. We introduce this method with the extension of Faster R-CNN, Mask R-CNN.
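A tiny sketch of the first quantization step using the numbers from the example above (stride 32); the proposal coordinates below are chosen to reproduce the 145×200 box and are otherwise illustrative.

```python
import math

def quantize_roi(roi, stride=32):
    """First quantization in RoI Pooling: map an image-space RoI onto the
    feature map by dividing by the stride and rounding down (the data in
    the fractional remainder is lost)."""
    return tuple(math.floor(v / stride) for v in roi)

# a 145x200 proposal on a 512x512 image with a 16x16 feature map (stride 32)
print(quantize_roi((296, 192, 441, 392)))  # (9, 6, 13, 12): fractions are cut off
```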


Figure 3.5: Mask R-CNN architecture[51]

Figure 3.6: Binary mask sample in diagram recognition

a new predictor is connected to the structure, responsible for object mask prediction. Additionally, an improved technique, RoI Align, replaces the old RoI Pooling in the Mask R-CNN structure. We describe these new components in this section.

3.2.1 Object Mask (Binary Mask)

The object mask is a new output in Mask R-CNN which contains more information than a regular object bounding box. As we already know, the box output contains the coordinates of a rectangle around an object, but the new object mask stores the pixel-by-pixel location of the object. It is a 28×28 matrix whose pixel values are true and false, or one and zero. From a pixel's value, the model understands that if the pixel is true (one), the pixel is part of the object; otherwise, it belongs to the background of the image or to other objects. The model applies the binary mask to the object, then resizes its pixel values to fit the target using either the bilinear or the nearest-neighbor algorithm. Thus, the mask output that Mask R-CNN produces makes it easier for the computer to learn the object's characteristics. The binary mask also addresses the instance segmentation challenge, which needs to identify each object of the same label separately. Figure 3.6 shows an example of binary masks, with a grey mask for the terminal vertex and a yellow mask for text. In training, the model computes the prediction loss between the binary mask and the ground-truth mask using the binary cross-entropy loss function:

L_mask = BCE(mask_predict, mask_groundtruth)    (3.3)


Figure 3.7: Feature Pyramid Network[53]

3.2.2 Feature Pyramid Network

Feature pyramids [52] are a component of recognition systems for detecting objects at different scales. A Feature Pyramid Network consists of two main components:

• Bottom-up pathway: the bottom-up pathway is the feed-forward computation of the backbone CNN, with one pyramid level defined for each stage. The output of the last layer of each stage is used as the reference set of feature maps for enriching the top-down pathway through lateral connections. Normally, each layer's size is 1/2 that of the previous layer.

• Top-down pathway and lateral connections: the feature map at the highest layer of the bottom-up pathway is carried over to the corresponding level of the top-down pathway. For every layer below it, a lateral connection is constructed as follows (see the sketch after this list): the upper layer is upsampled by a factor of 2; the corresponding feature map from the bottom-up pathway goes through a 1×1 convolution to reduce its dimension; finally, the two feature maps are merged by element-wise addition. After the final layer is computed, or once the algorithm stops at a certain level, all merged layers go through a 3×3 convolution layer to generate the final feature maps.

3.2.3 Region of Interest Align (RoI Align)

Mask R-CNN introduces a new algorithm called Region of Interest Align (RoI Align), replacing the RoI Pooling of Faster R-CNN. RoI Align deals with the round-off errors that RoI Pooling cannot. The differences are the bilinear interpolation used to calculate the pixel value at a floating-point coordinate, and the absence of quantization.

The steps are processed as follows:

• The RoI Align layer divides the current RoI into small grids, based on the size of the feature map that needs to be extracted. Each small grid has K sampling points. The Mask R-CNN paper [49] used four sampling points, but the authors claimed that the results are not sensitive to the exact sampling locations or the number of points. The points divide the width and height of the grid equally.


(a) Four sampling points (b) Bilinear interpolation of one point

Figure 3.8: Region of Interest Align [54]

• The approach applies bilinear interpolation to each sampling point using the 4 center points of nearby pixels in the feature map, and repeats this process for the other grids. Figure 3.8 shows an example of computing the value of one sampling point from four points:

1. We calculate the coordinates of the sampling point (X, Y) from the top-left point of the grid, whose coordinates are X_grid = 9.25 and Y_grid = 6:

X = X_grid + (width_grid / 2) × 1 = 9.94

Y = Y_grid + (height_grid / 2) × 1 = 6.5

2. We compute the value of the point from the four neighboring cells with the bilinear interpolation formula, obtaining P = 0.144. Similarly, we can compute the other points and the other grids (see the sketch after this list).

• After bilinear interpolation, we perform max pooling over these K sampling points to output each grid's value. Finally, we form the feature map from the results.
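A plain-Python sketch of the bilinear sampling step, treating integer indices as cell centers; the example feature map is invented for illustration.

```python
import math

def bilinear_sample(feature, x, y):
    """Value at floating-point (x, y) from the four nearest cell centers,
    weighted by distance (no quantization) -- the core of RoI Align."""
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = x0 + 1, y0 + 1
    dx, dy = x - x0, y - y0
    return (feature[y0][x0] * (1 - dx) * (1 - dy) +
            feature[y0][x1] * dx * (1 - dy) +
            feature[y1][x0] * (1 - dx) * dy +
            feature[y1][x1] * dx * dy)

fmap = [[0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6],
        [0.7, 0.8, 0.9]]
print(bilinear_sample(fmap, 0.5, 0.5))  # 0.3: average of the four top-left cells
```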

3.3 Handwriting Text Recognition

3.3.1 Long Short Term Memory (LSTM)

Long Short-Term Memory networks, or LSTMs [55], are a special kind of RNN, introduced by Hochreiter and Schmidhuber, which can handle the problem of long-term dependencies. The problem of long-term dependencies concerns the gap between the current task and the relevant information related to it: if the gap becomes too large, a normal RNN cannot learn to connect the past information with its current task. This is where LSTMs come in. The LSTM's main characteristic is insensitivity to gap length, which makes it suitable for sequence learning problems that have time lags. Figure 3.9 shows the structure of an LSTM cell.

In a series of LSTM cells, the cell state runs across the entire chain and lets information flow inside it. The information running in the cell state passes through three gates, which can add or remove information from the cell state. The first gate (the first sigmoid function) is called the "forget gate"; it decides which information should be kept or removed from the previous cell state. The inputs of this gate are the previous hidden state (h_{t−1}) and the current input (x_t), while the output value of this layer varies


Figure 3.9: Long Short Term Memory [56]

between 0 and 1, where 0 means "completely remove the information" and 1 means "keep all the information". Then, at the next gate, called the "input gate", the information goes through a tanh function to create a candidate value and through a sigmoid function to decide which values should be added to the cell state or forgotten. Finally, the last sigmoid, the "output gate", highlights which information should go to the next hidden state (h_t). The LSTM remains a candidate for our HTR system, able to replace the GRU layer depending on the situation. We can see the whole process in the following:

f_t = σ(U_f h_{t−1} + W_f x_t + b_f)

i_t = σ(U_i h_{t−1} + W_i x_t + b_i)

o_t = σ(U_o h_{t−1} + W_o x_t + b_o)

ĉ_t = tanh(U_c h_{t−1} + W_c x_t + b_c)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t

h_t = o_t ⊙ tanh(c_t)

3.3.2 Gated Recurrent Unit (GRU)

Another popular RNN layer is the Gated Recurrent Unit (GRU), developed to handle the vanishing and exploding gradient problems like the LSTM. The structure of a GRU is not as complex as an LSTM's, as we can see in Figure 3.10. A GRU consists of two gates: the reset gate and the update gate. The task of the reset gate r_t is to determine how much information from the past should be ignored, like the forget gate of the LSTM. On the other hand, the update gate selects which information to collect from the new data and from the past state to pass into the next state. We can see that, unlike the LSTM, the GRU does not have a cell state and has only two sigmoid gates, one of which, the update gate, performs two tasks. This explains why the GRU is faster


Figure 3.10: Gated Recurrent Unit [57]

Figure 3.11: Bidirectional RNN [58]

and lighter than the LSTM. The process is summarized in the formulas below:

r_t = σ(U_r h_{t−1} + W_r x_t + b_r)

z_t = σ(U_z h_{t−1} + W_z x_t + b_z)

ĥ_t = tanh(U_h (r_t ⊙ h_{t−1}) + W_h x_t + b_h)

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ ĥ_t
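A NumPy sketch of one GRU time-step implementing the four formulas above; the weight shapes and random initialization are illustrative only.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(h_prev, x, p):
    """One GRU time-step following the formulas above; p holds the weight
    matrices U_*, W_* and biases b_* (shapes: U (d,d), W (d,m), b (d,))."""
    r = sigmoid(p["Ur"] @ h_prev + p["Wr"] @ x + p["br"])  # reset gate
    z = sigmoid(p["Uz"] @ h_prev + p["Wz"] @ x + p["bz"])  # update gate
    h_hat = np.tanh(p["Uh"] @ (r * h_prev) + p["Wh"] @ x + p["bh"])
    return (1 - z) * h_prev + z * h_hat                    # new hidden state

d, m = 4, 3
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(d, d) if k[0] == "U" else (d, m) if k[0] == "W" else d)
     for k in ("Ur", "Wr", "br", "Uz", "Wz", "bz", "Uh", "Wh", "bh")}
print(gru_step(np.zeros(d), rng.normal(size=m), p).shape)  # (4,)
```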


Figure 3.12: Horizontal position of characters [60]

3.3.3 Bidirectional RNN (BRNN)

A Bidirectional RNN (BRNN) uses information from both the past and the future to predict the current output. However, this feature does not always work efficiently, because it needs the whole sequence fed into the network. For example, in speech recognition, if a person speaks continuously, the sentence is not complete, and the network cannot process from the end of the sequence until the speaker stops and finishes the sentence. In our problem, the inputs are always complete sequences (cropped text boxes), so BRNN is a suitable approach to add to our system. Both LSTM and GRU can apply this approach to form bidirectional versions, called BLSTM and BGRU.

3.3.4 Connectionist Temporal Classification (CTC)

Connectionist Temporal Classification (CTC) is a popular approach seen in many sequence processing models. Given the character-score matrix produced by the HTR model for an input sequence, two tasks are required to process the matrix: calculating the loss value during training, and decoding the matrix to get the characters of the sequence at inference. A simple alternative is to specify the horizontal position of each character in the dataset's images, as in Figure 3.12, so that the model can output scores for these locations. However, annotating the location of each character in the images takes a lot of time. On the other hand, decoding the score matrix yields many duplicate characters in the sequence, because characters can be large in the images. This forces us to remove the redundant duplicates, but a problem arises when the original word itself contains duplicate characters, such as "too" or "meet", which the model would output as "to" or "met" after deleting duplicates. The CTC operation was introduced to solve these problems.

Firstly, we only need to give the CTC loss function the score matrix and the ground-truth text. Although we do not need to specify the horizontal locations of letters, we still have to encode the ground-truth text. To handle the case where an original word has duplicate letters, a pseudo character called "blank" is introduced (this blank is not a white space). We can denote this blank by any symbol, and in this project we use ¤ to present this unique character. For example, for the word "meet", we insert this blank character between "ee", so we receive "me¤et". We can insert as many blanks as we like into the ground-truth text, like "¤¤¤me¤¤e¤t¤" or "¤m¤e¤e¤t¤", which CTC still considers the exact word "meet", but the position between duplicates, like "ee", always needs at least one blank. The CTC decoder removes these blanks to get the text in inference.


Figure 3.13: Character-score matrix [60]; the black lines present the paths for the character "a" ("aa", "a-" and "-a"), while the dashed line presents the empty text "" ("--")

With this encoding process, even if the model predicts the result as "¤mmm¤ee¤e¤¤tt¤" because of the large size of characters in the image, the decoded result is still "meet".
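The decoding-side collapse can be sketched in a few lines of Python (the function name ctc_collapse is ours): merge runs of identical characters first, then drop the blanks.

from itertools import groupby

def ctc_collapse(path, blank="¤"):
    # 1) merge runs of identical characters, 2) remove the blank symbol
    return "".join(ch for ch, _ in groupby(path) if ch != blank)

print(ctc_collapse("¤mmm¤ee¤e¤¤tt¤"))  # -> "meet"
print(ctc_collapse("to¤o"))            # -> "too": the blank protects the duplicate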

Next is the process of building the CTC loss function. The matrix output from the HTR network includes the score of each letter class for each time-step. We will look at an example by Harald Scheidl [60] in figure 3.13. In this example, the letter classes include "a", "b", and the blank character, denoted as "-". For each time-step in the score matrix, the sum of all the values equals 1. To calculate the loss value, we need to find the sum of the probabilities of all alignments, or paths, that present the ground-truth text. Assume the ground-truth text is "a", whose alignments are "aa", "a-" and "-a". The probabilities of these alignments are 0.4 · 0.4 = 0.16, 0.4 · 0.6 = 0.24 and 0.6 · 0.4 = 0.24, respectively, so the total probability of predicting the ground-truth sequence equals 0.64. In training, we want this value to reach its maximum, ideally 1. We can convert this maximization problem into a minimization one: minimizing the loss function, the negative sum of log likelihoods as in formula 3.7, where p(path_gt) is the probability of each path presenting the ground-truth text, read from the score matrix.

L_CTC = −∑ ln(p(path_gt))    (3.7)
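We can check these numbers directly; the sketch below assumes the two-time-step matrix of figure 3.13 assigns p("a") = 0.4 and p("-") = 0.6 at both steps, with p("b") = 0.

import math

# rows: time-steps t1, t2; columns: ["a", "b", "-"]
M = [[0.4, 0.0, 0.6],
     [0.4, 0.0, 0.6]]

# all alignments that collapse to the ground truth "a": "aa", "a-", "-a"
p_gt = M[0][0] * M[1][0] + M[0][0] * M[1][2] + M[0][2] * M[1][0]
print(p_gt)             # 0.16 + 0.24 + 0.24 = 0.64
print(-math.log(p_gt))  # CTC loss of this sample, formula (3.7)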

Finally, we move to the decoding phase of CTC. The main objective of this phase is to get the correct character sequence from the score matrix in the inference process. The approach we discuss in this project is the original beam search decoding. Algorithm 2 presents the workflow of the method. The inputs of the algorithm are the score matrix M of the HTR network and the beamwidth, which is the number of the best beams to keep after filtering. The algorithm scans all time-steps of the output matrix. At each time-step, it filters out the low-score beams and keeps only the beamwidth best beams. The score computation of a beam (like "a") at time-step t consists of calculating the probability of the paths ending in a non-blank (like "aa"), denoted as P_nb(beam, t), and of the paths ending in a blank (like "a-"), denoted as P_b(beam, t). The total probability of the beam, P_total(beam, t), is the sum of these two probabilities.

Additionally, when we add a new character into the beam, two cases can occur: copy and extend. The copy case is when the last letter of the beam is the same as the new


Algorithm 2: Beam Search Decoding [61]

Input: M, beamwidth

Output: bestBeam

M: the character-score matrix

beamwidth: the number of path candidates (beams) to keep

character, or the new character is a blank; otherwise, it is the extend case. We look at an example of the beam "a" at time-step t. In the next time-step, the score of the beam "a" will increase by the sum of the probabilities of the copy paths "a-" and "aa". In the extend case, we form a new beam: for a new, different letter like "b", we simply get "ab", while if the new letter is "a", the same as the last letter of the beam, we can only use the blank ending probability P_b(beam, t − 1) of the previous time-step. After we get the scores for the beams, the filterSearch at line 6 filters the previous beams according to P_total(beam, t − 1). At the end of the algorithm, the best beam in the final list is returned to present the digital text of the image.
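Since the body of Algorithm 2 is not reproduced here, the following Python sketch shows one standard way to realize this beam search; it assumes the blank is the last column of M, and all names are ours rather than from the original listing.

import numpy as np
from collections import defaultdict

def ctc_beam_search(M, alphabet, beamwidth=10):
    # M: (T, C) score matrix; column C-1 is the blank. alphabet: the C-1 letters.
    T, C = M.shape
    blank = C - 1
    # Pb / Pnb: probability mass of a beam over paths ending in blank / in a letter
    Pb, Pnb = defaultdict(float), defaultdict(float)
    Pb[""] = 1.0
    beams = [""]
    for t in range(T):
        nPb, nPnb = defaultdict(float), defaultdict(float)
        for beam in beams:
            total = Pb[beam] + Pnb[beam]
            # copy: a blank never changes the decoded text
            nPb[beam] += total * M[t, blank]
            # copy: repeating the last letter collapses into the same text
            if beam:
                nPnb[beam] += Pnb[beam] * M[t, alphabet.index(beam[-1])]
            # extend: append a non-blank letter to the beam
            for c in range(C - 1):
                ch = alphabet[c]
                if beam and ch == beam[-1]:
                    # "a" -> "aa" only via paths that end in a blank ("a¤" + "a")
                    nPnb[beam + ch] += Pb[beam] * M[t, c]
                else:
                    nPnb[beam + ch] += total * M[t, c]
        # filterSearch: keep only the beamwidth most probable beams
        candidates = set(nPb) | set(nPnb)
        beams = sorted(candidates, key=lambda b: nPb[b] + nPnb[b], reverse=True)[:beamwidth]
        Pb, Pnb = nPb, nPnb
    return beams[0]  # bestBeam

# On the matrix of figure 3.13 this returns "a", whose total probability is 0.64:
M = np.array([[0.4, 0.0, 0.6],
              [0.4, 0.0, 0.6]])
print(ctc_beam_search(M, "ab"))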


4.1.1 Preparing diagram dataset

The first and foremost phase we need to focus on is building a suitable dataset. As specified in chapter 2, we could not find any offline handwritten diagram dataset on the internet, which requires us to preprocess hand-drawn diagram images from online datasets. There are many usable options for the source of pictures, which we can acquire by drawing ourselves or from existing datasets. However, self-drawing takes time, so we concentrate on processing images from published datasets. From our survey, we can access the online diagram datasets below:

• Digital Ink Diagram Data (DiDi) [26] from Google Research and ETH Zurich, containing 58655 flowchart diagrams collected from 364 participants.

• FC on-line dataset [62] with 672 flowchart diagrams drawn by 24 users from Czech Technical University, stored in InkML format.

There are also other datasets, but we only choose these two. The reasons are that the DiDi dataset has an enormous number of images, while the FC dataset provides patterns that DiDi lacks. On the other hand, DiDi is the first dataset we could access, and we can reuse the images

we labeled in the work of our teammate Thinh (Thesis2021) [25] to reduce the labeling time.

In the DiDi dataset, the main components consist of arrows, symbols, and texts. The symbols are rectangle, diamond, oval, parallelogram, and octagon. Translated into flowchart terms, these vertices are process, decision, terminal, and input/output, while the octagon is not a shape used in a flowchart, so we consider it "other". The arrows are mostly straight one-direction arrows, while the location of a text can be inside a vertex, on the edge of an arrow, or stand-alone. Table 4.1 shows the details of the DiDi dataset.

Next, we introduce the FC dataset. From the description of the DiDi dataset above, we already know that it has neither the sub-function shape nor the connector symbol. The FC dataset helps us enrich our model with the connector. It also contains more cursive arrows than the DiDi arrows, which we mention in detail in chapter 6. For the last symbol, sub-function, we decided to draw new images to add to our dataset.

Moving to the input processing phase, we need to normalize the training images. Images in the DiDi dataset contain only two colors, black for the diagram and white for the background, as in figure 4.1, so we exclude them from this conversion. However, the FC dataset images contain very thin edges, which makes the next step, labelling, harder, while our self-drawn images contain diagrams drawn with a blue ballpoint pen, which we need to convert to binary images. Additionally, our


Table 4.1: Statistics of DiDi images [26]

Figure 4.1: DiDi sample

model only supports images of size 640 × 930 or smaller. Therefore, our pipeline includes grayscale conversion, resizing the image, binarization using adaptive thresholding with a kernel window of size 5 and constant C = 7, and an erosion operation with an all-ones 3 × 3 matrix. All the operations in the pipeline come from the OpenCV library. Figures 4.2 and 4.3 are the results of preprocessing an FC image and one of our self-drawn images.
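A minimal OpenCV sketch of this pipeline is given below; the window size 5, the constant C = 7, and the 3 × 3 erosion kernel come from the text above, while the mean adaptive method and the aspect-preserving resize strategy are our assumptions.

import cv2
import numpy as np

def preprocess(path, max_w=640, max_h=930):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)          # grayscale conversion
    # shrink so the result fits in 640x930, keeping the aspect ratio
    scale = min(max_w / img.shape[1], max_h / img.shape[0], 1.0)
    img = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    # binarization: adaptive threshold, window size 5, constant C = 7
    img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                cv2.THRESH_BINARY, 5, 7)
    # erosion with an all-ones 3x3 kernel thickens the thin dark strokes
    img = cv2.erode(img, np.ones((3, 3), np.uint8))
    return img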

With the processed images from the last step, we can start to label them for training our offline recognition model. We use the labelme tool [27] for this step, as mentioned in chapter 2. The labelme tool provides a graphical user interface and multiple types of labeling. We can label an object in the image as a polygon, which stores the coordinates of the points forming the shape, or as a rectangle, which saves only the top-left and bottom-right points. The polygon label is helpful in the segmentation task, while the rectangle is for training bounding box prediction. We can also label lines in images by using the line style for straight lines or the line strip style for curved lines. The tool produces one JSON label file for each image, avoiding confusion when further processing the results and the images. The format of the labelme output is as follows, with a small reading example after the list:

• version: The current version of the program; we use version 4.5.7.

• flags: Image-level flags, normally used to filter images presenting a single class. Empty unless the user specifies them by command-line arguments. This attribute is not required for our labeling phase.

• imagePath: The directory of the image. Generally, this attribute stores only the name of the image because the tool generates the label file at the same location as the image.

• imageData: An encoded string of the image data.

• imageHeight: The image's height.

• imageWidth: The image's width.

• shapes: A list storing each label of the image. Each element's attributes are:

  – label: The class name of the labeled object.
  – points: The list of [x, y] coordinates forming the shape.
  – group_id: An optional identifier used to group related shapes.
  – shape_type: The labeling style, such as "polygon", "rectangle", "line" or "linestrip".
  – flags: Shape-level flags, empty by default.
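As a small illustration, reading such a file in Python might look like the sketch below; the file name is hypothetical and the field names follow the list above.

import json

with open("diagram_001.json") as f:  # hypothetical labelme output file
    annotation = json.load(f)

print(annotation["imageWidth"], annotation["imageHeight"])
for shape in annotation["shapes"]:
    # e.g. label="rectangle", shape_type="polygon", points=[[x1, y1], ...]
    print(shape["label"], shape["shape_type"], shape["points"])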


References

[15] Google Play Store. Aipoly Vision. https://play.google.com/store/apps/details?id=com.aipoly.vision&hl=en&gl=US, April 17, 2017.

[16] Google Play Store. Vivino: Buy the right wine. https://play.google.com/store/apps/details?id=vivino.web.app&hl=en&gl=US, May 31, 2021.

[17] Lucid Software Inc. Lucidchart. https://www.lucidchart.com/pages/product.

[18] JGraph Ltd. Draw.io. https://github.com/jgraph/drawio. Last accessed 15 November 2021.

[22] Y. Qi, M. Szummer, and T. P. Minka. Diagram structure recognition by Bayesian conditional random fields. Conference on Computer Vision and Pattern Recognition (CVPR), pages 191–196, 2005.

[23] Ahmad Montaser Awal, Guihuan Feng, Harold Mouchère, and Christian Viard-Gaudin. First experiments on a new online handwritten flowchart database. Proceedings of SPIE - The International Society for Optical Engineering, 7874:1–10, 2011.

[24] Ciprian Tomoiaga, Paul Feng, Mathieu Salzmann, and Patrick Jayet. Field typing for improved recognition on heterogeneous handwritten forms. International Conference on Document Analysis and Recognition, 2019.

[26] Philippe Gervais, Thomas Deselaers, Emre Aksan, and Otmar Hilliges. The DIDI dataset: Digital ink diagram data. ArXiv, abs/2002.09303, 2020.

[28] Manoj Sonkusare and Narendra Sahu. A survey on handwritten character recognition (HCR) techniques for English alphabets. Advances in Vision Computing: An International Journal, pages 1–11, 2016.

[31] Alessandro Vinciarelli and Juergen Luettin. A new normalization technique for cursive handwritten words. Pattern Recognition Letters, 22(9):1043–1050, 2001.

[34] B. Moysset and R. O. Messina. Are 2D-LSTM really dead for offline text recognition? International Journal on Document Analysis and Recognition (IJDAR), pages 1–16, 2019.

[39] Urs-Viktor Marti and Horst Bunke. The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5:39–46, 2002.

[41] Andreas Fischer, Emanuel Indermühle, Horst Bunke, Gabriel Viehhauser, and Michael Stolz. Ground truth creation for handwriting recognition in historical documents. ACM International Conference Proceeding Series, pages 3–10, 2010.

[43] Kuo-Nan Chen, Chin-Hao Chen, and Chin-Chen Chang. Efficient illumination compensation techniques for text images. Digital Signal Processing, 22(5):726–733, 2012.

[44] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[49] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. International Conference on Computer Vision, 2017.

[50] R. Girshick. Fast R-CNN. International Conference on Computer Vision, 2015.

[52] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. arXiv:1612.03144, 2017.

[53] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.

[59] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
