
Guided Anchoring Cascade R-CNN: An Intensive Improvement of R-CNN in Vietnamese Document Detection

Hai Le
Vietnam National University
University of Information Technology
Ho Chi Minh, Viet Nam
20520481@gm.uit.edu.vn

Truong Nguyen
Vietnam National University
University of Information Technology
Ho Chi Minh, Viet Nam
20522087@gm.uit.edu.vn

Vy Le
Ho Chi Minh, Viet Nam
20520355@gm.uit.edu.vn

Thuan Trong Nguyen
Vietnam National University
University of Information Technology
Ho Chi Minh, Viet Nam
18521471@gm.uit.edu.vn

Nguyen D Vo
Ho Chi Minh, Viet Nam
nguyenvd@uit.edu.vn

Khang Nguyen
Ho Chi Minh, Viet Nam
khangnttm@uit.edu.vn

Abstract: As the world develops, digital documents are gradually replacing paper documents. The need to extract information from digital documents is therefore increasing and has become one of the main interests in computer vision, particularly in reading comprehension of document images. Detecting objects in document images (figures, tables, formulas) is a prerequisite for analyzing and extracting information from documents. Previous studies have mostly focused on English documents. In this study, we experiment on UIT-DODV, a Vietnamese document image dataset with four classes: Table, Figure, Caption, and Formula. We test common state-of-the-art object detection models, namely Double-Head R-CNN, Libra R-CNN, and Guided Anchoring, and obtain the highest result with Guided Anchoring at 73.6% mAP. We assume that high-quality anchor boxes are key to the success of anchor-based object detection models, so we adopt Guided Anchoring in our research. We further attempt to raise the quality of the predicted bounding boxes with the Cascade R-CNN architecture, whose multi-stage scheme filters out as many ambiguous bounding boxes as possible. Based on these initial evaluation results, we propose an object detection model for Vietnamese document images that combines Cascade R-CNN and Guided Anchoring. The proposed model achieves up to 76.6% mAP on the UIT-DODV dataset, 2.1 percentage points higher than the baseline model.

Index Terms: Document Object Detection, Vietnamese Document Images, Cascade R-CNN, Guided Anchoring

I. INTRODUCTION

Document digitization has been taking place in many organizations and businesses since the rise of Industry 4.0. Traditional documents such as papers, books, and invoices are gradually being transformed into and replaced by digital documents (PDF, Word, Excel) stored on cloud computing services for convenient access, searching, and archiving. With such a volume of documents, document search becomes more difficult than ever, so a good model for identifying the elements in document images is necessary. We treat this as an object detection problem. Document Object Detection (DOD) [1] [8] [16] aims to automatically detect the important elements (Caption, Table, Figure, Formula) (Fig. 1) and the structure of a document page. Current document detection models [13] [20] [11] mostly target common languages such as English and Mandarin Chinese. Vietnamese documents [4] [9], however, pose many challenges because of their different presentation styles.

In this research, we focus on improving the performance of document object detection on Vietnamese document images. Specifically, we experiment with three state-of-the-art methods, Double-Head R-CNN [19], Libra R-CNN [10], and Guided Anchoring [17], on the UIT-DODV dataset [4], which is the first Vietnamese dataset annotated with Caption, Table, Figure, and Formula objects. The main feature of UIT-DODV is that its documents are in Vietnamese, which brings many new challenges. For example, the way semantic objects are presented makes feature extraction difficult, and formulas appear not only in standard mathematical form but also in non-mathematical forms. Among the common object detection models evaluated on UIT-DODV, Guided Anchoring gives the best initial result at 73.6% mAP. Based on an analysis of these models' performance, we propose an object detection model based on Cascade R-CNN and Guided Anchoring. Our proposal achieves up to 76.6% mAP, which is 2.1 percentage points higher than the previously reported baseline.

The rest of the paper is organized as follows. Section II reviews related work. Section III introduces the methods used in our experiments. Section IV presents the experiments and analyzes the results. Section V concludes the paper and addresses issues for future research.

Fig. 1. Document object detection. (a) Input: the document image. (b) Output: the likely object locations in the image. The green (table), blue (figure), red (caption), and yellow (formula) bounding boxes indicate detected objects.

II. RELATED WORK

A. One-stage Object Detection

Duan et al. [5] proposed CenterNet, which recasts object detection as a keypoint estimation problem and then infers the object size and computes the bounding box from the detected keypoints. FCOS [15] is a fully convolutional one-stage object detector that solves object detection as per-pixel classification. FCOS uses multi-level prediction to improve recall and resolve overlapping bounding boxes, and adds a “centerness” branch that suppresses low-quality bounding boxes far from the object center, improving overall performance. YOLOv4 [18] is a series of speed improvements over YOLOv3 [14], combining a CSPNet architecture with the Darknet-53 backbone (as in YOLOv3).

B. Two-stage Object Detection

Faster R-CNN [12] replaces Selective Search with a Region Proposal Network (RPN) to extract regions of the image that potentially contain objects. It then proceeds similarly to Fast R-CNN but is much faster and is designed as an end-to-end trainable network. FPN [6], proposed by Lin et al., uses a top-down architecture combined with lateral connections so that the network fully leverages high-level semantic feature maps at all scales. As a result, FPN has brought remarkable improvements to the research and application of pyramid features. Cai and Vasconcelos [2] proposed Cascade R-CNN as a high-quality object detector in which different heads are used at different stages. Each head is designed for a specific IoU threshold, and the thresholds increase from one stage to the next. Cascade R-CNN reduces overfitting during training and reduces the quality mismatch at inference time.
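The effect of per-stage IoU thresholds can be illustrated with a minimal sketch. The thresholds 0.5, 0.6, and 0.7 are the values commonly used with Cascade R-CNN and are an assumption here, since this paper does not state them. The helper names `iou` and `stage_labels` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def stage_labels(proposal: Box, gt: Box,
                 thresholds: Tuple[float, ...] = (0.5, 0.6, 0.7)) -> List[bool]:
    """For each cascade stage, report whether the proposal counts as a positive.

    The thresholds are assumed defaults, not values taken from this paper.
    """
    overlap = iou(proposal, gt)
    return [overlap >= t for t in thresholds]


gt_box = (0.0, 0.0, 100.0, 100.0)
loose_box = (0.0, 0.0, 100.0, 65.0)  # IoU with gt_box is 0.65
tight_box = (0.0, 0.0, 100.0, 90.0)  # IoU with gt_box is 0.90

print(stage_labels(loose_box, gt_box))
print(stage_labels(tight_box, gt_box))
```

The loosely localized box is a positive for the first two stages only, while the tight box is a positive for all three. In the actual detector each stage also refines the box before passing it on, which this sketch leaves out.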

III. METHOD

A. Previous Methods for Improving R-CNN

1) Double-Head R-CNN: Double-Head R-CNN [19] is built on the Feature Pyramid Network (FPN), as presented in Fig. 2. Previous methods use a single head to extract features from the region of interest (RoI) for both classification and bounding box regression. In contrast, the Double-Head R-CNN architecture is divided into two dedicated heads, one for classification and one for localization.

Fig. 2. Differences between single head and double head: (a) a single fully connected (2-fc) head, (b) a single convolution head, (c) Double-Head, which splits into a fully connected head for classification and a convolution head for localization, and (d) Double-Head-Ext, an extension that adds supervision from the unfocused tasks during training and leverages classification scores from both heads for the final predictions [19].

Wu et al. suggested that the fc-head handles classification remarkably better than the conv-head and, conversely, that the conv-head handles bounding box regression better than the fc-head. A double-head architecture (a combination of conv-head and fc-head) therefore lays the groundwork for strong object detection results, since it can fully exploit the strengths of the fc-head for classification and of the conv-head for localization. Moreover, in the Double-Head-Ext architecture, the fc-head and conv-head support each other in the classification problem to obtain the best result.
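How the two heads can support each other in classification is easier to see in a small sketch. The rule below, s = s_fc + s_conv (1 - s_fc), is the complementary fusion as we recall it from Wu et al. [19]. This paper does not give the formula, so it is an assumption and should be checked against [19]. The function name `fuse_scores` is hypothetical.

```python
def fuse_scores(s_fc: float, s_conv: float) -> float:
    """Fuse the classification scores of the fc-head and the conv-head.

    Assumed rule (not stated in this paper): s = s_fc + s_conv * (1 - s_fc).
    The conv-head score only fills the part of the confidence that the
    fc-head left open, so the result stays in [0, 1].
    """
    if not (0.0 <= s_fc <= 1.0 and 0.0 <= s_conv <= 1.0):
        raise ValueError("scores must be probabilities in [0, 1]")
    return s_fc + s_conv * (1.0 - s_fc)


print(fuse_scores(0.8, 0.5))
print(fuse_scores(0.5, 0.8))
```

Under this rule the fused score is symmetric in the two heads and is never lower than either individual score.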

2) Libra R-CNN: Pang et al. [10] proposed Libra R-CNN, based on FPN Faster R-CNN, which aims to balance the training process. Libra R-CNN has proved highly efficient, and its design can be applied on various backbones for both single-stage and two-stage detectors. As observed in several previous methods, the imbalance that limits object detection performance generally occurs at three levels: sample level, feature level, and objective level.

Fig. 3. Pipeline and heatmap visualization of the balanced feature pyramid [10].

The Libra R-CNN architecture addresses the imbalance at all three levels. At the sample level, Pang et al. designed a probability function to pick a certain number of negative samples from the corresponding candidates. At the feature level, the imbalance occurs because the features obtained at each layer are not equal, which leads to differences in information and features across layers. For that reason, Pang et al. balanced the amount of information from each layer of the pyramid to capture balanced semantic features through the transformations shown in Fig. 3. In addition, the authors applied a non-local module to refine the obtained features. Finally, the Balanced L1 Loss is proposed to balance the effects of the two tasks, classification and localization, and the gradient contributions of the samples.
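To make the objective-level balancing concrete, the sketch below evaluates a Balanced L1 loss for a single regression error. The piecewise formula and the defaults alpha = 0.5 and gamma = 1.5 come from our recollection of Pang et al. [10], not from this paper, so treat them as assumptions. `balanced_l1` is a hypothetical helper name.

```python
import math


def balanced_l1(x: float, alpha: float = 0.5, gamma: float = 1.5) -> float:
    """Balanced L1 loss for one regression error x (assumed form, see lead-in).

    For |x| < 1 the loss has gradient alpha * ln(b * |x| + 1), which boosts the
    contribution of accurate samples (inliers) compared with smooth L1.
    For |x| >= 1 the gradient is capped at gamma, as for outliers in smooth L1.
    """
    # b is fixed by requiring the two gradient branches to meet at |x| = 1:
    # alpha * ln(b + 1) = gamma.
    b = math.exp(gamma / alpha) - 1.0
    d = abs(x)
    if d < 1.0:
        return alpha / b * (b * d + 1.0) * math.log(b * d + 1.0) - alpha * d
    # The constant term keeps the loss continuous at |x| = 1.
    return gamma * d + gamma / b - alpha


for err in (0.0, 0.3, 1.0, 2.0):
    print(err, round(balanced_l1(err), 4))
```

The design choice is that the two branches join smoothly at |x| = 1, so small errors get a larger share of the gradient without outliers dominating the update.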

3) Guided Anchoring: Guided Anchoring [17] is built to improve the region proposal generation process, so it takes the Region Proposal Network (RPN) as its baseline. The design predicts the locations where the centers of objects of interest are likely to exist, as well as the scales and aspect ratios at different locations. In particular, Guided Anchoring dispenses with anchor generation based on default parameters. In other words, an anchor’s scale and aspect ratio can now vary dynamically instead of being fixed as before, which creates diversity during training. Wang et al. also study the influence of high-quality proposals on two-stage detectors.

Fig. 4. Basic framework of Guided Anchoring [17].

In the Guided Anchoring method, Wang et al. applied an anchor generation module containing two branches, one for location prediction and one for anchor shape prediction, as shown in Fig. 4. The Anchor Location Prediction branch predicts anchor locations by producing a probability map that indicates where objects are likely to be in the image. The Anchor Shape Prediction branch takes the (x, y) coordinates of the locations generated by the previous branch and predicts (w, h), the width and height of the anchor, respectively. In this way, the model can generate the anchor shapes that most closely match the ground-truth bounding boxes. However, the authors found empirically that predicting these two values directly is not stable. They therefore apply certain transformations, using a sub-network with a 1 × 1 convolution to generate an appropriate map.
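The two branches can be sketched in a few lines. As far as we recall, Wang et al. [17] map the two predicted values (dw, dh) to a width and height through w = sigma * s * exp(dw) and h = sigma * s * exp(dh), where s is the stride of the feature level and sigma is an empirical scale factor (8 in their experiments). This paper does not state the formula, so the formula, the value of sigma, the location threshold, and the helper names `select_locations` and `decode_shape` are all assumptions.

```python
import math
from typing import List, Tuple


def select_locations(prob_map: List[List[float]],
                     threshold: float) -> List[Tuple[int, int]]:
    """Keep the (x, y) cells whose predicted object-center probability
    exceeds the threshold (location branch, simplified)."""
    return [(x, y)
            for y, row in enumerate(prob_map)
            for x, p in enumerate(row)
            if p > threshold]


def decode_shape(dw: float, dh: float, stride: int,
                 sigma: float = 8.0) -> Tuple[float, float]:
    """Map the raw shape outputs (dw, dh) to an anchor width and height.

    Assumed transformation: w = sigma * stride * exp(dw), and likewise for h.
    The exponential keeps the sizes positive and compresses the wide range
    of possible sizes into a small output range.
    """
    return sigma * stride * math.exp(dw), sigma * stride * math.exp(dh)


prob_map = [[0.1, 0.9],
            [0.6, 0.2]]
print(select_locations(prob_map, threshold=0.5))
print(decode_shape(0.0, 0.0, stride=16))
```

With zero outputs the anchor defaults to sigma times the stride on each side, and positive or negative outputs grow or shrink it from there.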

B. Proposed Method

As mentioned in Section III-A, Double-Head R-CNN, Libra R-CNN, and Guided Anchoring are common architectures added to improve the results of the Faster R-CNN model. Each of them can therefore be regarded as a module that is easily attached to other detectors; in the cases considered here, the authors used Faster R-CNN. In addition, Cascade R-CNN is a multi-stage extension of R-CNN in which the detector stages are sequentially more selective, thereby addressing the limitations of Faster R-CNN described in [2].

Fig. 5. Guided Anchoring Cascade R-CNN architecture.

Thanks to the anchor generation scheme of Guided Anchoring, we can leverage semantic features to guide the anchoring. We also aim to improve the training phase by fully exploiting the potential of Cascade R-CNN [2], which can filter out most poor bounding boxes and reduce overfitting. We therefore design a model that combines the Cascade R-CNN architecture with Guided Anchoring to improve object detection performance on document images. Recognizing the power of multi-level features, we use FPN to capture the valuable information represented in the feature maps at each FPN level. Before FPN, we use ResNeXt-101 as the feature extractor. The architecture is shown in Fig. 5. Guided Anchoring, introduced by Wang et al., reduces the number of anchor boxes generated compared with previous methods, specifically RPN, as shown in Fig. 6, thereby significantly reducing cost and increasing processing speed during training. We then feed these outputs into the Cascade R-CNN network.

Fig. 6. Comparison between RPN proposals (top row) and GA-RPN proposals (bottom row) [17].

IV. EXPERIMENT

A. Experiment Settings

Dataset: We conducted experiments on the UIT-DODV dataset, which includes 1440 images for the training set, 234 images for the validation set, and 720 images for the testing set.

Configuration: We ran the experiments on 2× RTX 2080 Ti GPUs using MMDetection [3].

Evaluation Metric: We aim to compare with the baseline results on the UIT-DODV dataset, so mAP (mean Average Precision) [7] is used for evaluation. The mAP score is the average of AP over all classes. For each class, we calculate AP at different IoU thresholds and take their average to get that class’s AP. We also report AP50 and AP75, which use IoU thresholds of 0.5 and 0.75, respectively.
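The averaging described above can be sketched as follows. The ten thresholds from 0.50 to 0.95 in steps of 0.05 follow the COCO convention [7] and are an assumption, since the paper does not list them. The per-threshold AP values are made-up illustrative numbers, not results from this paper, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict

# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def class_ap(ap_per_threshold: Dict[float, float]) -> float:
    """Average one class's AP over all IoU thresholds."""
    return mean(ap_per_threshold[t] for t in IOU_THRESHOLDS)


def mean_ap(per_class: Dict[str, Dict[float, float]]) -> float:
    """mAP: the average of the per-class APs."""
    return mean(class_ap(aps) for aps in per_class.values())


# Illustrative toy values only: AP falls as the IoU threshold gets stricter.
toy = {
    "Table": {t: round(1.00 - 0.5 * t, 4) for t in IOU_THRESHOLDS},
    "Formula": {t: round(0.80 - 0.5 * t, 4) for t in IOU_THRESHOLDS},
}

print("AP50 (Table):", toy["Table"][0.5])
print("AP75 (Table):", toy["Table"][0.75])
print("AP   (Table):", round(class_ap(toy["Table"]), 4))
print("mAP:", round(mean_ap(toy), 4))
```

AP50 and AP75 are single entries of the per-threshold dictionary, whereas a class's AP and the overall mAP are averages, which is why mAP is usually well below AP50.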

B. Analysis

First, we evaluated three models, Double-Head R-CNN, Libra R-CNN, and Guided Anchoring, on the AP50, AP75, and mAP metrics. Our empirical results are summarized in Table I. Overall, the best results among the common object detection methods came from Guided Anchoring, with AP50, AP75, and mAP of 91.0%, 80.8%, and 73.6%, respectively. In contrast, Double-Head R-CNN achieved the lowest results, with AP50 and AP75 of 88.7% and 78.4% and mAP of 71.0%.

Next, we analyzed the performance on each class. Guided Anchoring yielded the highest accuracy of the three methods on every class of the dataset, with AP on the ‘Table’, ‘Figure’, ‘Caption’, and ‘Formula’ classes reaching 92.7%, 81.6%, 73.3%, and 46.9%, respectively. Meanwhile, Double-Head R-CNN reached only 91.5%, 80.6%, 65.6%, and 46.3%, trailing the other methods, most notably by 7.7 percentage points on ‘Caption’. The performance of Double-Head R-CNN is not necessarily poor. However, comparing the Double-Head R-CNN architecture with Guided Anchoring Cascade R-CNN shows that Guided Anchoring brings many noteworthy improvements. Moreover, features are filtered carefully during training, which ensures the quality of the anchors. The results of Guided Anchoring are therefore convincing. These initial results show that the default Guided Anchoring model, with the same Faster R-CNN baseline as Double-Head and Libra, is notably effective. We therefore propose using Cascade R-CNN alongside Guided Anchoring, as described in Section III-B. The results show that our proposed model reaches an mAP of 76.6%, which is 2.1 percentage points higher than the baseline published by Dieu et al. [4], whose CascadeTabNet with Fused Loss achieved 74.5%.

We then visualize the prediction results of the three methods, Double-Head R-CNN, Libra R-CNN, and Guided Anchoring, in Fig. 7. The ‘Table’ class still has false predictions with Double-Head R-CNN (Fig. 7a) and Libra R-CNN (Fig. 7b), whereas Guided Anchoring (Fig. 7c) handles it well. Moreover, the features of the ‘Figure’ and ‘Caption’ classes caused many difficulties for Double-Head R-CNN. Although Libra R-CNN identified ‘Figure’ and ‘Caption’ objects well, it mispredicted ‘Formula’ objects. Overall, Guided Anchoring combines the strengths of the other two methods and correctly predicts all four classes, as the visualization shows.

In addition, we evaluate the effectiveness of our proposed Guided Anchoring Cascade R-CNN model on the UIT-DODV dataset through the results visualized in Fig. 8, in comparison with the baseline model. The baseline model using Guided Anchoring is Faster R-CNN. Both methods, Faster R-CNN in Fig. 8a and Cascade R-CNN in Fig. 8b, were run in combination with Double-Head R-CNN so that the results can be compared objectively. Faster R-CNN performed very well on all four classes. However, it still produced overlapping boxes in the ‘Table’ class, a problem that affects model accuracy. With the Cascade R-CNN method, this problem no longer appears. This shows that Cascade R-CNN handles the characteristics of document objects in general and of the UIT-DODV dataset in particular.

Fig. 7. Visualized prediction results of three models on the UIT-DODV dataset: (a) Double-Head R-CNN, (b) Libra R-CNN, and (c) Guided Anchoring (green: table, blue: figure, red: caption, yellow: formula).

Fig. 8. Comparison between Faster R-CNN and Cascade R-CNN using Double-Head R-CNN (green: table, blue: figure, red: caption, yellow: formula).

TABLE I: Experiment results on the UIT-DODV dataset (all values in %)

Method                                  Table  Figure  Caption  Formula  AP50  AP75  mAP
Double-Head R-CNN                       91.5   80.6    65.6     46.3     88.7  78.4  71.0
Libra R-CNN                             92.5   81.0    68.2     46.0     89.6  79.1  71.9
Guided Anchoring Faster R-CNN           92.7   81.6    73.3     46.9     91.0  80.8  73.6
CascadeTabNet + Fused Loss [4]          94.3   83.0    73.3     47.5     89.1  81.6  74.5
Guided Anchoring Cascade R-CNN (Ours)   95.4   84.8    75.9     50.5     91.8  83.1  76.6
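As a consistency check on Table I, the sketch below averages the four per-class AP values of each row and compares the result with the reported mAP. It assumes the columns are ordered Table, Figure, Caption, Formula, AP50, AP75, mAP, which matches the values quoted in the text. Small differences are expected because the published values are rounded.

```python
from statistics import mean

# Rows of Table I: ([Table, Figure, Caption, Formula] AP in %, reported mAP in %).
TABLE_I = {
    "Double-Head R-CNN": ([91.5, 80.6, 65.6, 46.3], 71.0),
    "Libra R-CNN": ([92.5, 81.0, 68.2, 46.0], 71.9),
    "Guided Anchoring Faster R-CNN": ([92.7, 81.6, 73.3, 46.9], 73.6),
    "CascadeTabNet + Fused Loss": ([94.3, 83.0, 73.3, 47.5], 74.5),
    "Guided Anchoring Cascade R-CNN (Ours)": ([95.4, 84.8, 75.9, 50.5], 76.6),
}

for name, (class_aps, reported) in TABLE_I.items():
    print(f"{name}: class mean = {mean(class_aps):.2f}, reported mAP = {reported}")

# Gain of the proposed model over the published baseline, in percentage points.
gain = (TABLE_I["Guided Anchoring Cascade R-CNN (Ours)"][1]
        - TABLE_I["CascadeTabNet + Fused Loss"][1])
print(f"gain over baseline: {gain:.1f} points")
```

Every row's class mean agrees with its reported mAP to within rounding, which supports reading the mAP column as the mean of the four per-class APs.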

V. CONCLUSION AND FUTURE WORK

Recognizing that the demand for information extraction and document data analysis is increasing, we conducted an in-depth study of object detection on the Vietnamese document dataset UIT-DODV. After experimenting with state-of-the-art models such as Double-Head R-CNN, Libra R-CNN, and Guided Anchoring, we obtained the highest result with Guided Anchoring at 73.6% mAP. On this basis, we propose the object detection model Guided Anchoring Cascade R-CNN, a combination of two methods: Guided Anchoring and Cascade R-CNN. Our proposed model reaches 76.6% mAP, 2.1 percentage points higher than the baseline model on the UIT-DODV dataset.

ACKNOWLEDGMENT

This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number DSC2021-26-03. This work was supported by the MMLab at the University of Information Technology, VNU-HCM. We would also like to thank the UIT-Together research group for sharing their insights with us during this research.

REFERENCES

[1] Jwalin Bhatt, Khurram Azeem Hashmi, Muhammad Zeshan Afzal, and Didier Stricker. “A Survey of Graphical Page Object Detection with Deep Neural Networks”. In: Applied Sciences 11.12 (2021). ISSN: 2076-3417. DOI: 10.3390/app11125344. URL: https://www.mdpi.com/2076-3417/11/12/5344

[2] Zhaowei Cai and Nuno Vasconcelos. “Cascade R-CNN: Delving into high quality object detection”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6154–6162.

[3] Kai Chen et al. “MMDetection: Open MMLab Detection Toolbox and Benchmark”. In: arXiv preprint arXiv:1906.07155 (2019).

[4] Linh Truong Dieu, Thuan Trong Nguyen, Nguyen D Vo, Tam V Nguyen, and Khang Nguyen. “Parsing Digitized Vietnamese Paper Documents”. In: Computer Analysis of Images and Patterns. Cham: Springer International Publishing, 2021, pp. 382–392.

[5] Kaiwen Duan et al. “CenterNet: Keypoint triplets for object detection”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 6569–6578.

[6] Tsung-Yi Lin et al. “Feature pyramid networks for object detection”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 2117–2125.

[7] Tsung-Yi Lin et al. “Microsoft COCO: Common objects in context”. In: European Conference on Computer Vision. Springer, 2014, pp. 740–755.

[8] Duong Phi Long, Nguyen Trung Hieu, Nguyen Thanh Tuong Vi, Vo Duy Nguyen, and Nguyen Tan Tran Minh Khang. “Phát hiện bảng trong tài liệu dạng ảnh sử dụng phương pháp định vị góc CornerNet” [Table detection in document images using the CornerNet corner localization method]. In: Proceedings of Fundamental and Applied Information Technology Research (FAIR). 2020.

[9] Thuan Trong Nguyen, Thuan Q Nguyen, Long Duong, Nguyen D Vo, and Khang Nguyen. “CDeRSNet: Towards High Performance Object Detection in Vietnamese Documents Images”. In: International Conference on Multimedia Modelling (MMM). 2022.

[10] Jiangmiao Pang et al. “Libra R-CNN: Towards balanced learning for object detection”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 821–830.

[11] Bui Hai Phong, Thang Manh Hoang, and Thi-Lan Le. “An end-to-end framework for the detection of mathematical expressions in scientific document images”. In: Expert Systems (2021), e12800.

[12] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. 2016. arXiv: 1506.01497 [cs.CV].

[13] Ningning Sun, Yuanping Zhu, and Xiaoming Hu. “Table Detection Using Boundary Refining via Corner Locating”. In: Pattern Recognition and Computer Vision. Ed. by Zhouchen Lin et al. Cham: Springer International Publishing, 2019, pp. 135–146. ISBN: 978-3-030-31654-9.

[14] Yunong Tian et al. “Apple detection during different growth stages in orchards using the improved YOLO-V3 model”. In: Computers and Electronics in Agriculture 157 (2019), pp. 417–426.

[15] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. “FCOS: Fully convolutional one-stage object detection”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 9627–9636.

[16] Nguyen D Vo, Khanh Nguyen, Tam V Nguyen, and Khang Nguyen. “Ensemble of deep object detectors for page object detection”. In: Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication. 2018, pp. 1–6.

[17] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. “Region proposal by guided anchoring”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 2965–2974.

[18] Dihua Wu, Shuaichao Lv, Mei Jiang, and Huaibo Song. “Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments”. In: Computers and Electronics in Agriculture 178 (2020), p. 105742.

[19] Yue Wu et al. “Rethinking classification and localization for object detection”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10186–10195.

[20] Junaid Younas et al. “FFD: Figure and formula detection from document images”. In: 2019 Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2019, pp. 1–7.

Title	Guided Anchoring Cascade R-CNN: An Intensive Improvement of R-CNN in Vietnamese Document Detection
Authors	Hai Le, Truong Nguyen, Vy Le, Thuan Trong Nguyen, Nguyen D. Vo, Khang Nguyen
Institution	Vietnam National University, University of Information Technology
Field	Computer Vision / Document Object Detection
Type	Conference Paper
Year of publication	2021
City	Ho Chi Minh
