Guided Anchoring Cascade R-CNN: An intensive improvement of R-CNN in Vietnamese Document Detection
Hai Le
Vietnam National University
University of Information Technology
Ho Chi Minh, Viet Nam
20520481@gm.uit.edu.vn
Truong Nguyen
Vietnam National University
University of Information Technology
Ho Chi Minh, Viet Nam
20522087@gm.uit.edu.vn
Vy Le
Vietnam National University
University of Information Technology
Ho Chi Minh, Viet Nam
20520355@gm.uit.edu.vn
Thuan Trong Nguyen
Vietnam National University
University of Information Technology
Ho Chi Minh, Viet Nam
18521471@gm.uit.edu.vn
Nguyen D Vo
Vietnam National University
University of Information Technology
Ho Chi Minh, Viet Nam
nguyenvd@uit.edu.vn
Khang Nguyen
Vietnam National University
University of Information Technology
Ho Chi Minh, Viet Nam
khangnttm@uit.edu.vn
Abstract: As digital documents gradually replace paper documents, the need to extract information from them is increasing and has become one of the main interests in computer vision, particularly in the reading comprehension of document images. Detecting objects in document images (figures, tables, formulas) is a prerequisite for analyzing and extracting information from documents. Previous studies have mostly focused on English documents. In this study, we experiment on UIT-DODV, a Vietnamese document image dataset that includes four classes: Table, Figure, Caption, and Formula. We test common state-of-the-art object detection models, namely Double-Head R-CNN, Libra R-CNN, and Guided Anchoring, and achieve the highest result with Guided Anchoring at 73.6% mAP. We assume that high-quality anchor boxes are key to the success of anchor-based object detection models, and therefore adopt Guided Anchoring in our research. Moreover, we attempt to raise the quality of the predicted bounding boxes by using the Cascade R-CNN architecture, whose multi-stage scheme filters out as many confusing bounding boxes as possible. Based on the initial evaluation results of the common state-of-the-art models, we propose an object detection model for Vietnamese document images based on Cascade R-CNN and Guided Anchoring. Our proposed model achieves up to 76.6% mAP, 2.1 percentage points higher than the baseline model on the UIT-DODV dataset.
Index Terms: Document Object Detection, Vietnamese Document Images, Cascade R-CNN, Guided Anchoring
I. INTRODUCTION
The document digitization process has been under way in many organizations and businesses since the start of the Industry 4.0 era. Traditional documents such as papers, books, and invoices are gradually being transformed into and replaced by digital documents (PDF, Word, Excel) stored on cloud computing services for convenient access, searching, and archiving. With such a large number of documents, document search becomes more difficult than ever. Thus, a good model to identify the elements in document images is necessary. We frame this task as an object detection problem. Document Object Detection (DOD) [1] [8] [16] aims at the automatic detection of important elements (Caption, Table, Figure, Formula) (Fig. 1) and of the structure of the document page. Current document detection models [13] [20] [11] mostly target common languages such as English and Mandarin Chinese. Vietnamese documents [4] [9], however, pose many challenges because of their different presentation styles. In this research, we focus on improving the performance of document object detection in Vietnamese document images. Specifically, we experiment with three state-of-the-art methods, Double-Head R-CNN [19], Libra R-CNN [10], and Guided Anchoring [17], on the UIT-DODV dataset [4], which is the first Vietnamese dataset annotated with Caption, Table, Figure, and Formula objects. The main feature of UIT-DODV is that it consists of Vietnamese documents, which bring many new challenges. For example, the way semantic objects are presented makes feature extraction difficult; formulas, for instance, appear not only in standard mathematical form but also in non-mathematical forms. We experiment with common object detection models on the UIT-DODV dataset and obtain an initial best result of 73.6% mAP with Guided Anchoring. Based on the analysis of the performance of these models, we propose an object detection model based on Cascade R-CNN and Guided Anchoring. Our proposal achieves up to 76.6% mAP, which is 2.1 percentage points higher than the baseline presented in [4].
The rest of the paper is organized as follows. Section II reviews related work. Section III introduces the methods used in our experiments. Section IV presents the experiments and analyzes the obtained results. Section V concludes the paper and addresses the issues that need to be researched in the future.
Fig. 1. Document object detection: (a) input, (b) output. The input is the document image, and the output is the likely object locations in the image. The green (table), blue (figure), red (caption), and yellow (formula) bounding boxes indicate detected objects in the image.
II. RELATED WORK
A. One-stage Object Detection
Duan et al. [5] proposed CenterNet, which recasts object detection as a keypoint estimation problem and then infers the object size and computes the bounding box from the detected keypoints. FCOS [15] is a fully convolutional one-stage object detector that solves object detection in a per-pixel prediction fashion. FCOS uses multi-level prediction to improve recall and to resolve overlapping bounding boxes, and adds a “centerness” branch to suppress low-quality bounding boxes predicted far from object centers, which improves overall performance. YOLOv4 [18] is a series of speed improvements over YOLOv3 [14], combining a CSPNet architecture with a Darknet-53 backbone (as in YOLOv3).
B. Two-stage Object Detection
Faster R-CNN [12] replaces Selective Search with a Region Proposal Network (RPN) to extract regions of the image that potentially contain objects. It then proceeds similarly to Fast R-CNN but is much faster and is designed as an end-to-end trainable network. FPN [6], proposed by Lin et al., uses a top-down architecture combined with lateral connections so that the network fully leverages high-level semantic feature maps at all scales. As a result, FPN has brought remarkable improvements to the research and application of pyramid features. Cai and Vasconcelos [2] proposed Cascade R-CNN as a high-quality object detector with different heads used at different stages. Each head is designed for a specific IoU threshold, increasing from small to large. Cascade R-CNN reduces overfitting during training and reduces the quality mismatch at inference time.
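To make the role of the per-stage IoU thresholds concrete, the following minimal sketch computes the IoU between proposals and a ground-truth box and shows how the set of positive samples shrinks as the threshold grows. The threshold values (0.5, 0.6, 0.7), the function names, and the toy boxes are assumptions chosen for illustration. The sketch covers only the labeling step; a real cascade also refines the boxes between stages.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Assumed per-stage IoU thresholds, increasing from small to large.
STAGE_IOU_THRESHOLDS = (0.5, 0.6, 0.7)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positives_per_stage(
    proposals: Sequence[Box],
    gt: Box,
    thresholds: Sequence[float] = STAGE_IOU_THRESHOLDS,
) -> List[List[Box]]:
    """For each stage, keep the proposals that count as positives
    at that stage's IoU threshold (labeling only, no box refinement)."""
    return [[p for p in proposals if iou(p, gt) >= t] for t in thresholds]


gt_box = (0.0, 0.0, 10.0, 10.0)
toy_proposals = [
    (0.0, 0.0, 10.0, 10.0),  # perfect match
    (1.0, 0.0, 11.0, 10.0),  # shifted by 1
    (2.0, 0.0, 12.0, 10.0),  # shifted by 2
    (3.0, 0.0, 13.0, 10.0),  # shifted by 3
    (5.0, 0.0, 15.0, 10.0),  # shifted by 5
]
stage_positives = positives_per_stage(toy_proposals, gt_box)
```

Later stages see fewer but better-localized positives, which is the sense in which each head is specialized for a higher-quality range.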
III. METHOD
A. Previous Methods for Improving R-CNN
1) Double-Head R-CNN: Double-Head R-CNN [19] is built on the Feature Pyramid Network (FPN), as presented in Fig. 2. In contrast to previous methods, which use a single head to extract region of interest (RoI) features for both classification and bounding box regression, the Double-Head R-CNN architecture is divided into two dedicated heads, one for classification and one for localization.
Fig. 2. Differences between single head and double head: (a) a single fully connected (2-fc) head, (b) a single convolution head, (c) Double-Head, which is split into a fully connected head and a convolution head for the classification and localization tasks, respectively, and (d) Double-Head-Ext, an extension of Double-Head that introduces supervision from unfocused tasks during training and leverages classification scores from both heads for final predictions [19].
Wu et al. [19] suggested that the fc-head handles the classification task remarkably better than the conv-head and that, conversely, the conv-head handles bounding box regression better than the fc-head. Thus, a double-head architecture (a combination of conv-head and fc-head) lays the groundwork for outstanding results in object detection. Such an architecture can fully exploit the strengths of the fc-head and the conv-head for classification and localization, respectively. Moreover, in the Double-Head-Ext architecture, the fc-head and the conv-head support each other in the classification problem to obtain the best result.
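As an illustration of how the two heads can support each other in classification, the sketch below fuses the two classification scores with a complementary rule, s = s_fc + s_conv (1 - s_fc). This form is our recollection of the fusion described by Wu et al. [19] and should be treated as an assumption; the function name is hypothetical.

```python
def fuse_scores(s_fc: float, s_conv: float) -> float:
    """Complementary fusion of two class probabilities (assumed form):
    s = s_fc + s_conv * (1 - s_fc), which is symmetric in both inputs."""
    if not (0.0 <= s_fc <= 1.0 and 0.0 <= s_conv <= 1.0):
        raise ValueError("scores must be probabilities in [0, 1]")
    return s_fc + s_conv * (1.0 - s_fc)


# The conv-head score only fills the part of the probability mass
# that the fc-head score has left open.
fused = fuse_scores(0.8, 0.5)
```

Under this rule the fused score is never lower than either input and never exceeds 1, so a confident fc-head dominates while the conv-head can still raise uncertain predictions.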
2) Libra R-CNN: Pang et al. [10] proposed Libra R-CNN, based on FPN Faster R-CNN, which aims to balance the training process. Libra R-CNN proved to be highly efficient, and its design can be applied to various backbones for both single-stage and two-stage detectors. As observed in several previous methods, the imbalance that limits object detection performance generally occurs at three levels: the sample level, the feature level, and the objective level.
Fig. 3. Pipeline and heatmap visualization of balanced feature pyramid [10].
Libra R-CNN addresses the imbalance at all three levels. At the sample level, Pang et al. designed a probability function to pick a certain number of negative samples from the corresponding candidates. At the feature level, the imbalance occurs because the features obtained at each layer are not equal, so each level carries different information. For that reason, Pang et al. balanced the amount of information at each level of the pyramid architecture to capture balanced semantic features through the transformations shown in Fig. 3. In addition, the authors applied a non-local module to refine the obtained features. Finally, the balanced L1 loss is proposed to balance the effects of the two tasks, classification and localization, and the gradient contributions of the obtained samples.
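The following sketch illustrates the balanced L1 loss on a single regression error x. The piecewise form and the default values alpha = 0.5 and gamma = 1.5 follow our reading of Pang et al. [10] and are assumptions here, as are the function names. The constant b is chosen so that alpha * ln(b + 1) = gamma, which makes both the loss and its gradient continuous at |x| = 1.

```python
import math


def balanced_l1(x: float, alpha: float = 0.5, gamma: float = 1.5) -> float:
    """Balanced L1 loss for one regression error x (assumed form)."""
    b = math.exp(gamma / alpha) - 1.0  # enforces alpha * ln(b + 1) = gamma
    ax = abs(x)
    if ax < 1.0:
        # Inlier branch: gradient alpha * ln(b|x| + 1) grows faster than |x|.
        return alpha / b * (b * ax + 1.0) * math.log(b * ax + 1.0) - alpha * ax
    # Outlier branch: linear with slope gamma, offset for continuity at |x| = 1.
    return gamma * ax + gamma / b - alpha


def balanced_l1_grad(x: float, alpha: float = 0.5, gamma: float = 1.5) -> float:
    """Magnitude of the derivative of balanced_l1 with respect to x."""
    b = math.exp(gamma / alpha) - 1.0
    ax = abs(x)
    return alpha * math.log(b * ax + 1.0) if ax < 1.0 else gamma
```

Compared with smooth L1, whose gradient magnitude for an inlier equals |x|, this loss gives small errors a larger gradient while capping the gradient of outliers at gamma, which is how it rebalances the contributions of easy and hard samples.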
3) Guided Anchoring: Guided Anchoring [17] is designed to improve the region proposal generation process and therefore takes the Region Proposal Network (RPN) as its baseline. It predicts the locations where the centers of objects of interest are likely to exist, together with the scales and aspect ratios at those locations. In particular, Guided Anchoring dispenses with anchor generation based on default parameters. In other words, an anchor’s scale and aspect ratio can now vary dynamically instead of being fixed as before, which creates diversity during the training process. Wang et al. also study the influence of high-quality proposals on two-stage detectors.
Fig. 4. Basic framework of Guided Anchoring [17].
In the Guided Anchoring method, Wang et al. applied an anchor generation module that contains two branches, one for location prediction and one for anchor shape prediction, as shown in Fig. 4. The Anchor Location Prediction branch predicts the anchor’s location by creating a probability map that indicates the likely locations of objects in the image. The Anchor Shape Prediction branch uses the (x, y) coordinates of the locations generated in the previous step to predict the values (w, h), which represent the width and height of the anchor, respectively. In this way, the model can generate the anchor shapes that most closely match the ground-truth bounding boxes. However, the authors found empirically that predicting these two values directly is not stable. They therefore apply a transformation to the output of a sub-network that uses a 1 × 1 convolution to generate an appropriate map.
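The two branches can be sketched on plain Python lists as follows. We assume the shape transformation w = sigma * s * exp(dw), h = sigma * s * exp(dh), where s is the feature stride and sigma is an empirical scale factor (8 in our reading of Wang et al. [17]). The location threshold, the function names, and the toy inputs are hypothetical.

```python
import math
from typing import List, Tuple

Anchor = Tuple[float, float, float, float]  # (cx, cy, w, h)


def decode_anchor_shape(
    dw: float, dh: float, stride: int, sigma: float = 8.0
) -> Tuple[float, float]:
    """Map the raw outputs (dw, dh) to a width and height in pixels.
    Predicting dw, dh instead of w, h keeps the target range small."""
    scale = sigma * stride
    return scale * math.exp(dw), scale * math.exp(dh)


def generate_anchors(
    loc_prob: List[List[float]],
    shape_pred: List[List[Tuple[float, float]]],
    stride: int,
    loc_thr: float = 0.5,  # hypothetical threshold on the probability map
    sigma: float = 8.0,
) -> List[Anchor]:
    """Keep one anchor at every location whose probability reaches loc_thr."""
    anchors = []
    for i, row in enumerate(loc_prob):
        for j, prob in enumerate(row):
            if prob < loc_thr:
                continue  # unlikely object center: no anchor is placed here
            dw, dh = shape_pred[i][j]
            w, h = decode_anchor_shape(dw, dh, stride, sigma)
            # Center of feature-map cell (i, j) in image coordinates.
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            anchors.append((cx, cy, w, h))
    return anchors


toy_prob = [[0.9, 0.1], [0.2, 0.7]]
toy_shape = [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (math.log(2.0), 0.0)]]
toy_anchors = generate_anchors(toy_prob, toy_shape, stride=8)
```

Only two of the four toy locations produce an anchor, each with its own shape, which is the sparse and dynamic behaviour that distinguishes this scheme from a dense grid of fixed anchors.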
B. Proposed Method
As mentioned in Section III-A, Double-Head R-CNN, Libra R-CNN, and Guided Anchoring are common architectural additions that improve the results of the Faster R-CNN model. Each of them can therefore be considered a module that can easily be attached to other detectors; in the original works, the authors attached them to Faster R-CNN. In addition, Cascade R-CNN is a multi-stage extension of R-CNN in which the detector stages become sequentially more selective, thereby addressing the limitations of Faster R-CNN discussed in [2].
Fig. 5. Guided Anchoring Cascade R-CNN architecture.
Thanks to the anchor generation scheme of Guided Anchoring, we can leverage semantic features to guide the anchoring. We also aim to improve the training phase by fully exploiting the potential of Cascade R-CNN [2], which can filter out most poor bounding boxes and reduce the overfitting problem. Therefore, we design a model based on the Cascade R-CNN architecture combined with Guided Anchoring to improve object detection performance on document images. Recognizing the power of multi-level features, we use FPN in this method to capture the valuable information represented in the feature maps at each FPN level. ResNeXt-101 serves as our feature extractor ahead of the FPN; the architecture is shown in Fig. 5. Guided Anchoring, introduced by Wang et al., reduces the number of anchor boxes generated compared with previous methods, specifically RPN, as shown in Fig. 6, thereby reducing cost and significantly increasing processing speed during training. The resulting proposals and features are then fed into the Cascade R-CNN network.
Fig. 6. Comparison between RPN proposals (top row) and GA-RPN proposals (bottom row) [17].
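The composition of the proposed model can be summarized as an MMDetection-style description. The sketch below is a plain Python dictionary, not a complete configuration that MMDetection would accept as is. The `type` strings are the registry names as we understand them, the number of cascade stages is an assumption, and all other fields (anchor generators, losses, training settings) are deliberately omitted. The helper `describe` is hypothetical.

```python
# Illustrative component layout only; every omitted field would have to be
# filled in from the toolbox's reference configs before training.
model = dict(
    type="CascadeRCNN",
    backbone=dict(type="ResNeXt", depth=101),            # feature extractor
    neck=dict(type="FPN"),                               # multi-level features
    rpn_head=dict(type="GARPNHead"),                     # Guided Anchoring proposals
    roi_head=dict(type="CascadeRoIHead", num_stages=3),  # assumed 3 stages
)


def describe(cfg: dict) -> str:
    """Return the data flow of the detector as a readable string."""
    backbone = f"{cfg['backbone']['type']}-{cfg['backbone']['depth']}"
    parts = [backbone, cfg["neck"]["type"],
             cfg["rpn_head"]["type"], cfg["roi_head"]["type"]]
    return " -> ".join(parts)


pipeline = describe(model)
```

Read from left to right, the string follows the order in which an image passes through the backbone, the feature pyramid, the guided-anchoring proposal head, and the cascade of detection heads.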
IV. EXPERIMENT
A. Experiment Settings
Dataset: We conducted experiments with the UIT-DODV dataset, which includes 1440 images for the training set, 234 images for the validation set, and 720 images for the testing set.
Configuration: We run our experiments on two RTX 2080 Ti GPUs using MMDetection [3].
Evaluation Metric: To compare with the baseline results on the UIT-DODV dataset, we use mAP (mean Average Precision) [7] for evaluation. The mAP score is the average of AP over all classes. For each class, we calculate AP at different IoU thresholds and take their average to obtain that class’s AP. We also report the AP50 and AP75 metrics, computed at IoU thresholds of 0.5 and 0.75, respectively.
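The averaging described above can be checked with a few lines of code. The per-class values below are those reported for the proposed model in Table I, assuming that the first four columns of that table correspond to Table, Figure, Caption, and Formula, as the analysis text suggests. The function name is hypothetical.

```python
from statistics import mean
from typing import Dict

# Per-class AP (%) of Guided Anchoring Cascade R-CNN, read from Table I
# under the column assumption stated above.
per_class_ap: Dict[str, float] = {
    "Table": 95.4,
    "Figure": 84.8,
    "Caption": 75.9,
    "Formula": 50.5,
}


def mean_average_precision(ap_by_class: Dict[str, float]) -> float:
    """mAP is the unweighted mean of the per-class AP values."""
    return mean(list(ap_by_class.values()))


map_score = mean_average_precision(per_class_ap)
```

The mean of the four rounded per-class values is about 76.65, which agrees with the reported 76.6% mAP up to the rounding of the individual entries.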
B. Analysis
First, we evaluated the effectiveness of three models, Double-Head R-CNN, Libra R-CNN, and Guided Anchoring, on the AP50, AP75, and mAP metrics. Our empirical results are summarized in Table I. Overall, the best results among these common object detection methods came from Guided Anchoring, with AP50, AP75, and mAP of 91.0%, 80.8%, and 73.6%, respectively. In contrast, Double-Head R-CNN achieved the lowest results, with AP50 and AP75 of 88.7% and 78.4% and mAP of 71.0%.
Next, we analyzed the performance on each class. Guided Anchoring yielded the highest accuracy of the three methods on every class of the dataset, with AP on the ‘Table’, ‘Figure’, ‘Caption’, and ‘Formula’ classes of 92.7%, 81.6%, 73.3%, and 46.9%, respectively. Meanwhile, Double-Head R-CNN reached only 91.5%, 80.6%, 65.6%, and 46.3%, trailing on every class and most clearly on ‘Caption’ (7.7 points lower). The performance of Double-Head R-CNN is not necessarily poor. However, comparing the Double-Head R-CNN architecture with Guided Anchoring Cascade R-CNN, we can say that Guided Anchoring brings many noteworthy improvements. Moreover, features in the training process are filtered carefully, which assures the quality of the anchors. The results of Guided Anchoring are therefore convincing. The initial results of the default Guided Anchoring model, which uses the same Faster R-CNN baseline as Double-Head and Libra, showed outstanding efficiency. Thus, we propose using Cascade R-CNN alongside Guided Anchoring, as described in Section III-B. The results show that our proposed model reaches an mAP of up to 76.6%, which is 2.1 percentage points higher than the baseline published by Dieu et al. [4], who used CascadeTabNet with Fused Loss and achieved 74.5%.
After that, we visualize the prediction results of the three methods, Double-Head R-CNN, Libra R-CNN, and Guided Anchoring, in Fig. 7. The ‘Table’ class still has false predictions with Double-Head R-CNN (Fig. 7a) and Libra R-CNN (Fig. 7b), whereas Guided Anchoring (Fig. 7c) detects it correctly. Moreover, the features of the ‘Figure’ and ‘Caption’ classes caused many difficulties for the prediction process of Double-Head R-CNN. Although Libra R-CNN did a good job identifying ‘Figure’ and ‘Caption’, it mispredicted ‘Formula’ objects. Overall, Guided Anchoring possesses the strengths of the other two methods and correctly predicts all four classes, as we can observe in the visualization.
TABLE I: Experiment results on the UIT-DODV dataset (%)
Method                                   Table  Figure  Caption  Formula  AP50  AP75  mAP
Double-Head R-CNN                        91.5   80.6    65.6     46.3     88.7  78.4  71.0
Libra R-CNN                              92.5   81.0    68.2     46.0     89.6  79.1  71.9
Guided Anchoring Faster R-CNN            92.7   81.6    73.3     46.9     91.0  80.8  73.6
CascadeTabNet + Fused Loss [4]           94.3   83.0    73.3     47.5     89.1  81.6  74.5
Guided Anchoring Cascade R-CNN (Ours)    95.4   84.8    75.9     50.5     91.8  83.1  76.6
Fig. 7. Visualization of the prediction results of three models on the UIT-DODV dataset: (a) Double-Head R-CNN, (b) Libra R-CNN, (c) Guided Anchoring (green: table, blue: figure, red: caption, yellow: formula).
In addition, we evaluate the effectiveness of our proposed Guided Anchoring Cascade R-CNN model on the UIT-DODV dataset through the results visualized in Fig. 8, comparing it with the baseline model, which applies Guided Anchoring to Faster R-CNN. Both methods, Faster R-CNN (Fig. 8a) and Cascade R-CNN (Fig. 8b), were run in combination with Double-Head R-CNN and evaluated objectively. Faster R-CNN performed very well on the object detection problem in all four classes. However, there were still overlapping detections in the ‘Table’ class, a problem that affects model accuracy. Meanwhile, this problem is completely solved by the Cascade R-CNN method. This shows that Cascade R-CNN handles well the characteristics of document objects in general and of the UIT-DODV dataset in particular.
Fig. 8. Comparison between Faster R-CNN and Cascade R-CNN using Double-Head R-CNN (green: table, blue: figure, red: caption, yellow: formula).
V. CONCLUSION AND FUTURE WORK
Recognizing that the demand for information extraction and document data analysis is increasing, we conducted in-depth studies of the object detection problem on the Vietnamese dataset UIT-DODV. After experimenting with state-of-the-art models such as Double-Head R-CNN, Libra R-CNN, and Guided Anchoring, we obtained the highest result with Guided Anchoring, at 73.6% mAP. Building on this, we propose the object detection model Guided Anchoring Cascade R-CNN, a combination of two methods: Guided Anchoring and Cascade R-CNN. Our proposed model reaches 76.6% mAP, 2.1 percentage points higher than the baseline model on the UIT-DODV dataset.
ACKNOWLEDGMENT
This research is funded by Vietnam National University HoChiMinh City (VNU-HCM) under grant number DSC2021-26-03. This work was supported by the MMLab at the University of Information Technology, VNU-HCM. We also would like to show our gratitude to the UIT-Together research group for sharing their pearls of wisdom with us during this research.
REFERENCES
[1] Jwalin Bhatt, Khurram Azeem Hashmi, Muhammad Zeshan Afzal, and Didier Stricker. “A Survey of Graphical Page Object Detection with Deep Neural Networks”. In: Applied Sciences 11.12 (2021). ISSN: 2076-3417. DOI: 10.3390/app11125344. URL: https://www.mdpi.com/2076-3417/11/12/5344.
[2] Zhaowei Cai and Nuno Vasconcelos. “Cascade r-cnn: Delving into high quality object detection”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 6154–6162.
[3] Kai Chen et al. “MMDetection: Open MMLab Detection Toolbox and Benchmark”. In: arXiv preprint arXiv:1906.07155 (2019).
[4] Linh Truong Dieu, Thuan Trong Nguyen, Nguyen D Vo, Tam V Nguyen, and Khang Nguyen. “Parsing Digitized Vietnamese Paper Documents”. In: Computer Analysis of Images and Patterns. Cham: Springer International Publishing, 2021, pp. 382–392.
[5] Kaiwen Duan et al. “Centernet: Keypoint triplets for object detection”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 6569–6578.
[6] Tsung-Yi Lin et al. “Feature pyramid networks for object detection”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 2117–2125.
[7] Tsung-Yi Lin et al. “Microsoft coco: Common objects in context”. In: European conference on computer vision. Springer. 2014, pp. 740–755.
[8] Duong Phi Long, Nguyen Trung Hieu, Nguyen Thanh Tuong Vi, Vo Duy Nguyen, and Nguyen Tan Tran Minh Khang. “Phát hiện bảng trong tài liệu dạng ảnh sử dụng phương pháp định vị góc CornerNet”. In: Proceedings of Fundamental and Applied Information Technology Research (FAIR). 2020.
[9] Thuan Trong Nguyen, Thuan Q Nguyen, Long Duong, Nguyen D Vo, and Khang Nguyen. “CDeRSNet: Towards High Performance Object Detection in Vietnamese Documents Images”. In: International Conference on Multimedia Modelling (MMM). 2022.
[10] Jiangmiao Pang et al. “Libra r-cnn: Towards balanced learning for object detection”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 821–830.
[11] Bui Hai Phong, Thang Manh Hoang, and Thi-Lan Le. “An end-to-end framework for the detection of mathematical expressions in scientific document images”. In: Expert Systems (2021), e12800.
[12] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. 2016. arXiv: 1506.01497 [cs.CV].
[13] Ningning Sun, Yuanping Zhu, and Xiaoming Hu. “Table Detection Using Boundary Refining via Corner Locating”. In: Pattern Recognition and Computer Vision. Ed. by Zhouchen Lin et al. Cham: Springer International Publishing, 2019, pp. 135–146. ISBN: 978-3-030-31654-9.
[14] Yunong Tian et al. “Apple detection during different growth stages in orchards using the improved YOLO-V3 model”. In: Computers and electronics in agriculture 157 (2019), pp. 417–426.
[15] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. “Fcos: Fully convolutional one-stage object detection”. In: Proceedings of the IEEE/CVF international conference on computer vision. 2019, pp. 9627–9636.
[16] Nguyen D Vo, Khanh Nguyen, Tam V Nguyen, and Khang Nguyen. “Ensemble of deep object detectors for page object detection”. In: Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication. 2018, pp. 1–6.
[17] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. “Region proposal by guided anchoring”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 2965–2974.
[18] Dihua Wu, Shuaichao Lv, Mei Jiang, and Huaibo Song. “Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments”. In: Computers and Electronics in Agriculture 178 (2020), p. 105742.
[19] Yue Wu et al. “Rethinking classification and localization for object detection”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, pp. 10186–10195.
[20] Junaid Younas et al. “FFD: Figure and formula detection from document images”. In: 2019 Digital Image Computing: Techniques and Applications (DICTA). IEEE. 2019, pp. 1–7.