
Developing a pipeline for table extraction in document images


DOCUMENT INFORMATION

Title: Developing a pipeline for table extraction in document images
Author: Lu Anh Khoa
Supervisor: Trần Tuấn Anh
Institution: Vietnam National University, Ho Chi Minh City - Ho Chi Minh University of Technology
Major: Computer Science
Document type: Graduation Thesis
Year: 2023
City: Ho Chi Minh City
Pages: 56
Size: 3.7 MB


GRADUATION THESIS

DEVELOPING A PIPELINE FOR TABLE EXTRACTION IN DOCUMENT IMAGES

MAJOR: COMPUTER SCIENCE

Mr. Nguyen Nam Quan
Dr. Nguyen Tien Thinh

HO CHI MINH CITY, 02/2023

UNIVERSITY OF TECHNOLOGY
FACULTY: Computer Science & Engineering
DEPARTMENT: Computer Science

GRADUATION THESIS TASK SHEET

(Note: the student must attach this sheet to the first page of the thesis report.)

FULL NAME: Lu Anh Khoa
STUDENT ID: 1852112
MAJOR: Computer Science

1. Thesis title: Xây dựng hệ thống nhận dạng bảng trong ảnh tài liệu (Developing a pipeline for table extraction in the document image)

2. Tasks (requirements on content and initial data): Table extraction is one of the most critical components of document images; we can see it almost everywhere, in every report and document. It is also a topic of great concern today in digital transformation. Table recognition and comprehension require many techniques and much research, and they are still a massive challenge for scientists and industrial applications. This thesis focuses on studying the problem of table information extraction within a whole document image processing system. The main tasks in this research are:
- Build a comprehensive pipeline for table extraction in the document image
- Develop a model for detecting table regions in the document image
- Develop a classification model to classify the detected table into a borderless table and a bordered table for extracting the table's cells
- Benchmark the models with public/private datasets

3. Date of thesis assignment: 20/12/2021

4. Date of completion: 20/12/2022

5. Supervisors and their parts of the guidance:
1) Trần Tuấn Anh - guidance on the development direction, integration, testing, and evaluation
2) Nguyễn Nam Quân - guidance on model and technology development, data collection
3) Nguyễn Tiến Thịnh - guidance on the content, presentation, layout, and research methodology

The content and requirements of the thesis have been approved by the Department.

HEAD OF DEPARTMENT (sign and write full name)
MAIN SUPERVISOR (sign and write full name)

FOR THE FACULTY/DEPARTMENT:
Reviewer (preliminary grading):

December 26, 2022

THESIS DEFENSE GRADING SHEET
(For the supervisor/reviewer)

1. Student's full name: Lu Anh Khoa
2. Topic: Developing a pipeline for table extraction in the document image (Xây dựng hệ thống nhận dạng bảng trong ảnh tài liệu)
3. Supervisor/reviewer's full name: Trần Tuấn Anh
4. Overview of the report: number of references; computing software; products:
5. Overview of the drawings: number of drawings; A1 size; A2 size; other sizes; number of hand-drawn drawings; number of computer-drawn drawings:
6. Main strengths of the thesis:
- This thesis presents a pipeline for table extraction in the document image. The student also develops a model for detecting table regions in the document image and a classification model to classify the detected table into a borderless table and a bordered table for extracting the table's cells.
- The presented pipeline is also used in many applications in an industrial environment.
- This thesis has researched many different methods and approaches and made appropriate assessments and analyses.
- The thesis has also compiled some practical data and put it into the evaluation; the construction of diverse data is very necessary.
- The student has done good experiments, evaluations, and demos.
7. Main shortcomings of the thesis:
- This thesis is inclined towards application, so the direction of research development has not been made clear.
- The overview of related works has not been fully detailed, nor is the assessment comprehensive.
8. Recommendation: Approved for defense [x]; Needs additions before defense [ ]; Not approved for defense [ ]
9. Three questions the student must answer before the committee:
- Make clear the future works, including research/pipeline development.
10. Overall assessment (in words: excellent, good, average): Excellent (Giỏi). Score: 9.3/10

Signature (full name): Trần Tuấn Anh

UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

December 28, 2022

THESIS DEFENSE GRADING SHEET
(For the reviewer)

1. Student's full name: Lu Anh Khoa
Student ID: 1852112
Major: Computer Science
2. Topic: Developing a pipeline for table extraction in document images
3. Reviewer's full name: Dr. Lê Thành Sách
4. Overview of the report: number of references; computing software; products:
5. Overview of the drawings: number of drawings; A1 size; A2 size; other sizes; number of hand-drawn drawings; number of computer-drawn drawings:
6. Main strengths of the thesis:
- The author has a strong background in deep learning and its applications for computer vision.
- The author has proposed a pipeline for table extraction in document images and focused on (implemented) two main tasks inside: table detection and table classification.
- Table detection consists of two steps: detecting tables using a YOLOv7 model and post-processing the detected tables with morphology and connected-component analysis.
- Table classification: classifying extracted tables into two classes, bordered vs. borderless, using a MobilenetV3 (Large) model.
- The proposed method can produce accurate results and can be compared with other methods in table extraction.
7. Main shortcomings of the thesis:
- The thesis needs to be re-written:
  - to add a survey of related works in the research fields (table detection and classification)
  - to add a detailed explanation of the proposed method and its results (quantitative and qualitative)
  - to add an evaluation of the processing time
8. Recommendation: Approved for defense [x]; Needs additions before defense [ ]; Not approved for defense [ ]
9. Three questions the student must answer before the committee:
10. Overall assessment (in words: excellent, good, average): Excellent (Giỏi). Score: 9/10

Signature (full name): Lê Thành Sách


DECLARATION OF AUTHORSHIP

I hereby declare that this thesis was carried out by myself under the guidance and supervision of Dr. Tran Tuan Anh, Dr. Nguyen Tien Thinh, and Mr. Nguyen Nam Quan; the work contained in it and its results are true to its author and have not violated research ethics. The data and figures presented in this thesis, used for analysis, comments, and evaluations, come from various resources through my own work and have been fully acknowledged in the reference part.

In addition, comments, reviews, and data used from other authors and organizations have been acknowledged and explicitly cited. I will take full responsibility for any fraud detected in my thesis.

HO CHI MINH CITY, Dec 2022

Author

ACKNOWLEDGMENTS

I would like to acknowledge and give my warmest thanks to my advisors, Dr. Tran Tuan Anh, Dr. Nguyen Tien Thinh, and Mr. Nguyen Nam Quan, who made this work possible. They are the ones who built the first bricks of my scientific career.

Besides, I would also like to acknowledge all of the instructors of Ho Chi Minh University of Technology, who have given me motivation, encouragement, and precious knowledge during the long road of my university life.

Last but not least, I would like to thank my family, who are always there and support me throughout my life.

ABSTRACT

The "Developing a pipeline for table extraction in document images" research topic aims to develop a system to extract tabular regions from scanned/captured document images (invoices, reports, research papers, ...) with high accuracy and reasonable response time. This thesis proposes a pipeline consisting of several steps to detect, classify, and extract data from tabular regions.

Contents

1 Introduction 7
1.1 The need for table extraction 7
1.2 The goal 7
2 Background knowledge 7
2.1 Convolution and cross-correlation in image processing 7
2.2 Convolution neural network (CNN) 8
2.2.1 Building blocks 8
2.2.2 Hyperparameters 10
2.2.3 Regularization methods 11
2.3 Image segmentation 13
2.3.1 Thresholding 14
2.3.2 K-means clustering 14
2.3.3 Trainable segmentation 14
2.4 Object detection 14
2.5 Object detection - Metric 14
2.6 Faster R-CNN 15
2.6.1 RPN (Region Proposal Network) 15
2.6.2 Anchors 16
2.6.3 RoI pooling 17
2.6.4 Detection model 17
2.6.5 Loss function 18
2.7 YOLO model 18
2.8 YOLO approach for object detection 19
2.9 YOLOv1 19
2.10 YOLOv2 20
2.11 YOLOv3 21
2.12 YOLOv4 21
2.13 YOLOv5 26
2.14 YOLOv7 architecture 27
2.14.1 CSP-ize a block 28
2.14.2 Backbone 28
2.14.3 Neck 30
2.14.4 Head 31
2.15 YOLOv7 - training techniques 32
2.15.1 Label assignment: Simple Optimal Transport Assignment (SimOTA) 32
2.15.2 Augmentation: Mosaic 34
2.15.3 Augmentation: Mixup 34
2.16 YOLOv7 - Loss function 34
2.17 Image classification 35
2.18 Image classification - Metric 35
2.19 MobilenetV3 36
2.19.1 Depthwise Separable Convolutions 36
2.19.2 Inverted residual block 37
2.19.3 Squeeze and excite (SE) 37
3 Related works 38
3.1 Traditional methods 38
3.2 Convolutional Neural Networks (CNN) 38
4 Proposed method 39
4.1 Pipeline 39
4.2 Table detection 39
4.3 Postprocess for table detection 39
4.3.1 Why? 39
4.3.2 How? 39
4.3.3 Result 40
4.4 Table classification 41
4.5 Table structured recognition 41
4.6 Structure of a table 42
4.7 Bordered table - just use some image processing 42
4.7.1 Result 43
4.8 Putting everything together 44
5 Experiments 44
5.1 Datasets 44
5.2 Training process 45
5.3 Quantitative results 45
5.3.1 Table detection - mAP 45
5.3.2 Table detection - Weighted Average F1 45
5.3.3 Table classification - Model comparisons 46
5.3.4 Speed 46
5.4 Qualitative results 47
6 Conclusion 48
6.1 Achievements 48
6.2 Limitation 48
6.3 Future works 49

List of Figures

1 A typical CNN 9

2 Maxpool 2x2: an example of pooling layer 9

3 Some activation function 10

4 Example of Dilated Convolution 11

5 Dropout visualization 12

6 Example of data augmentation 12

7 An example of image segmentation output 13

8 Different IOUs; red bounding boxes are ground truths while the green ones are predictions 15

9 RCNN architecture 16

10 Faster RCNN architecture 16

11 RPN in Faster-RCNN 16

12 Anchors 17

13 Detection model in Faster-RCNN 18

14 YOLOv1 architecture 19

15 YOLOv2 architecture 21

16 YOLOv3 backbone 22

17 YOLOv3 architecture 23

18 Darknet53 vs CSPDarknet53 23

19 CSPResBlock 24

20 Left: Sample image, Center: DropOut, Right: DropBlock 24

21 PAN structure 24

22 SPP in YOLOv4 25

23 Hard label vs Smooth label 25

24 Cosine Annealing Learning rate 25

25 Ground truth: green. Prediction: black. Using L1 yields 9.07 for all 3 cases but their IOUs differ by a large margin. IOU is also used to evaluate an object detection model, so using IOU as a loss is a logical improvement 26

26 YOLOv5 architecture 27

27 Yolov7 architecture 27

28 CSP-ized a block 28

29 ELAN block 29

30 Transition block 29

31 Yolov7 backbone 30

32 CSP-OSA block 30

33 RepConv block 31

34 Implicit knowledge 32

35 YOLOv7 head, with implicit knowledge 32

36 An example of mosaic augmentation: 4 images are "merged" together to create a new sample. Red bounding boxes annotate labels 34

37 Mixup augmentation example 34

38 Depthwise Convolution, visualized 36

39 Left: Normal convolution. Right: Depthwise separable convolution 37

40 Squeeze and excitation block 37

41 CascadeTabNet architecture from [23] 39

42 Overview 39

43 Before (left) and after (right). Notice that on the left the table does not have enough outer border lines 40

44 Examples of bordered tables 41

45 Examples of borderless tables 41

46 Original bordered table 43

47 After cell indexing, each cell has the form (start row, start col, end row, end col) 44

48 Excel result of figure 46. Note that Tesseract cannot read the text in some cells 44

49 Correct detection on publaynet dataset 47

50 Correct detection on fintabnet dataset 48

51 Wrong cases: partial detection (top), missed tables (bottom). Green boxes are ground truth while red boxes are predictions 49

List of Tables

1 Data statistics for table detection 45

2 Data statistics for table classification 45

3 Result on the test set of each dataset, table detection 45

4 Comparison with the ICDAR19 Competition on Table Detection and Recognition, track A2, with previous participants. Scores of other teams are taken directly from [11] 46

5 Comparisons between different classification models 46

6 Speed measurements 47


1 Introduction

1.1 The need for table extraction

With the trend of digital conversion, the amount of document images has increased exponentially. To automate the process of extracting information from those images, many methods have been proposed for different types of information arrangements. Besides text, tables are one of the most used methods to arrange information in documents. Their purpose is to group information related to a topic together to help the reader compare and retrieve information faster. However, because of their complex and diverse styles, it is hard to parse tabular data from document images into a well-structured, machine-readable format.

Document types that contain tables as one of the main elements include invoices, financial reports, and forms. To understand these types of documents effectively, a table extraction tool for images is necessary. That is the motivation for this thesis.

1.2 The goal

The goal of this thesis is to create a pipeline consisting of deep learning models and image processing techniques to extract tables from an input document image.

There are 4 sub-goals as follows:

• Build a comprehensive pipeline for table extraction in the document image (table extraction pipeline)

• Develop a model for detecting table regions in the document image (table detection)

• Develop a classification model to classify the detected table into a borderless table and a bordered table for extracting the table's cells (table classification)

• Benchmark the models with public/private datasets

Some constraints on tables:

• Captions and table names do not count as parts of a table (usually these elements are text and are very close to tables)

• The table types consist of bordered tables and borderless tables (defined below)

• Table size ranges from small (2 to 5 rows and/or columns, often seen in research papers) to large (many rows, occupying a large area of the image, around 80%)

• The cells, texts and lines may contain colors rather than black and white

For the input image:

• Input images must come from scanned documents or exported PDFs (different from documents captured with mobile devices)

• Image quality must be acceptable (bad image quality includes blurred, distorted, or noisy images, ...)

2 Background knowledge

2.1 Convolution and cross-correlation in image processing

In image processing, convolution is the process of transforming an image by applying a kernel over each pixel and its local neighbors across the entire image. The kernel is a matrix of values whose size and values determine the transformation effect of the convolution process.

Mathematically, the convolution between an image and a kernel can be written as:

g(x, y) = \omega * f(x, y) = \sum_{dx=-a}^{a} \sum_{dy=-b}^{b} \omega(dx, dy) \, f(x - dx, y - dy)

where g(x, y) is the filtered image, f(x, y) is the original image, and ω is the filter kernel. Every element of the filter kernel is indexed by −a ≤ dx ≤ a and −b ≤ dy ≤ b.

The difference between convolution and cross-correlation can be seen by comparing the formula above with cross-correlation, which does not flip the kernel:

(\omega \star f)(x, y) = \sum_{dx=-a}^{a} \sum_{dy=-b}^{b} \omega(dx, dy) \, f(x + dx, y + dy)

A convolution can thus be seen as a cross-correlation with the kernel rotated by 180°.

With carefully hand-crafted kernels, blurring, sharpening, and edge detection are a few of the image processing effects achievable with convolutions.
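As a minimal sketch of the two operations above (plain NumPy, written for illustration rather than taken from the thesis), note that the only difference is whether the kernel is flipped before the sliding window is applied:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image and sum elementwise products ('valid' mode)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Convolution = cross-correlation with the kernel rotated by 180 degrees."""
    return cross_correlate2d(image, np.flip(kernel))

# A classic hand-crafted kernel: the 3x3 Sobel edge detector.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
edges = convolve2d(np.random.rand(8, 8), sobel_x)
```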

2.2 Convolution neural network (CNN)

Instead of relying on hand-crafted kernels to produce the desired output, the goal of a CNN is to learn the kernel parameters themselves.

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of artificial neural network (ANN), most commonly applied to analyze visual imagery. CNNs are also known as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN), based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps. Counter-intuitively, most convolutional neural networks are not invariant to translation, due to the downsampling operations they apply to the input. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain-computer interfaces, and financial time series.

CNNs are regularized versions of multilayer perceptrons. A multilayer perceptron usually means a fully connected network, that is, one where each neuron in one layer is connected to all neurons in the next layer. The "full connectivity" of these networks makes them prone to overfitting data. Typical ways of regularization, or preventing overfitting, include penalizing parameters during training (such as weight decay) or trimming connectivity (skipped connections, dropout, etc.). CNNs take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble patterns of increasing complexity using smaller and simpler patterns embossed in their filters. Therefore, on a scale of connectivity and complexity, CNNs are on the lower extreme.

Convolutional networks were inspired by biological processes, in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are hand-engineered. This independence from prior knowledge and human intervention in feature extraction is a major advantage.

A typical CNN can be seen in figure 1

2.2.1 Building blocks

Convolutional layer

The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input, producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.


Figure 1: A typical CNN

Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as the output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.

Pooling layer

Another important concept of CNNs is pooling, which is a form of non-linear down-sampling. There are several non-linear functions to implement pooling, of which max pooling is the most common. It partitions the input image into a set of rectangles and, for each such sub-region, outputs the maximum.

Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint, and amount of computation in the network, and hence to also control overfitting. This is known as down-sampling. It is common to periodically insert a pooling layer between successive convolutional layers (each one typically followed by an activation function, such as a ReLU layer) in a CNN architecture. While pooling layers contribute to local translation invariance, they do not provide global translation invariance in a CNN, unless a form of global pooling is used. The pooling layer commonly operates independently on every depth, or slice, of the input and resizes it spatially. A very common form of max pooling is a layer with filters of size 2×2, applied with a stride of 2, which subsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations:

f_{X,Y}(S) = \max_{a,b \in \{0,1\}} S_{2X+a,\, 2Y+b}

A visualization can be seen in figure 2

Figure 2: Maxpool 2x2: an example of pooling layer
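A minimal NumPy sketch of the 2×2, stride-2 max pooling just described (illustrative only, assuming even input height and width):

```python
import numpy as np

def maxpool2x2(x):
    """2x2 max pooling with stride 2: keep the maximum of each 2x2 block."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)   # group pixels into 2x2 tiles
    return blocks.max(axis=(1, 3))             # f_{X,Y}(S) = max_{a,b} S_{2X+a, 2Y+b}

x = np.arange(16, dtype=float).reshape(4, 4)
print(maxpool2x2(x))  # [[ 5.  7.] [13. 15.]] -- 75% of activations discarded
```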


Other functions can also be used to increase nonlinearity, for example the saturating hyperbolic tangent f(x) = tanh(x) and the sigmoid function σ(x) = (1 + e^{−x})^{−1}. ReLU is often preferred to other functions because it trains the neural network several times faster without a significant penalty to generalization accuracy.

Some commonly used activation functions are shown in figure 3.

Figure 3: Some activation function

Fully connected layer

After several convolutional and max pooling layers, the final classification is done via fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer, as seen in regular (non-convolutional) artificial neural networks. Their activations can thus be computed as an affine transformation: a matrix multiplication followed by a bias offset (vector addition of a learned or fixed bias term).

2.2.2 Hyperparameters

Number of feature maps

Since feature map size decreases with depth, layers near the input layer tend to have fewer filters while higher layers can have more. To equalize computation at each layer, the product of feature values with pixel positions is kept roughly constant across layers. Preserving more information about the input would require keeping the total number of activations (number of feature maps times number of pixel positions) non-decreasing from one layer to the next.

The number of feature maps directly controls the capacity and depends on the number of available examples and the task complexity.

Filter size

Common filter sizes found in the literature vary greatly and are usually chosen based on the data set. The challenge is to find the right level of granularity so as to create abstractions at the proper scale, given a particular data set, and without overfitting.

Pooling type and size

Max pooling is typically used, often with a 2×2 dimension. This implies that the input is drastically downsampled, reducing processing cost.

Large input volumes may warrant 4×4 pooling in the lower layers. However, greater pooling reduces the dimension of the signal and may result in unacceptable information loss. Often, non-overlapping pooling windows perform best.

Dilation

Dilation involves ignoring pixels within a kernel. This reduces processing and memory costs, potentially without significant signal loss. A dilation of 2 on a 3×3 kernel expands the kernel to 5×5 while still processing 9 (evenly spaced) pixels. Accordingly, a dilation of 4 expands the kernel to 9×9.

Figure 4: Example of Dilated Convolution

2.2.3 Regularization methods

Dropout

Because a fully connected layer occupies most of the parameters, it is prone to overfitting. One method to reduce overfitting is dropout. At each training stage, individual nodes are either "dropped out" of the net (ignored) with probability 1−p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed. Only the reduced network is trained on the data in that stage. The removed nodes are then reinserted into the network with their original weights.

In the training stages, p is usually 0.5; for input nodes, it is typically much higher because information is directly lost when input nodes are ignored.

At testing time after training has finished, we would ideally like to find a sample average of all possible 2^n dropped-out networks; unfortunately this is unfeasible for large values of n. However, we can find an approximation by using the full network with each node's output weighted by a factor of p, so the expected value of the output of any node is the same as in the training stages. This is the biggest contribution of the dropout method: although it effectively generates 2^n neural nets, and as such allows for model combination, at test time only a single network needs to be tested.

By avoiding training all nodes on all training data, dropout decreases overfitting. The method also significantly improves training speed. This makes model combination practical, even for deep neural networks. The technique seems to reduce node interactions, leading them to learn more robust features that better generalize to new data.

A visualization is shown in figure 5.

Figure 5: Dropout visualization
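The following is a small illustrative NumPy sketch of this scheme (classic dropout with test-time scaling by p; written for this text, not taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p=0.5, train=True):
    """Keep each node with probability p during training; scale by p at test time."""
    if train:
        mask = rng.random(activations.shape) < p   # True = node is kept
        return activations * mask                  # dropped nodes output 0
    # Test time: approximate the average over all 2^n thinned networks
    # by weighting every node's output by its keep probability p.
    return activations * p

h = rng.standard_normal(5)
print(dropout_forward(h, p=0.5, train=True))   # roughly half the nodes zeroed
print(dropout_forward(h, p=0.5, train=False))  # expected value matches training
```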

Artificial data / Data augmentation

Because the degree of model overfitting is determined by both its power and the amount of training it receives, providing a convolutional network with more training examples can reduce overfitting. Because these networks are usually trained with all available data, one approach is to either generate new data from scratch (if possible) or perturb existing data to create new samples. Data augmentation can range from simple image processing, like changing the hue, scale, or rotation angle of an existing image, to more modern techniques like mixup [37], where a new data point is created using multiple existing data points.

An example of data augmentation can be seen in figure 6

Figure 6: Example of data augmentation
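A short sketch of the mixup idea [37] (illustrative NumPy; drawing the mixing coefficient from a Beta distribution follows the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Create a new training sample as a convex combination of two existing ones."""
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    x = lam * x1 + (1 - lam) * x2         # blend the images pixel-wise
    y = lam * y1 + (1 - lam) * y2         # blend the one-hot labels the same way
    return x, y

img_a, img_b = rng.random((32, 32, 3)), rng.random((32, 32, 3))
lbl_a, lbl_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed_img, mixed_lbl = mixup(img_a, lbl_a, img_b, lbl_b)
```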

Early stopping

One of the simplest methods to prevent overfitting of a network is to simply stop the training before overfitting has had a chance to occur. It comes with the disadvantage that the learning process is halted.

Number of parameters

Another simple way to prevent overfitting is to limit the number of parameters, typically by limiting the number of hidden units in each layer or limiting network depth. For convolutional networks, the filter size also affects the number of parameters. Limiting the number of parameters restricts the predictive power of the network directly, reducing the complexity of the function that it can perform on the data, and thus limits the amount of overfitting. This is equivalent to a "zero norm".

Weight decay

A simple form of added regularizer is weight decay, which simply adds an additional error, proportional to the sum of weights (L1 norm) or squared magnitude (L2 norm) of the weight vector, to the error at each node. The level of acceptable model complexity can be reduced by increasing the proportionality constant (the 'alpha' hyperparameter), thus increasing the penalty for large weight vectors.

L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to the multiplicative interactions between weights and inputs, this has the useful property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
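As a concrete sketch (illustrative NumPy, assuming a generic scalar objective), the L2 penalty adds alpha times the sum of squared weights to the loss, which shows up as an extra 2*alpha*w term in the gradient:

```python
import numpy as np

def loss_with_l2(w, data_loss, alpha=1e-4):
    """Total objective = data loss + alpha * sum of squared weights (L2 penalty)."""
    return data_loss + alpha * np.sum(w ** 2)

def grad_with_l2(w, data_grad, alpha=1e-4):
    """The penalty contributes 2*alpha*w to the gradient, shrinking weights each step."""
    return data_grad + 2 * alpha * w

w = np.array([3.0, -0.1, 2.5])
w -= 0.1 * grad_with_l2(w, data_grad=np.zeros_like(w))
print(w)  # large weights are pulled toward zero harder than small ones
```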

2.3 Image segmentation

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects (sets of pixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

An example of image segmentation is shown in figure 7

Figure 7: An example of image segmentation output

There are 2 classes of image segmentation techniques:

• Classical computer vision approach

• AI based techniques


2.3.1 Thresholding

The simplest method of image segmentation is called the thresholding method. This method is based on a clip-level (or a threshold value) to turn a gray-scale image into a binary image.

The key of this method is to select the threshold value (or values, when multiple levels are selected). Several popular methods are used in industry, including the maximum entropy method, balanced histogram thresholding, Otsu's method [22] (maximum variance), and k-means clustering.
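A minimal sketch of Otsu thresholding using OpenCV (assuming an 8-bit grayscale input; passing THRESH_OTSU makes OpenCV choose the clip level automatically):

```python
import cv2
import numpy as np

# Synthetic grayscale "document": dark text-like blob on a bright background.
img = np.full((64, 64), 220, dtype=np.uint8)
img[20:30, 10:50] = 40

# Otsu's method [22] picks the threshold maximizing between-class variance;
# the threshold argument (0) is ignored when THRESH_OTSU is set.
thresh_value, binary = cv2.threshold(img, 0, 255,
                                     cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(thresh_value)  # the automatically selected clip level
```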

2.3.2 K-means clustering

The K-means algorithm is an iterative technique that is used to partition an image into K clusters. The basic algorithm is:

• Pick K cluster centers, either randomly or based on some heuristic method, for example K-means++

• Assign each pixel in the image to the cluster that minimizes the distance between the pixel and the cluster center

• Re-compute the cluster centers by averaging all of the pixels in the cluster

• Repeat steps 2 and 3 until convergence is attained (i.e., no pixels change clusters)

In this case, distance is the squared or absolute difference between a pixel and a cluster center. The difference is typically based on pixel color, intensity, texture, and location, or a weighted combination of these factors. K can be selected manually, randomly, or by a heuristic. This algorithm is guaranteed to converge, but it may not return the optimal solution. The quality of the solution depends on the initial set of clusters and the value of K.
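A small sketch of the steps above applied to pixel colors (plain NumPy with random initialization; K-means++ initialization would be a drop-in improvement):

```python
import numpy as np

def kmeans_segment(image, k=3, iters=10, seed=0):
    """Cluster pixels by color and return a label map of shape (H, W)."""
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, image.shape[-1]).astype(float)    # (H*W, C)
    centers = pixels[rng.choice(len(pixels), k, replace=False)]  # step 1: pick K centers
    for _ in range(iters):
        # Step 2: assign each pixel to its nearest cluster center.
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned pixels.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return labels.reshape(image.shape[:2])

labels = kmeans_segment(np.random.rand(32, 32, 3), k=3)
```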

2.3.3 Trainable segmentation

Most of the aforementioned segmentation methods are based only on the color information of pixels in the image. Humans use much more knowledge when performing image segmentation, but implementing this knowledge would cost considerable human engineering and computational time, and would require a huge domain knowledge database which does not currently exist. Trainable segmentation methods, such as neural network segmentation, overcome these issues by modeling the domain knowledge from a dataset of labeled pixels.

2.4 Object detection

Object detection is a task which deals with detecting objects in any image or video frame. With the rise and superior results of deep learning, all state-of-the-art object detection methods today are built with deep learning approaches. They can be categorized into two main types: one-stage methods and two-stage methods. One-stage methods prioritize inference speed while two-stage methods prioritize detection accuracy.

2.5 Object detection - Metric

The main metric of object detection is mAP (mean average precision). To understand mAP, first we have to know about IOU, recall, precision, and average precision.

Intersection over union (IOU) measures how much the predicted region overlaps with the actual ground truth region, as shown in figure 8:

IoU = Area of the overlap region / Area of the union region    (1)

Precision is defined as the ratio of the number of predicted regions that are actually tables to the total number of predicted regions:

P = #tables in predicted regions / #predicted regions = TP / (TP + FP)    (2)

Recall is defined as the ratio of the number of predicted regions that are actually tables to the number of ground truth regions:

R = #tables in predicted regions / #ground truths = TP / (TP + FN)    (3)

Figure 8: Different IOUs; red bounding boxes are ground truths while the green ones are predictions

Average precision for a class:

AP = \frac{1}{n} \sum_k (\text{Recall}[k] - \text{Recall}[k-1]) \cdot \text{Precision}[k]    (4)

where n is the number of IOU thresholds, and Recall[k], Precision[k] are the recall and precision at IOU threshold iouThreshold[k] (iouThreshold[] = [0.5, 0.55, 0.60, ..., 0.90, 0.95]).

To decide whether a predicted region is a table given a ground truth, we compute their intersection over union (IOU); if the IOU is greater than a threshold, the predicted region is counted as a true positive.

Mean average precision is calculated by taking the average AP over all classes:

mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i    (5)

where N is the number of classes.

2.6 Faster R-CNN

2.6.1 RPN (Region Proposal Network)

Faster-RCNN uses a sub-network called RPN to extract the regions containing objects (RoI, Region of Interest), which differentiates it from its predecessors, RCNN and Fast-RCNN.

RCNN uses Selective Search as its region proposal extractor. The number of regions extracted is around 2000. The regions are then resized to the same size and passed through a pretrained CNN model, which localizes the offsets and object classes. But 2000 regions is a large number, making the model run very slowly (figure 9).

Fast-RCNN improves this by using a pretrained CNN to extract feature maps, then applying Selective Search on those feature maps instead of the original image, so the speed increased by a large margin. But because of Selective Search, model inference still takes too long (around 2 s/image) (figure 10).

With Faster-RCNN, instead of using Selective Search, a sub-network is used to extract regions, making it even faster; it is designed as an end-to-end trainable network.

The RPN applies one conv layer with 512 channels and kernel size (3, 3) on the feature map. It then splits into 2 branches: one for object classification and one for bounding box regression. Both use one conv layer with kernel size (1, 1) but with different numbers of output channels. The binary object classification branch has 2k output channels, where k is the number of anchors, to determine whether each anchor contains an object or is background. The bounding box regression branch has 4k output channels, where 4 represents the 4 offsets (x, y, w, h).

Because the input image size is not fixed, the RPN output size varies accordingly. For example, with an input image of size W×H×3 and a downsampling factor of 16, the RPN classification and bounding box outputs have sizes 18 × (W/16) × (H/16) and 36 × (W/16) × (H/16), respectively.


Figure 9: RCNN architecture

Figure 10: Faster RCNN architecture

Figure 11: RPN in Faster-RCNN

2.6.2 Anchors

What are anchors?

Anchors are pre-defined boxes, known before training the model. In Faster-RCNN, 9 anchors are defined for every pixel in the feature map, so the total number of anchors depends on the size of the feature map. For example, if the feature map after the backbone has size W×H×C (with C the number of channels of the feature map), then the total number of anchors will be W×H×9 (9 being the number of anchors per pixel). A small generation sketch follows the assignment rules below.

Anchors come in different sizes and ratios (figure 12).

Anchors are assigned as positive/negative (object/background) based on their IOU overlap with the ground truth bounding box, following these rules:


Figure 12: Anchors

• The anchor with the highest IOU with ground truth box will be positive

• Anchors with IOU ≥ 0.7 will be positive

• Anchors with IOU < 0.3 will be negative (background)

• Anchors with 0.3 ≤ IOU < 0.7 will be neutral and are not considered in model training
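Picking up the forward reference above, here is an illustrative sketch of generating the W×H×9 anchor grid (plain NumPy; the stride, scales, and ratios below are assumed values in the spirit of Faster-RCNN, not taken from the thesis):

```python
import numpy as np

def make_anchors(fm_h, fm_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate 9 anchors (3 scales x 3 ratios) centered on every feature-map pixel."""
    anchors = []
    for y in range(fm_h):
        for x in range(fm_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # center in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)    # same area, varying ratio
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

a = make_anchors(fm_h=38, fm_w=50)   # e.g. a 600x800 image downsampled by 16
print(a.shape)                       # (38 * 50 * 9, 4) = (17100, 4)
```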

The RoIs after the RPN step will contain overlapping regions, so a method called non-maximum suppression (NMS) is used to filter out those regions. The idea is simple:

• Let R be the set of RoIs after the RPN step with their corresponding confidence scores S, let N be an overlap threshold, and let D be an empty set.

• Take the RoI with the highest confidence score, remove it from R, and insert it into D.

• Compare this RoI with every RoI in R using IOU. If the IOU is greater than the overlap threshold N, remove that RoI from R.

• Repeat steps 2 and 3 until the set R is empty.

But NMS has its own weakness too. For example, with N = 0.5, some RoIs with IOU = 0.51 and very high confidence scores can still be removed from R. Vice versa, RoIs with IOU < 0.5 and low confidence scores are not removed from R, making the model appear worse.

Soft-NMS is proposed to solve this problem. Instead of removing RoIs that have a high overlap and a high confidence score, we decrease the confidence score based on the IOU:

s_i = \begin{cases} s_i, & \text{IOU}(M, b_i) < N \\ s_i (1 - \text{IOU}(M, b_i)), & \text{IOU}(M, b_i) \ge N \end{cases}
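A sketch of both procedures (illustrative Python; the iou helper is a hypothetical utility defined here so the snippet is self-contained):

```python
import numpy as np

def iou(a, b):
    """IOU of two (x1, y1, x2, y2) boxes (hypothetical helper for this sketch)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, n=0.5, soft=False):
    """Greedy (Soft-)NMS over RoIs, following the steps listed above."""
    boxes, scores, keep = list(boxes), list(scores), []
    while boxes:
        best = int(np.argmax(scores))          # pick the highest-confidence RoI
        m = boxes.pop(best)
        scores.pop(best)
        keep.append(m)
        if soft:
            # Soft-NMS: decay the scores of heavily overlapping RoIs instead
            scores = [s * (1 - iou(m, b)) if iou(m, b) >= n else s
                      for b, s in zip(boxes, scores)]
        else:
            # Plain NMS: drop every remaining RoI whose IOU with m exceeds n
            pairs = [(b, s) for b, s in zip(boxes, scores) if iou(m, b) < n]
            boxes = [b for b, _ in pairs]
            scores = [s for _, s in pairs]
    return keep

kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7])
print(kept)  # the second box overlaps the first too much and is suppressed
```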

2.6.3 RoI pooling

RoI pooling makes the output size of the feature map fixed. RoI pooling is a must, as the final layers of the model are 2 fully connected branches which require a fixed input size.

2.6.4 Detection model

After RoI pooling, we have output feature maps with a fixed size; they are flattened and passed through 2 fully connected layers (figure 13):

• Object classification with N+1 classes (N is the number of classes, +1 for the background)

• Bounding box regression to locate the RoI, with 4N outputs representing the 4 coordinates (x, y, w, h)


Figure 13: Detection model in Faster-RCNN

NMS is then applied as in the RPN step above.

2.6.5 Loss function

Faster-RCNN loss consists of 4 parts:

• RPN classification (object or background)

• RPN regression (anchor - region proposal)

• Fast-RCNN classification (N+1 classes)

• Fast-RCNN bounding box regression (region proposal - ground truth)

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)

where

• i is the index of an anchor in a mini-batch, pi is the probability for an anchor to be an object

• Lcls is the binary cross entropy for the question "does the anchor contain an object?" in the RPN, and the multi-class cross entropy in Faster-RCNN

• Lreg is the loss for bounding box regression using the Smooth L1 loss. Smooth L1 can be seen as a combination of the L1 and L2 losses:

\text{smooth}_{L1}(x) = \begin{cases} 0.5 x^2 / \alpha, & |x| \le \alpha \\ |x| - 0.5\alpha, & |x| > \alpha \end{cases}

2.7 YOLO model

The YOLO family of models has continued to evolve since the first initial release.

• YOLOv2 [25] made a number of iterative improvements on top of YOLO, including BatchNorm, higher resolution, and anchor boxes.

• YOLOv3 [26] built upon previous models by adding an objectness score to bounding box prediction, added connections to the backbone network layers, and made predictions at three separate levels of granularity to improve performance on smaller objects.

• YOLOv4 [7] introduced improvements like improved feature aggregation, a ”bag of freebies” (withaugmentations), mish activation, and more

• YOLOv5 [1] is the first model in the "YOLO family" not released with an accompanying paper, and it is under ongoing development. The Focus layer [?] introduced in this version evolved from the YOLOv3 structure. It helps reduce the required CUDA memory, reduce parameters, and increase forward and backward propagation speed.

• YOLOv7 [32] is the successor of YOLOv4; it incorporates the techniques from YOLOv4, YOLOv5, and "trainable bag-of-freebies", pushing the limit of object detection even further.

2.8 YOLO approach for object detection

The main idea of YOLO is to divide the image into an S × S grid. For each grid cell, there is a set of anchors, each of which predicts one object with the representation (x, y, width, height, class).

First, the image goes through a CNN to create an S × S feature map, called a grid. YOLO detects objects in each of the S × S cells. Each cell prediction contains B bounding boxes and probabilities for C classes. Each bounding box consists of 5 variables: center coordinates (x, y), width and height (w, h), and confidence. The confidence of a bounding box represents whether that bounding box contains any object. So for one cell, YOLO predicts a tensor with B × 5 + C elements, where B is the number of bounding boxes, 5 is the number of variables per bounding box, and C is the number of classes. For an S × S feature map, the shape of the output tensor from YOLO is S × S × (B × 5 + C).

2.9 YOLOv1

The YOLOv1 architecture is shown in figure 14.

Figure 14: YOLOv1 architecture

Its loss function consists of several parts:

L = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] + \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{noobj} (C_i - \hat{C}_i)^2 + \sum_{i=0}^{S^2} 1_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2

For every cell in the feature map and for every bounding box in that cell, the loss is calculated only if the cell contains an object; otherwise the loss is 0. The square root is used for the width and height of bounding boxes; the idea is that if the bounding box is small, the impact of a wrong regression is greater than for larger boxes.

For every bounding box among the B predicted bounding boxes of cell i, if bounding box j has the largest IOU with the ground truth bounding box, then 1_{ij}^{obj} = 1, else 0. 1_{ij}^{noobj} has the opposite value: 1_{ij}^{noobj} = 1 − 1_{ij}^{obj}.

Ĉ_i is the IOU of the predicted bounding box and the ground truth bounding box.

The number of no-object bounding boxes is large, so a hyper-parameter λ_{noobj} is added to balance the loss of the two parts.

2.10 YOLOv2

Anchor box. In YOLOv2, anchor boxes were used similarly to Faster-RCNN. The input image size was changed from 448 × 448 to 416 × 416 because the author wanted the final feature map size to be an odd number (with 448 × 448 the final feature map size would be 14 × 14). The idea is that images in the COCO dataset usually have an object at the center of the image, so having a center cell improves the chance that one of its anchor boxes detects the object. Using anchor boxes, the mAP of the model decreased but its recall increased, meaning that the model can detect more objects, but the quality of detection is worse.

In two-stage models (the R-CNN family), anchor boxes work well because the first stage also optimizes anchor box positions, while YOLO does not have that stage. So having good initial anchor boxes is very important for the model. YOLOv2 generates anchors through the k-means algorithm.

Also, YOLOv2 predicts the displacements of the anchor boxes t_x, t_y, t_w, t_h and an objectness score t_o, with t_x, t_y limited to the interval [0, 1]. This limits the center coordinates x, y of the bounding box when applying transformations on t_x, t_y, which means t_x, t_y in a grid cell cannot push the center of a bounding box outside that cell.

Figure 15: YOLOv2 architecture

2.11 YOLOv3

Architecture

Backbone. YOLOv3 uses a new backbone, called Darknet-53. YOLOv1's backbone used 1×1 convolutions (bottlenecks) from the Inception network, YOLOv2 added BatchNorm, and YOLOv3 applies skip-connections from ResNet, called Residual Blocks (figure 16).

Neck. In previous versions, detecting small objects was always a weak spot. Although YOLOv2 used skip connections from early layers to move information from bigger feature maps to later, smaller feature maps, it was not enough. YOLOv3 upgrades this: it uses a Feature Pyramid Network (FPN) and detects objects at 3 different scales (figure 17).

Other changes

Classification prediction. Previous YOLO models used a softmax in the classification output. From YOLOv3 on, the classification output uses a sigmoid, because some objects in some datasets are classified into 2 classes (person and woman, for example).

Bounding box prediction. Keeping the idea of anchor boxes found with k-means from YOLOv2, YOLOv3 makes its way of choosing bounding boxes explicit. In a grid cell of a feature map, YOLOv3 generates 9 anchor boxes (YOLOv2 used 5), with 3 anchor boxes per scale.

2.12 YOLOv4

PAN

PAN (Path Aggregation Network) is a variation of FPN (Feature Pyramid Network). In FPN, a branch is created for information to flow from deep layers to shallow layers. PAN adds another branch to bring information from shallow layers back to deep layers (figure 21).

Figure 16: YOLOv3 backbone

SPP

SPP (Spatial Pyramid Pooling) is a special block at the end of the backbone. It outputs 4 feature maps with the same H × W shape (the same shape as the backbone output). They are then concatenated together (figure 22).

Remove grid sensitivity

YOLOv4 uses a new formula to calculate the bounding box position from the prediction (t_x, t_y, t_w, t_h):

b_x = 1.1 \, σ(t_x) − 0.05 + c_x
b_y = 1.1 \, σ(t_y) − 0.05 + c_y
b_w = p_w e^{t_w}
b_h = p_h e^{t_h}

Using multiple anchors for one ground truth bounding box

In YOLOv3, only the anchor with the highest IOU with the ground truth is chosen as the positive anchor. Anchors whose IOU with the ground truth is smaller than a threshold (0.5, for example) are considered negative anchors. The others are not included in the model's loss; these are called neutral anchors.


Figure 17: YOLOv3 architecture

Figure 18: Darknet53 vs CSPDarknet53

But in YOLOv4, these neutral anchors are considered positive and participate in the loss calculation.

Label smoothing


Figure 19: CSPResBlock

Figure 20: Left: Sample image, Center: DropOut, Right: DropBlock

Figure 21: PAN structure

Label smoothing is a regularization technique that introduces noise into the labels. This accounts for the fact that datasets may have mistakes in them, so maximizing the likelihood of log p(y|x) directly can be harmful. Assume for a small constant ε that the training set label y is correct with probability 1 − ε and incorrect otherwise. Label smoothing regularizes a model based on a softmax output by replacing the hard 0 and 1 classification targets with targets of ε/(k − 1) and 1 − ε, respectively.
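A small sketch of this target construction (illustrative NumPy; k is the number of classes):

```python
import numpy as np

def smooth_labels(true_class, k, eps=0.1):
    """Replace the hard one-hot target with a smoothed distribution."""
    target = np.full(k, eps / (k - 1))   # every wrong class gets eps/(k-1)
    target[true_class] = 1.0 - eps       # the correct class gets 1 - eps
    return target

print(smooth_labels(true_class=2, k=5))  # [0.025 0.025 0.9 0.025 0.025]
```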


References

[5] Abdelrahman Abdallah, Alexander Berendeyev, Islam Nuradin, and Daniyar Nurseitov. Tncr: Table net detection and classification dataset. Neurocomputing, 473:79–97, 2022.

[6] Teppi Aly, In Na, and Soo Kim. Page segmentation using minimum homogeneity algorithm and adaptive mathematical morphology. International Journal on Document Analysis and Recognition (IJDAR), 19, 09 2016.

[7] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.

[8] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.

[9] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13733–13742, 2021.

[10] Mark Everingham, Luc Van Gool, Christopher Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88:303–338, 06 2010.

[11] Liangcai Gao, Yilun Huang, Hervé Déjean, Jean-Luc Meunier, Qinqin Yan, Yu Fang, Florian Kleber, and Eva Maria Lang. Icdar 2019 competition on table detection and recognition (ctdar). 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1510–1515, 2019.

[12] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 303–312, 2021.

[13] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.

[14] Azka Gilani, Shah Rukh Qasim, Imran Malik, and Faisal Shafait. Table detection using deep learning. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 771–776, 2017.

[15] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.

[16] Gaurav Harit and Anukriti Bansal. Table detection in document images using header and trailer patterns. 12 2012.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[18] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019.

[19] Thomas G. Kieninger. Table structure recognition based on robust block segmentation. In Daniel P. Lopresti and Jiangying Zhou, editors, Document Recognition V, volume 3305, pages 22–32. International Society for Optics and Photonics, SPIE, 1998.

[20] Youngwan Lee, Joong-won Hwang, Sangrok Lee, Yuseok Bae, and Jongyoul Park. An energy and gpu-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.

[21] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. Tablebank: A benchmark dataset for table detection and recognition. arXiv preprint arXiv:1903.01949, 2019.

[22] Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.

[23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 572–573, 2020.

[29] Prof. Dr. Faisal Shafait. Table ground truth for the uw3 and unlv datasets (dfki-tgt-2010). http://tc11.cvc.uab.es/datasets/DFKI-TGT-2010_1
