
Developing a pipeline for table extraction in document images


DOCUMENT INFORMATION

Title: Developing a pipeline for table extraction in document images
Author: Lu Anh Khoa
Supervisor: Trần Tuấn Anh
Institution: Vietnam National University, Ho Chi Minh City - Ho Chi Minh University of Technology
Major: Computer Science
Document type: Graduation Thesis
Year: 2023
City: Ho Chi Minh City
Pages: 56
Size: 3.7 MB


GRADUATION THESIS

DEVELOPING A PIPELINE FOR TABLE EXTRACTION IN DOCUMENT IMAGES

MAJOR: COMPUTER SCIENCE

Mr. Nguyen Nam Quan
Dr. Nguyen Tien Thinh

HO CHI MINH CITY, 02/2023

UNIVERSITY OF TECHNOLOGY
FACULTY: Computer Science & Engineering
DEPARTMENT: Computer Science

GRADUATION THESIS TASK SHEET

(Note: the student must attach this sheet to the first page of the thesis report.)

FULL NAME: Lu Anh Khoa
STUDENT ID: 1852112
MAJOR: Computer Science

1. Thesis title: Xây dựng hệ thống nhận dạng bảng trong ảnh tài liệu (Developing a pipeline for table extraction in the document image)

2. Tasks (requirements on content and initial data): Table extraction is one of the most critical components of document images; we can see it almost everywhere, in every report and document. It is also a topic of great concern today in digital transformation. Table recognition and comprehension require many techniques and much research, and they are still a massive challenge for scientists and industrial applications. This thesis focuses on studying the problem of table information extraction within a whole document image processing system. The main tasks in this research are:
- Build a comprehensive pipeline for table extraction in the document image
- Develop a model for detecting table regions in the document image
- Develop a classification model to classify the detected table into a borderless table and a bordered table for extracting the table's cells
- Benchmark the models with public/private datasets

3. Date of thesis assignment: 20/12/2021

4. Date of completion: 20/12/2022

5. Supervisors and their parts of the guidance:
1) Trần Tuấn Anh - guidance on the development direction, integration, testing, and evaluation
2) Nguyễn Nam Quân - guidance on model and technology development, data collection
3) Nguyễn Tiến Thịnh - guidance on the content, presentation, layout, and research methodology

The content and requirements of the thesis have been approved by the Department.

HEAD OF DEPARTMENT (sign and write full name)
MAIN SUPERVISOR (sign and write full name)

FOR THE FACULTY/DEPARTMENT:
Reviewer (preliminary grading):

December 26, 2022

THESIS DEFENSE GRADING SHEET
(For the supervisor/reviewer)

1. Student's full name: Lu Anh Khoa
2. Topic: Developing a pipeline for table extraction in the document image (Xây dựng hệ thống nhận dạng bảng trong ảnh tài liệu)
3. Supervisor/reviewer's full name: Trần Tuấn Anh
4. Overview of the report: number of references; computing software; products:
5. Overview of the drawings: number of drawings; A1 size; A2 size; other sizes; number of hand-drawn drawings; number of computer-drawn drawings:
6. Main strengths of the thesis:
- This thesis presents a pipeline for table extraction in the document image. The student also develops a model for detecting table regions in the document image and a classification model to classify the detected table into a borderless table and a bordered table for extracting the table's cells.
- The presented pipeline is also used in many applications in an industrial environment.
- This thesis has researched many different methods and approaches and made appropriate assessments and analyses.
- The thesis has also compiled some practical data and put it into the evaluation; the construction of diverse data is very necessary.
- The student has done good experiments, evaluations, and demos.
7. Main shortcomings of the thesis:
- This thesis is inclined towards application, so the direction of research development has not been made clear.
- The overview of related works has not been fully detailed, nor is the assessment comprehensive.
8. Recommendation: Approved for defense [x]; Needs additions before defense [ ]; Not approved for defense [ ]
9. Three questions the student must answer before the committee:
- Make clear the future works, including research/pipeline development.
10. Overall assessment (in words: excellent, good, average): Excellent (Giỏi). Score: 9.3/10

Signature (full name): Trần Tuấn Anh

UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

December 28, 2022

THESIS DEFENSE GRADING SHEET
(For the reviewer)

1. Student's full name: Lu Anh Khoa
Student ID: 1852112
Major: Computer Science
2. Topic: Developing a pipeline for table extraction in document images
3. Reviewer's full name: Dr. Lê Thành Sách
4. Overview of the report: number of references; computing software; products:
5. Overview of the drawings: number of drawings; A1 size; A2 size; other sizes; number of hand-drawn drawings; number of computer-drawn drawings:
6. Main strengths of the thesis:
- The author has a strong background in deep learning and its applications for computer vision.
- The author has proposed a pipeline for table extraction in document images and focused on (implemented) two main tasks inside: table detection and table classification.
- Table detection consists of two steps: detecting tables using a YOLOv7 model and post-processing the detected tables with morphology and connected-component analysis.
- Table classification: classifying extracted tables into two classes, bordered vs. borderless, using a MobilenetV3 (Large) model.
- The proposed method can produce accurate results and can be compared with other methods in table extraction.
7. Main shortcomings of the thesis:
- The thesis needs to be re-written:
  - to add a survey of related works in the research fields (table detection and classification)
  - to add a detailed explanation of the proposed method and its results (quantitative and qualitative)
  - to add an evaluation of the processing time
8. Recommendation: Approved for defense [x]; Needs additions before defense [ ]; Not approved for defense [ ]
9. Three questions the student must answer before the committee:
10. Overall assessment (in words: excellent, good, average): Excellent (Giỏi). Score: 9/10

Signature (full name): Lê Thành Sách


DECLARATION OF AUTHORSHIP

I hereby declare that this thesis was carried out by myself under the guidance and supervision of Dr. Tran Tuan Anh, Dr. Nguyen Tien Thinh, and Mr. Nguyen Nam Quan; the work contained in it and its results are true to its author and have not violated research ethics. The data and figures presented in this thesis, used for analysis, comments, and evaluations, come from various resources through my own work and have been fully acknowledged in the reference part.

In addition, comments, reviews, and data used from other authors and organizations have been acknowledged and explicitly cited. I will take full responsibility for any fraud detected in my thesis.

HO CHI MINH CITY, Dec 2022

Author

ACKNOWLEDGMENTS

I would like to acknowledge and give my warmest thanks to my advisors, Dr. Tran Tuan Anh, Dr. Nguyen Tien Thinh, and Mr. Nguyen Nam Quan, who made this work possible. They are the ones who built the first bricks of my scientific career.

Besides, I would also like to acknowledge all of the instructors of Ho Chi Minh University of Technology, who have given me motivation, encouragement, and precious knowledge during the long road of my university life.

Last but not least, I would like to thank my family, who are always there and support me throughout my life.

ABSTRACT

The "Developing a pipeline for table extraction in document images" research topic aims to develop a system to extract tabular regions from scanned/captured document images (invoices, reports, research papers, ...) with high accuracy and reasonable response time. This thesis proposes a pipeline consisting of several steps to detect, classify, and extract data from tabular regions.

Contents

1 Introduction 7
1.1 The need for table extraction 7
1.2 The goal 7
2 Background knowledge 7
2.1 Convolution and cross-correlation in image processing 7
2.2 Convolution neural network (CNN) 8
2.2.1 Building blocks 8
2.2.2 Hyperparameters 10
2.2.3 Regularization methods 11
2.3 Image segmentation 13
2.3.1 Thresholding 14
2.3.2 K-means clustering 14
2.3.3 Trainable segmentation 14
2.4 Object detection 14
2.5 Object detection - Metric 14
2.6 Faster R-CNN 15
2.6.1 RPN (Region Proposal Network) 15
2.6.2 Anchors 16
2.6.3 RoI pooling 17
2.6.4 Detection model 17
2.6.5 Loss function 18
2.7 YOLO model 18
2.8 YOLO approach for object detection 19
2.9 YOLOv1 19
2.10 YOLOv2 20
2.11 YOLOv3 21
2.12 YOLOv4 21
2.13 YOLOv5 26
2.14 YOLOv7 architecture 27
2.14.1 CSP-ize a block 28
2.14.2 Backbone 28
2.14.3 Neck 30
2.14.4 Head 31
2.15 YOLOv7 - training techniques 32
2.15.1 Label assignment: Simple Optimal Transport Assignment (SimOTA) 32
2.15.2 Augmentation: Mosaic 34
2.15.3 Augmentation: Mixup 34
2.16 YOLOv7 - Loss function 34
2.17 Image classification 35
2.18 Image classification - Metric 35
2.19 MobilenetV3 36
2.19.1 Depthwise Separable Convolutions 36
2.19.2 Inverted residual block 37
2.19.3 Squeeze and excite (SE) 37
3 Related works 38
3.1 Traditional methods 38
3.2 Convolutional Neural Networks (CNN) 38
4 Proposed method 39
4.1 Pipeline 39
4.2 Table detection 39
4.3 Postprocess for table detection 39
4.3.1 Why? 39
4.3.2 How? 39
4.3.3 Result 40
4.4 Table classification 41
4.5 Table structured recognition 41
4.6 Structure of a table 42
4.7 Bordered table - just use some image processing 42
4.7.1 Result 43
4.8 Putting everything together 44
5 Experiments 44
5.1 Datasets 44
5.2 Training process 45
5.3 Quantitative results 45
5.3.1 Table detection - mAP 45
5.3.2 Table detection - Weighted Average F1 45
5.3.3 Table classification - Model comparisons 46
5.3.4 Speed 46
5.4 Qualitative results 47
6 Conclusion 48
6.1 Achievements 48
6.2 Limitation 48
6.3 Future works 49

List of Figures

1 A typical CNN 9

2 Maxpool 2x2: an example of pooling layer 9

3 Some activation function 10

4 Example of Dilated Convolution 11

5 Dropout visualization 12

6 Example of data augmentation 12

7 An example of image segmentation output 13

8 Different IOUs; red bounding boxes are ground truths while the green ones are predictions 15

9 RCNN architecture 16

10 Faster RCNN architecture 16

11 RPN in Faster-RCNN 16

12 Anchors 17

13 Detection model in Faster-RCNN 18

14 YOLOv1 architecture 19

15 YOLOv2 architecture 21

16 YOLOv3 backbone 22

17 YOLOv3 architecture 23

18 Darknet53 vs CSPDarknet53 23

19 CSPResBlock 24

20 Left: Sample image, Center: DropOut, Right: DropBlock 24

21 PAN structure 24

22 SPP in YOLOv4 25

23 Hard label vs Smooth label 25

24 Cosine Annealing Learning rate 25

25 Ground truth: green. Prediction: black. Using L1 yields 9.07 for all 3 cases but their IOUs differ by a large margin. IOU is also used to evaluate an object detection model, so using IOU as a loss is a logical improvement 26

26 YOLOv5 architecture 27

27 Yolov7 architecture 27

28 CSP-ized a block 28

29 ELAN block 29

30 Transition block 29

31 Yolov7 backbone 30

32 CSP-OSA block 30

33 RepConv block 31

34 Implicit knowledge 32

35 YOLOv7 head, with implicit knowledge 32

36 An example of mosaic augmentation: 4 images are "merged" together to create a new sample. Red bounding boxes annotate labels 34

37 Mixup augmentation example 34

38 Depthwise Convolution, visualized 36

39 Left: Normal convolution. Right: Depthwise separable convolution 37

40 Squeeze and excitation block 37

41 CascadeTabNet architecture from [23] 39

42 Overview 39

43 Before (left) and after (right). Notice that on the left the table does not have enough outer border lines 40

44 Examples of bordered tables 41

45 Examples of borderless tables 41

46 Original bordered table 43

47 After cell indexing, each cell has the form (start row, start col, end row, end col) 44

48 Excel result of figure 46. Note that Tesseract cannot read the text in some cells 44

49 Correct detection on publaynet dataset 47

50 Correct detection on fintabnet dataset 48

51 Wrong cases: partial detection (top), missed tables (bottom). Green boxes are ground truth while red boxes are predictions 49

List of Tables

1 Data statistics for table detection 45

2 Data statistics for table classification 45

3 Result on the test set of each dataset, table detection 45

4 Comparison with the ICDAR19 Competition on Table Detection and Recognition, track A2, with previous participants. Scores of other teams are taken directly from [11] 46

5 Comparisons between different classification models 46

6 Speed measurements 47


1 Introduction

1.1 The need for table extraction

With the trend of digital conversion, the amount of document images has increased exponentially. To automate the process of extracting information from those images, many methods have been proposed for different types of information arrangements. Besides text, tables are one of the most used methods to arrange information in documents. Their purpose is to group information related to a topic together to help the reader compare and retrieve information faster. However, because of their complex and diverse styles, it is hard to parse tabular data from document images into a well-structured, machine-readable format.

Document types that contain tables as one of the main elements include invoices, financial reports, and forms. To understand these types of documents effectively, a table extraction tool for images is necessary. That is the motivation for this thesis.

1.2 The goal

The goal of this thesis is to create a pipeline consisting of deep learning models and image processing techniques to extract tables from an input document image.

There are 4 sub-goals as follows:

• Build a comprehensive pipeline for table extraction in the document image (table extraction pipeline)

• Develop a model for detecting table regions in the document image (table detection)

• Develop a classification model to classify the detected table into a borderless table and a bordered table for extracting the table's cells (table classification)

• Benchmark the models with public/private datasets

Some constraints on tables:

• Captions and table names do not count as parts of a table (usually these elements are text and are very close to tables)

• The table types consist of bordered tables and borderless tables (defined below)

• Table size ranges from small (2 to 5 rows and/or columns, often seen in research papers) to large (many rows, occupying a large area of the image, around 80%)

• The cells, texts and lines may contain colors rather than black and white

For the input image:

• Input images must come from scanned documents or exported PDFs (different from documents captured with mobile devices)

• Image quality must be acceptable (bad image quality includes blurred, distorted, or noisy images, ...)

2 Background knowledge

2.1 Convolution and cross-correlation in image processing

In image processing, convolution is the process of transforming an image by applying a kernel over each pixel and its local neighbors across the entire image. The kernel is a matrix of values whose size and values determine the transformation effect of the convolution process.

Mathematically, the convolution between an image and a kernel can be written as:

g(x, y) = \omega * f(x, y) = \sum_{dx=-a}^{a} \sum_{dy=-b}^{b} \omega(dx, dy) \, f(x - dx, y - dy)

where g(x, y) is the filtered image, f(x, y) is the original image, and ω is the filter kernel. Every element of the filter kernel is indexed by −a ≤ dx ≤ a and −b ≤ dy ≤ b.

The difference between convolution and cross-correlation can be seen by comparing the formula above with cross-correlation, which does not flip the kernel:

(\omega \star f)(x, y) = \sum_{dx=-a}^{a} \sum_{dy=-b}^{b} \omega(dx, dy) \, f(x + dx, y + dy)

A convolution can thus be seen as a cross-correlation with the kernel rotated by 180°.

With carefully hand-crafted kernels, blurring, sharpening, and edge detection are a few of the image processing effects achievable with convolutions.
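As a minimal sketch of the two operations above (plain NumPy, written for illustration rather than taken from the thesis), note that the only difference is whether the kernel is flipped before the sliding window is applied:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image and sum elementwise products ('valid' mode)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Convolution = cross-correlation with the kernel rotated by 180 degrees."""
    return cross_correlate2d(image, np.flip(kernel))

# A classic hand-crafted kernel: the 3x3 Sobel edge detector.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
edges = convolve2d(np.random.rand(8, 8), sobel_x)
```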

2.2 Convolution neural network (CNN)

Instead of relying on hand-crafted kernels to produce the desired output, the goal of a CNN is to learn the kernel parameters themselves.

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of artificial neural network (ANN), most commonly applied to analyze visual imagery. CNNs are also known as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN), based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps. Counter-intuitively, most convolutional neural networks are not invariant to translation, due to the downsampling operations they apply to the input. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain-computer interfaces, and financial time series.

CNNs are regularized versions of multilayer perceptrons. A multilayer perceptron usually means a fully connected network, that is, one where each neuron in one layer is connected to all neurons in the next layer. The "full connectivity" of these networks makes them prone to overfitting data. Typical ways of regularization, or preventing overfitting, include penalizing parameters during training (such as weight decay) or trimming connectivity (skipped connections, dropout, etc.). CNNs take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble patterns of increasing complexity using smaller and simpler patterns embossed in their filters. Therefore, on a scale of connectivity and complexity, CNNs are on the lower extreme.

Convolutional networks were inspired by biological processes, in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are hand-engineered. This independence from prior knowledge and human intervention in feature extraction is a major advantage.

A typical CNN can be seen in figure 1

2.2.1 Building blocks

Convolutional layer

The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input, producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.


Figure 1: A typical CNN

Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as the output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.

Pooling layer

Another important concept of CNNs is pooling, which is a form of non-linear down-sampling. There are several non-linear functions to implement pooling, of which max pooling is the most common. It partitions the input image into a set of rectangles and, for each such sub-region, outputs the maximum.

Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint, and amount of computation in the network, and hence to also control overfitting. This is known as down-sampling. It is common to periodically insert a pooling layer between successive convolutional layers (each one typically followed by an activation function, such as a ReLU layer) in a CNN architecture. While pooling layers contribute to local translation invariance, they do not provide global translation invariance in a CNN, unless a form of global pooling is used. The pooling layer commonly operates independently on every depth, or slice, of the input and resizes it spatially. A very common form of max pooling is a layer with filters of size 2×2, applied with a stride of 2, which subsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations:

f_{X,Y}(S) = \max_{a,b \in \{0,1\}} S_{2X+a,\, 2Y+b}

A visualization can be seen in figure 2

Figure 2: Maxpool 2x2: an example of pooling layer
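A minimal NumPy sketch of the 2×2, stride-2 max pooling just described (illustrative only, assuming even input height and width):

```python
import numpy as np

def maxpool2x2(x):
    """2x2 max pooling with stride 2: keep the maximum of each 2x2 block."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)   # group pixels into 2x2 tiles
    return blocks.max(axis=(1, 3))             # f_{X,Y}(S) = max_{a,b} S_{2X+a, 2Y+b}

x = np.arange(16, dtype=float).reshape(4, 4)
print(maxpool2x2(x))  # [[ 5.  7.] [13. 15.]] -- 75% of activations discarded
```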


Other functions can also be used to increase nonlinearity, for example the saturating hyperbolic tangent f(x) = tanh(x) and the sigmoid function σ(x) = (1 + e^{−x})^{−1}. ReLU is often preferred to other functions because it trains the neural network several times faster without a significant penalty to generalization accuracy.

Some commonly used activation functions are shown in figure 3.

Figure 3: Some activation function

Fully connected layer

After several convolutional and max pooling layers, the final classification is done via fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer, as seen in regular (non-convolutional) artificial neural networks. Their activations can thus be computed as an affine transformation: a matrix multiplication followed by a bias offset (vector addition of a learned or fixed bias term).

2.2.2 Hyperparameters

Number of feature maps

Since feature map size decreases with depth, layers near the input layer tend to have fewer filters while higher layers can have more. To equalize computation at each layer, the product of feature values with pixel positions is kept roughly constant across layers. Preserving more information about the input would require keeping the total number of activations (number of feature maps times number of pixel positions) non-decreasing from one layer to the next.

The number of feature maps directly controls the capacity and depends on the number of available examples and the task complexity.

Filter size

Common filter sizes found in the literature vary greatly and are usually chosen based on the data set. The challenge is to find the right level of granularity so as to create abstractions at the proper scale, given a particular data set, and without overfitting.

Pooling type and size

Max pooling is typically used, often with a 2×2 dimension. This implies that the input is drastically downsampled, reducing processing cost.

Large input volumes may warrant 4×4 pooling in the lower layers. However, greater pooling reduces the dimension of the signal and may result in unacceptable information loss. Often, non-overlapping pooling windows perform best.

Dilation

Dilation involves ignoring pixels within a kernel. This reduces processing and memory costs, potentially without significant signal loss. A dilation of 2 on a 3×3 kernel expands the kernel to 5×5 while still processing 9 (evenly spaced) pixels. Accordingly, a dilation of 4 expands the kernel to 9×9.

Figure 4: Example of Dilated Convolution

2.2.3 Regularization methods

Dropout

Because a fully connected layer occupies most of the parameters, it is prone to overfitting. One method to reduce overfitting is dropout. At each training stage, individual nodes are either "dropped out" of the net (ignored) with probability 1−p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed. Only the reduced network is trained on the data in that stage. The removed nodes are then reinserted into the network with their original weights.

In the training stages, p is usually 0.5; for input nodes, it is typically much higher because information is directly lost when input nodes are ignored.

At testing time after training has finished, we would ideally like to find a sample average of all possible 2^n dropped-out networks; unfortunately this is unfeasible for large values of n. However, we can find an approximation by using the full network with each node's output weighted by a factor of p, so the expected value of the output of any node is the same as in the training stages. This is the biggest contribution of the dropout method: although it effectively generates 2^n neural nets, and as such allows for model combination, at test time only a single network needs to be tested.

By avoiding training all nodes on all training data, dropout decreases overfitting. The method also significantly improves training speed. This makes model combination practical, even for deep neural networks. The technique seems to reduce node interactions, leading them to learn more robust features that better generalize to new data.

A visualization is shown in figure 5.

Figure 5: Dropout visualization
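The following is a small illustrative NumPy sketch of this scheme (classic dropout with test-time scaling by p; written for this text, not taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p=0.5, train=True):
    """Keep each node with probability p during training; scale by p at test time."""
    if train:
        mask = rng.random(activations.shape) < p   # True = node is kept
        return activations * mask                  # dropped nodes output 0
    # Test time: approximate the average over all 2^n thinned networks
    # by weighting every node's output by its keep probability p.
    return activations * p

h = rng.standard_normal(5)
print(dropout_forward(h, p=0.5, train=True))   # roughly half the nodes zeroed
print(dropout_forward(h, p=0.5, train=False))  # expected value matches training
```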

Artificial data / Data augmentation

Because the degree of model overfitting is determined by both its power and the amount of training it receives, providing a convolutional network with more training examples can reduce overfitting. Because these networks are usually trained with all available data, one approach is to either generate new data from scratch (if possible) or perturb existing data to create new samples. Data augmentation can range from simple image processing, like changing the hue, scale, or rotation angle of an existing image, to more modern techniques like mixup [37], where a new data point is created using multiple existing data points.

An example of data augmentation can be seen in figure 6

Figure 6: Example of data augmentation
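A short sketch of the mixup idea [37] (illustrative NumPy; drawing the mixing coefficient from a Beta distribution follows the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Create a new training sample as a convex combination of two existing ones."""
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    x = lam * x1 + (1 - lam) * x2         # blend the images pixel-wise
    y = lam * y1 + (1 - lam) * y2         # blend the one-hot labels the same way
    return x, y

img_a, img_b = rng.random((32, 32, 3)), rng.random((32, 32, 3))
lbl_a, lbl_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed_img, mixed_lbl = mixup(img_a, lbl_a, img_b, lbl_b)
```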

Early stopping

One of the simplest methods to prevent overfitting of a network is to simply stop the training before overfitting has had a chance to occur. It comes with the disadvantage that the learning process is halted.

Number of parameters

Another simple way to prevent overfitting is to limit the number of parameters, typically by limiting the number of hidden units in each layer or limiting network depth. For convolutional networks, the filter size also affects the number of parameters. Limiting the number of parameters restricts the predictive power of the network directly, reducing the complexity of the function that it can perform on the data, and thus limits the amount of overfitting. This is equivalent to a "zero norm".

Weight decay

A simple form of added regularizer is weight decay, which simply adds an additional error, proportional to the sum of weights (L1 norm) or squared magnitude (L2 norm) of the weight vector, to the error at each node. The level of acceptable model complexity can be reduced by increasing the proportionality constant (the 'alpha' hyperparameter), thus increasing the penalty for large weight vectors.

L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to the multiplicative interactions between weights and inputs, this has the useful property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
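As a concrete sketch (illustrative NumPy, assuming a generic scalar objective), the L2 penalty adds alpha times the sum of squared weights to the loss, which shows up as an extra 2*alpha*w term in the gradient:

```python
import numpy as np

def loss_with_l2(w, data_loss, alpha=1e-4):
    """Total objective = data loss + alpha * sum of squared weights (L2 penalty)."""
    return data_loss + alpha * np.sum(w ** 2)

def grad_with_l2(w, data_grad, alpha=1e-4):
    """The penalty contributes 2*alpha*w to the gradient, shrinking weights each step."""
    return data_grad + 2 * alpha * w

w = np.array([3.0, -0.1, 2.5])
w -= 0.1 * grad_with_l2(w, data_grad=np.zeros_like(w))
print(w)  # large weights are pulled toward zero harder than small ones
```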

2.3 Image segmentation

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects (sets of pixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

An example of image segmentation is shown in figure 7

Figure 7: An example of image segmentation output

There are 2 classes of image segmentation techniques:

• Classical computer vision approach

• AI based techniques


2.3.1 Thresholding

The simplest method of image segmentation is called the thresholding method. This method is based on a clip-level (or a threshold value) to turn a gray-scale image into a binary image.

The key of this method is to select the threshold value (or values, when multiple levels are selected). Several popular methods are used in industry, including the maximum entropy method, balanced histogram thresholding, Otsu's method [22] (maximum variance), and k-means clustering.
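A minimal sketch of Otsu thresholding using OpenCV (assuming an 8-bit grayscale input; passing THRESH_OTSU makes OpenCV choose the clip level automatically):

```python
import cv2
import numpy as np

# Synthetic grayscale "document": dark text-like blob on a bright background.
img = np.full((64, 64), 220, dtype=np.uint8)
img[20:30, 10:50] = 40

# Otsu's method [22] picks the threshold maximizing between-class variance;
# the threshold argument (0) is ignored when THRESH_OTSU is set.
thresh_value, binary = cv2.threshold(img, 0, 255,
                                     cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(thresh_value)  # the automatically selected clip level
```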

2.3.2 K-means clustering

The K-means algorithm is an iterative technique that is used to partition an image into K clusters. The basic algorithm is:

• Pick K cluster centers, either randomly or based on some heuristic method, for example K-means++

• Assign each pixel in the image to the cluster that minimizes the distance between the pixel and the cluster center

• Re-compute the cluster centers by averaging all of the pixels in the cluster

• Repeat steps 2 and 3 until convergence is attained (i.e., no pixels change clusters)

In this case, distance is the squared or absolute difference between a pixel and a cluster center. The difference is typically based on pixel color, intensity, texture, and location, or a weighted combination of these factors. K can be selected manually, randomly, or by a heuristic. This algorithm is guaranteed to converge, but it may not return the optimal solution. The quality of the solution depends on the initial set of clusters and the value of K.
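A small sketch of the steps above applied to pixel colors (plain NumPy with random initialization; K-means++ initialization would be a drop-in improvement):

```python
import numpy as np

def kmeans_segment(image, k=3, iters=10, seed=0):
    """Cluster pixels by color and return a label map of shape (H, W)."""
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, image.shape[-1]).astype(float)    # (H*W, C)
    centers = pixels[rng.choice(len(pixels), k, replace=False)]  # step 1: pick K centers
    for _ in range(iters):
        # Step 2: assign each pixel to its nearest cluster center.
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned pixels.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return labels.reshape(image.shape[:2])

labels = kmeans_segment(np.random.rand(32, 32, 3), k=3)
```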

2.3.3 Trainable segmentation

Most of the aforementioned segmentation methods are based only on the color information of pixels in the image. Humans use much more knowledge when performing image segmentation, but implementing this knowledge would cost considerable human engineering and computational time, and would require a huge domain knowledge database which does not currently exist. Trainable segmentation methods, such as neural network segmentation, overcome these issues by modeling the domain knowledge from a dataset of labeled pixels.

2.4 Object detection

Object detection is a task which deals with detecting objects in any image or video frame. With the rise and superior results of deep learning, all state-of-the-art object detection methods today are built with deep learning approaches. They can be categorized into two main types: one-stage methods and two-stage methods. One-stage methods prioritize inference speed while two-stage methods prioritize detection accuracy.

2.5 Object detection - Metric

The main metric of object detection is mAP (mean average precision). To understand mAP, first we have to know about IOU, recall, precision, and average precision.

Intersection over union (IOU) measures how much the predicted region overlaps with the actual ground truth region, as shown in figure 8:

IoU = Area of the overlap region / Area of the union region    (1)

Precision is defined as the ratio of the number of predicted regions that are actually tables to the total number of predicted regions:

P = #tables in predicted regions / #predicted regions = TP / (TP + FP)    (2)

Recall is defined as the ratio of the number of predicted regions that are actually tables to the number of ground truth regions:

R = #tables in predicted regions / #ground truths = TP / (TP + FN)    (3)

Figure 8: Different IOUs; red bounding boxes are ground truths while the green ones are predictions

Average precision for a class:

AP = \frac{1}{n} \sum_k (\text{Recall}[k] - \text{Recall}[k-1]) \cdot \text{Precision}[k]    (4)

where n is the number of IOU thresholds, and Recall[k], Precision[k] are the recall and precision at IOU threshold iouThreshold[k] (iouThreshold[] = [0.5, 0.55, 0.60, ..., 0.90, 0.95]).

To decide whether a predicted region is a table given a ground truth, we compute their intersection over union (IOU); if the IOU is greater than a threshold, the predicted region is counted as a true positive.

Mean average precision is calculated by taking the average AP over all classes:

mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i    (5)

where N is the number of classes.

2.6 Faster R-CNN

2.6.1 RPN (Region Proposal Network)

Faster-RCNN uses a sub-network called RPN to extract the regions containing objects (RoI, Region of Interest), which differentiates it from its predecessors, RCNN and Fast-RCNN.

RCNN uses Selective Search as its region proposal extractor. The number of regions extracted is around 2000. The regions are then resized to the same size and passed through a pretrained CNN model, which localizes the offsets and object classes. But 2000 regions is a large number, making the model run very slowly (figure 9).

Fast-RCNN improves this by using a pretrained CNN to extract feature maps, then applying Selective Search on those feature maps instead of the original image, so the speed increased by a large margin. But because of Selective Search, model inference still takes too long (around 2 s/image) (figure 10).

With Faster-RCNN, instead of using Selective Search, a sub-network is used to extract regions, making it even faster; it is designed as an end-to-end trainable network.

The RPN applies one conv layer with 512 channels and kernel size (3, 3) on the feature map. It then splits into 2 branches: one for object classification and one for bounding box regression. Both use one conv layer with kernel size (1, 1) but with different numbers of output channels. The binary object classification branch has 2k output channels, where k is the number of anchors, to determine whether each anchor contains an object or is background. The bounding box regression branch has 4k output channels, where 4 represents the 4 offsets (x, y, w, h).

Because the input image size is not fixed, the RPN output size varies accordingly. For example, with an input image of size W×H×3 and a downsampling factor of 16, the RPN classification and bounding box outputs have sizes 18 × (W/16) × (H/16) and 36 × (W/16) × (H/16), respectively.


Figure 9: RCNN architecture

Figure 10: Faster RCNN architecture

Figure 11: RPN in Faster-RCNN

2.6.2 Anchors

What are anchors?

Anchors are pre-defined boxes, known before training the model. In Faster-RCNN, 9 anchors are defined for every pixel in the feature map, so the total number of anchors depends on the size of the feature map. For example, if the feature map after the backbone has size W×H×C (with C the number of channels of the feature map), then the total number of anchors will be W×H×9 (9 being the number of anchors per pixel). A small generation sketch follows the assignment rules below.

Anchors come in different sizes and ratios (figure 12).

Anchors are assigned as positive/negative (object/background) based on their IOU overlap with the ground truth bounding box, following these rules:


Figure 12: Anchors

• The anchor with the highest IOU with ground truth box will be positive

• Anchors with IOU ≥ 0.7 will be positive

• Anchors with IOU < 0.3 will be negative (background)

• Anchors with 0.3 ≤ IOU < 0.7 will be neutral and are not considered in model training
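Picking up the forward reference above, here is an illustrative sketch of generating the W×H×9 anchor grid (plain NumPy; the stride, scales, and ratios below are assumed values in the spirit of Faster-RCNN, not taken from the thesis):

```python
import numpy as np

def make_anchors(fm_h, fm_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate 9 anchors (3 scales x 3 ratios) centered on every feature-map pixel."""
    anchors = []
    for y in range(fm_h):
        for x in range(fm_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # center in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)    # same area, varying ratio
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

a = make_anchors(fm_h=38, fm_w=50)   # e.g. a 600x800 image downsampled by 16
print(a.shape)                       # (38 * 50 * 9, 4) = (17100, 4)
```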

The RoIs after the RPN step will contain overlapping regions, so a method called non-maximum suppression (NMS) is used to filter out those regions. The idea is simple:

• Let R be the set of RoIs after the RPN step with their corresponding confidence scores S, let N be an overlap threshold, and let D be an empty set.

• Take the RoI with the highest confidence score, remove it from R, and insert it into D.

• Compare this RoI with every RoI in R using IOU. If the IOU is greater than the overlap threshold N, remove that RoI from R.

• Repeat steps 2 and 3 until the set R is empty.

But NMS has its own weakness too. For example, with N = 0.5, some RoIs with IOU = 0.51 and very high confidence scores can still be removed from R. Vice versa, RoIs with IOU < 0.5 and low confidence scores are not removed from R, making the model appear worse.

Soft-NMS is proposed to solve this problem. Instead of removing RoIs that have a high overlap and a high confidence score, we decrease the confidence score based on the IOU:

s_i = \begin{cases} s_i, & \text{IOU}(M, b_i) < N \\ s_i (1 - \text{IOU}(M, b_i)), & \text{IOU}(M, b_i) \ge N \end{cases}
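A sketch of both procedures (illustrative Python; the iou helper is a hypothetical utility defined here so the snippet is self-contained):

```python
import numpy as np

def iou(a, b):
    """IOU of two (x1, y1, x2, y2) boxes (hypothetical helper for this sketch)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, n=0.5, soft=False):
    """Greedy (Soft-)NMS over RoIs, following the steps listed above."""
    boxes, scores, keep = list(boxes), list(scores), []
    while boxes:
        best = int(np.argmax(scores))          # pick the highest-confidence RoI
        m = boxes.pop(best)
        scores.pop(best)
        keep.append(m)
        if soft:
            # Soft-NMS: decay the scores of heavily overlapping RoIs instead
            scores = [s * (1 - iou(m, b)) if iou(m, b) >= n else s
                      for b, s in zip(boxes, scores)]
        else:
            # Plain NMS: drop every remaining RoI whose IOU with m exceeds n
            pairs = [(b, s) for b, s in zip(boxes, scores) if iou(m, b) < n]
            boxes = [b for b, _ in pairs]
            scores = [s for _, s in pairs]
    return keep

kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7])
print(kept)  # the second box overlaps the first too much and is suppressed
```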

2.6.3 RoI pooling

RoI pooling makes the output size of the feature map fixed. RoI pooling is a must, as the final layers of the model are 2 fully connected branches which require a fixed input size.

2.6.4 Detection model

After RoI pooling, we have output feature maps with a fixed size; they are flattened and passed through 2 fully connected layers (figure 13):

• Object classification with N+1 classes (N is the number of classes, +1 for the background)

• Bounding box regression to locate the RoI, with 4N outputs representing the 4 coordinates (x, y, w, h)


Figure 13: Detection model in Faster-RCNN

NMS is then applied as in the RPN step above.

2.6.5 Loss function

Faster-RCNN loss consists of 4 parts:

• RPN classification (object or background)

• RPN regression (anchor - region proposal)

• Fast-RCNN classification (N+1 classes)

• Fast-RCNN bounding box regression (region proposal - ground truth)

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)

where

• i is the index of an anchor in a mini-batch, pi is the probability for an anchor to be an object

• Lcls is the binary cross entropy for the question "does the anchor contain an object?" in the RPN, and the multi-class cross entropy in Faster-RCNN

• Lreg is the loss for bounding box regression using the Smooth L1 loss. Smooth L1 can be seen as a combination of the L1 and L2 losses:

\text{smooth}_{L1}(x) = \begin{cases} 0.5 x^2 / \alpha, & |x| \le \alpha \\ |x| - 0.5\alpha, & |x| > \alpha \end{cases}

2.7 YOLO model

The YOLO family of models has continued to evolve since the first initial release.

• YOLOv2 [25] made a number of iterative improvements on top of YOLO, including BatchNorm, higher resolution, and anchor boxes.

• YOLOv3 [26] built upon previous models by adding an objectness score to bounding box prediction, added connections to the backbone network layers, and made predictions at three separate levels of granularity to improve performance on smaller objects.

• YOLOv4 [7] introduced improvements like improved feature aggregation, a ”bag of freebies” (withaugmentations), mish activation, and more

• YOLOv5 [1] is the first model in the "YOLO family" not released with an accompanying paper, and it is under ongoing development. The Focus layer [?] introduced in this version evolved from the YOLOv3 structure. It helps reduce the required CUDA memory, reduce parameters, and increase forward and backward propagation speed.

• YOLOv7 [32] is the successor of YOLOv4; it incorporates the techniques from YOLOv4, YOLOv5, and "trainable bag-of-freebies", pushing the limit of object detection even further.

2.8 YOLO approach for object detection

The main idea of YOLO is to divide the image into an S × S grid. For each grid cell, there is a set of anchors, each of which predicts one object with the representation (x, y, width, height, class).

First, the image goes through a CNN to create an S × S feature map, called a grid. YOLO detects objects in each of the S × S cells. Each cell prediction contains B bounding boxes and probabilities for C classes. Each bounding box consists of 5 variables: center coordinates (x, y), width and height (w, h), and confidence. The confidence of a bounding box represents whether that bounding box contains any object. So for one cell, YOLO predicts a tensor with B × 5 + C elements, where B is the number of bounding boxes, 5 is the number of variables per bounding box, and C is the number of classes. For an S × S feature map, the shape of the output tensor from YOLO is S × S × (B × 5 + C).

2.9 YOLOv1

The YOLOv1 architecture is shown in figure 14.

Figure 14: YOLOv1 architecture

Its loss function consists of several parts:

L = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] + \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{noobj} (C_i - \hat{C}_i)^2 + \sum_{i=0}^{S^2} 1_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2

For every cell in the feature map and for every bounding box in that cell, the loss is calculated only if the cell contains an object; otherwise the loss is 0. The square root is used for the width and height of bounding boxes; the idea is that if the bounding box is small, the impact of a wrong regression is greater than for larger boxes.

For every bounding box among the B predicted bounding boxes of cell i, if bounding box j has the largest IOU with the ground truth bounding box, then 1_{ij}^{obj} = 1, else 0. 1_{ij}^{noobj} has the opposite value: 1_{ij}^{noobj} = 1 − 1_{ij}^{obj}.

Ĉ_i is the IOU of the predicted bounding box and the ground truth bounding box.

The number of no-object bounding boxes is large, so a hyper-parameter λ_{noobj} is added to balance the loss of the two parts.

2.10 YOLOv2

Anchor box. In YOLOv2, anchor boxes were used similarly to Faster-RCNN. The input image size was changed from 448 × 448 to 416 × 416 because the author wanted the final feature map size to be an odd number (with 448 × 448 the final feature map size would be 14 × 14). The idea is that images in the COCO dataset usually have an object at the center of the image, so having a center cell improves the chance that one of its anchor boxes detects the object. Using anchor boxes, the mAP of the model decreased but its recall increased, meaning that the model can detect more objects, but the quality of detection is worse.

In two-stage models (the R-CNN family), anchor boxes work well because the first stage also optimizes anchor box positions, while YOLO does not have that stage. So having good initial anchor boxes is very important for the model. YOLOv2 generates anchors through the k-means algorithm.

Also, YOLOv2 predicts the displacements of the anchor boxes t_x, t_y, t_w, t_h and an objectness score t_o, with t_x, t_y limited to the interval [0, 1]. This limits the center coordinates x, y of the bounding box when applying transformations on t_x, t_y, which means t_x, t_y in a grid cell cannot push the center of a bounding box outside that cell.

Figure 15: YOLOv2 architecture

2.11 YOLOv3

Architecture

Backbone. YOLOv3 uses a new backbone, called Darknet-53. YOLOv1's backbone used 1×1 convolutions (bottlenecks) from the Inception network, YOLOv2 added BatchNorm, and YOLOv3 applies skip-connections from ResNet, called Residual Blocks (figure 16).

Neck. In previous versions, detecting small objects was always a weak spot. Although YOLOv2 used skip connections from early layers to move information from bigger feature maps to later, smaller feature maps, it was not enough. YOLOv3 upgrades this: it uses a Feature Pyramid Network (FPN) and detects objects at 3 different scales (figure 17).

Other changes

Classification prediction. Previous YOLO models used a softmax in the classification output. From YOLOv3 on, the classification output uses a sigmoid, because some objects in some datasets are classified into 2 classes (person and woman, for example).

Bounding box prediction. Keeping the idea of anchor boxes found with k-means from YOLOv2, YOLOv3 makes its way of choosing bounding boxes explicit. In a grid cell of a feature map, YOLOv3 generates 9 anchor boxes (YOLOv2 used 5), with 3 anchor boxes per scale.

2.12 YOLOv4

PAN

PAN (Path Aggregation Network) is a variation of FPN (Feature Pyramid Network). In FPN, a branch is created for information to flow from deep layers to shallow layers. PAN adds another branch to bring information from shallow layers back to deep layers (figure 21).

Figure 16: YOLOv3 backbone

SPP

SPP (Spatial Pyramid Pooling) is a special block at the end of the backbone. It outputs 4 feature maps with the same H × W shape (the same shape as the backbone output). They are then concatenated together (figure 22).

Remove grid sensitivity

YOLOv4 uses a new formula to calculate the bounding box position from the prediction (t_x, t_y, t_w, t_h):

b_x = 1.1 \, σ(t_x) − 0.05 + c_x
b_y = 1.1 \, σ(t_y) − 0.05 + c_y
b_w = p_w e^{t_w}
b_h = p_h e^{t_h}

Using multiple anchors for one ground truth bounding box

In YOLOv3, only the anchor with the highest IOU with the ground truth is chosen as the positive anchor. Anchors whose IOU with the ground truth is smaller than a threshold (0.5, for example) are considered negative anchors. The others are not included in the model's loss; these are called neutral anchors.


Figure 17: YOLOv3 architecture

Figure 18: Darknet53 vs CSPDarknet53

But in YOLOv4, these neutral anchors are considered positive and participate in the loss calculation.

Label smoothing


Figure 19: CSPResBlock

Figure 20: Left: Sample image, Center: DropOut, Right: DropBlock

Figure 21: PAN structure

Label smoothing is a regularization technique that introduces noise into the labels. This accounts for the fact that datasets may have mistakes in them, so maximizing the likelihood of log p(y|x) directly can be harmful. Assume for a small constant ε that the training set label y is correct with probability 1 − ε and incorrect otherwise. Label smoothing regularizes a model based on a softmax output by replacing the hard 0 and 1 classification targets with targets of ε/(k − 1) and 1 − ε, respectively.
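A small sketch of this target construction (illustrative NumPy; k is the number of classes):

```python
import numpy as np

def smooth_labels(true_class, k, eps=0.1):
    """Replace the hard one-hot target with a smoothed distribution."""
    target = np.full(k, eps / (k - 1))   # every wrong class gets eps/(k-1)
    target[true_class] = 1.0 - eps       # the correct class gets 1 - eps
    return target

print(smooth_labels(true_class=2, k=5))  # [0.025 0.025 0.9 0.025 0.025]
```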


References

[5] Abdelrahman Abdallah, Alexander Berendeyev, Islam Nuradin, and Daniyar Nurseitov. Tncr: Table net detection and classification dataset. Neurocomputing, 473:79–97, 2022.

[6] Teppi Aly, In Na, and Soo Kim. Page segmentation using minimum homogeneity algorithm and adaptive mathematical morphology. International Journal on Document Analysis and Recognition (IJDAR), 19, 09 2016.

[7] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.

[8] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.

[9] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13733–13742, 2021.

[10] Mark Everingham, Luc Van Gool, Christopher Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88:303–338, 06 2010.

[11] Liangcai Gao, Yilun Huang, Hervé Déjean, Jean-Luc Meunier, Qinqin Yan, Yu Fang, Florian Kleber, and Eva Maria Lang. Icdar 2019 competition on table detection and recognition (ctdar). 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1510–1515, 2019.

[12] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 303–312, 2021.

[13] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.

[14] Azka Gilani, Shah Rukh Qasim, Imran Malik, and Faisal Shafait. Table detection using deep learning. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 771–776, 2017.

[15] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.

[16] Gaurav Harit and Anukriti Bansal. Table detection in document images using header and trailer patterns. 12 2012.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[18] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019.

[19] Thomas G. Kieninger. Table structure recognition based on robust block segmentation. In Daniel P. Lopresti and Jiangying Zhou, editors, Document Recognition V, volume 3305, pages 22–32. International Society for Optics and Photonics, SPIE, 1998.

[20] Youngwan Lee, Joong-won Hwang, Sangrok Lee, Yuseok Bae, and Jongyoul Park. An energy and gpu-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.

[21] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. Tablebank: A benchmark dataset for table detection and recognition. arXiv preprint arXiv:1903.01949, 2019.

[22] Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.

[23] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 572–573, 2020.

[29] Prof. Dr. Faisal Shafait. Table ground truth for the uw3 and unlv datasets (dfki-tgt-2010). http://tc11.cvc.uab.es/datasets/DFKI-TGT-2010_1
