
SKEW ESTIMATION FOR DOCUMENT IMAGES

YUAN BO (M.Sc., NUS, Peking; B.Sc., Peking)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

DEPARTMENT OF COMPUTER SCIENCE

SCHOOL OF COMPUTING

NATIONAL UNIVERSITY OF SINGAPORE

2005


Acknowledgements

I would like to thank my academic advisor, Professor Tan Chew Lim of Computer Science, for his invaluable guidance and encouragement throughout the many years of my graduate studies in Computer Science. He provided me with the best of his knowledge and research equipment, which made my research smooth and fruitful.

I am always grateful to Professor Tang Seung Mun of Physics for providing me with many opportunities to widen my academic views and enrich my personal life. Without his help, I would not have been able to achieve what I have today.

I owe so much to my wife Xiaojing for her extraordinarily hard work for the family and her constant support for my study in the years past. I could never really understand how difficult it is to be a wife with a successful career and two young children, our daughter Xinyi and our son Xinran, with almost no external help.

My parents longed year after year for my being the first Ph.D. in the family. Now the time has come, even though they hardly understand the work I have done. To them, the degree itself is glorious enough to reward their decades of sacrifices to bring me up and give me the highest possible education.


Table of Contents

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
SUMMARY
LIST OF FIGURES
LIST OF TABLES

CHAPTER 1 INTRODUCTION
1.1 Digital and Analog Publications
1.2 Motivations and Contributions
1.3 Organization of This Thesis

CHAPTER 2 RELATED WORK
2.1 Related Work on Skew Estimation
2.1.1 Projection-profile based skew estimation class
2.1.2 Hough-transform based skew estimation class
2.1.3 Nearest-neighbor clustering based skew estimation class
2.1.4 Morphological operation based skew estimation class
2.1.5 Spatial frequency based skew estimation class
2.1.6 Other approaches of skew estimation methods
2.2 Related Work on Page Segmentation
2.2.1 Connected component analysis based
2.2.2 Projection profile based
2.2.3 Morphological operation based
2.2.4 Background based
2.2.5 Other approaches

CHAPTER 3 SKEW ESTIMATION FROM FIDUCIAL LINES
3.1 Skew Estimation
3.1.1 Histogram generation
3.1.2 Peaks searching
3.1.3 Results verification
3.1.4 Working on component holes – the background mode
3.2 Speedup Measures
3.2.1 Alternative histogram configuration
3.2.2 Filters for individual components
3.2.3 Filters for component pairs
3.2.4 Faster slope-to-angle calculation
3.2.5 Skew-independent segmentation
3.3 Experimental Results
3.3.1 Synthetic images (total 168 from UW-I)
3.3.2 Scanned images (total 979 from UW-I)
3.3.3 Scanned images from Chinese newspaper clips
3.4 Conclusion

CHAPTER 4 SKEW ESTIMATION FROM CONVEX HULLS
4.1 Components Grouping
4.1.1 The proposed grouping function
4.1.2 The advantages of using convex hulls
4.1.3 The choice of the parameter k
4.1.4 The detection-loop with k feedback
4.2 Skew Estimation
4.2.1 The detection of parallel/perpendicular edges of convex hulls
4.2.2 The accumulator array for the edge slopes
4.2.3 The search for peaks
4.3 Experimental Results
4.3.1 Synthetic images (total 168 from UW-I)
4.3.2 Scanned images (total 979 from UW-I)
4.4 Conclusion

CHAPTER 5 SKEW ESTIMATION FROM STRAIGHT EDGES
5.1 Skew Estimation
5.1.1 The edge enhancement
5.1.2 The Wallace parameterization
5.1.3 The probe-line sweeping and scanning
5.1.4 The perpendicular criterion
5.1.5 The speedup measures
5.2 Experimental Results
5.3 Conclusion

CHAPTER 6 COMPARISONS AND DISCUSSIONS
6.1 Suite Tests
6.1.1 Suite tests using UW-I
6.1.2 Suite tests using "DOE samples"
6.2 Feature-by-Feature Comparisons
6.2.1 Excessive noises
6.2.2 Multiple skews
6.2.3 Non-textual documents
6.2.4 Other issues
6.3 Complementary Features
6.3.1 The choice of centroids
6.3.2 Divide and conquer
6.4 Future Work

ANNOTATED BIBLIOGRAPHY

APPENDIX


Summary

This thesis represents our efforts in developing a series of skew estimation models for scanned document images. These three self-contained yet complementary models include the fiducial-line based model, which relies on the existence of text lines; the convex-hull based model, which relies on the existence of paragraphs or columns; and the straight-edge based model, which relies on the existence of non-textual components that have straight edges or lines. The objective is to tackle some of the major problems that still challenge document analysis systems at present: excessive interfering noise; multiple skews and their locations in a single document; and skew estimation for non-textual documents.

The first model is based on fiducial lines. A fiducial line is defined as the virtual straight line that passes through the centroids of any two components. For textual document images, the fiducial lines of the components comprising the text lines have highly concentrated slopes. Any out-of-the-text-lines component pairs have their fiducial lines spread widely across the slope histogram. The central values of the highest peaks in the histogram are taken as the estimated skews of the image. This proposed model works very well for document images with excessive noise, without requiring the separation of the textual components from non-textual ones. Speedup measures for this model's baseline implementation are provided.

The second model is based on convex hulls. A convex hull is defined as the smallest virtual convex polygon that fully contains a component or a group of components. …

The third model is based on straight edges. The straight edges or lines in an image include the separators of tables, the borders of graphical inserts, the black bars around the borders, the center spine of bound materials, etc. This proposed model first applies an edge detector to an image to highlight the borders. Then, it uses a line-probing algorithm in the same configuration as the Wallace parameterization for the Muff transform. Any significant straight edges or lines are identified and used as the basis for skew estimation. Various strategies for optimized line probing are devised. This proposed model is applicable to both textual and graphical documents scanned with ordinary scanners or copiers under normal conditions.

The performance of these models is evaluated using the full set of 168 synthetic and 979 real images from the University of Washington English Document Image Database I (UW-I).

List of Figures

Figure 3.1 … among the centroids of components.
Figure 3.2 The fiducial lines are drawn on the image in Figure 3.1 along the angles of 1.72±0.02°.
Figure 3.3 The flowchart of the fiducial line based skew estimation model.
Figure 3.4 The slope histogram of the fiducial lines for the image in Figure 3.1. The prominent spikes at 0°, -90°, 45°, etc. are the results of the quantization effects and can be removed by the convolution as proposed in this chapter.
Figure 3.5 The slope histograms for separated inter-/intra-line components of the image in Figure 3.1. The contributions from intra-line components form a broad background, while the contributions from inter-line components form an easily recognizable sharp peak.
Figure 3.6 Distance versus angle plot for the grid points in a squared imaging grid. The quantization effects are obvious at short distances, especially along ±90° or tg⁻¹(±∞), 0° or tg⁻¹(0), ±45° or tg⁻¹(±1), ±63.44° or tg⁻¹(±2), and ±26.56° or tg⁻¹(±1/2).
Figure 3.7 The convolved histogram of Figure 2.3 with the kernel shown as inset (σ=0.5°, not to scale).
Figure 3.8 The convolved histograms for the individual lines of the image in Figure 3.1. The histogram in Figure 2.6 is the addition of all the nine lines.
Figure 3.9 Working on the component holes along the angles of 1.78±0.02° for the image in Figure 3.1. The holes are extracted with 4-connectedness on the background. The major white spaces have no effect on the skew detection (removed here for cleaner presentation).
Figure 3.10 The convolved histogram in the background mode for the image in Figure 3.1 (σ=0.5°). The S/N ratio and peak accuracy are both lower than in the foreground mode, but still usable even for skew detection on low-resolution or down-sampled images.
Figure 3.11 Design of the distance filter for component pairs using the image in Figure 3.1. The dense, stripe-like central pattern is formed by the intra-line component pairs, while the other hyperbola-shaped patterns are from inter-line pairs.
Figure 3.12 Design of the size-difference filter for component pairs using the image in Figure 3.1. The major contributions to the central peak are from the component pairs whose size differences are less than 100 pixels.
Figure 3.13 The accumulated percentage of samples versus the absolute error on the 168 synthetic document images in UW-I. Each of the images in the database is randomly rotated three times in the range of [0°, 90°), resulting in 504 rotated images.
Figure 3.14 The accumulated percentage of samples versus the absolute error on the 979 real document images in UW-I.
Figure 3.15 Regression analysis using the real document images in UW-I. The linear correlation coefficient is 92.80%. The details of the labeled outliers are shown from Figure 3.26 to Figure 3.30.
Figure 3.16 The sample H04I from UW-I. The ground truth is -0.1° and the detected skew angle is 89.82°. Fiducial lines are drawn along the angles of 89.82±0.02°.
Figure 3.17 The raw (top) and the convolved (bottom) histograms of the sample H04I from UW-I. The detected skew angle is -0.18° (89.82° - 90°). The peak at 0.1° is from the horizontal text lines at the bottom of the page.
Figure 3.18 The sample D03E from UW-I. The ground truth is -0.21°, and the detected skew angle is 0.16°. Fiducial lines are drawn along the angles of 0.16±0.02°.
Figure 3.19 The raw (top) and the convolved (bottom) histograms of the sample D03E from UW-I. The detected skew angle is 0.16°.
Figure 3.20 The sample E01E from UW-I. The ground truth is -0.04°, and the detected skew angle is -0.04°. Fiducial lines are drawn along the angles of -0.04±0.02°.
Figure 3.21 The raw (top) and the convolved (bottom) histograms of the sample E01E from UW-I. The detected skew angle is -0.04°.
Figure 3.22 The sample H00L from UW-I (labeled in Figure 3.15). The ground truth is 1.47°, and the detected skew angle is 1.36°. Fiducial lines are drawn along the angles of 1.36±0.02°. This is one of the images that contain sparse text.
Figure 3.23 The raw (top) and the convolved (bottom) histograms of the sample H00L from UW-I. The detected skew angle is 1.36°. The S/N ratio is only 19.68 dB.
Figure 3.24 A scanned Chinese newspaper clip. Fiducial lines are drawn on the original image (top-left) along the angles of 0.04±0.02° (top-right), 50.44±0.02° (bottom-left) and 89.86±0.02° (bottom-right).
Figure 3.25 The raw (top) and the convolved (bottom) histograms of the Chinese newspaper clip. There are multiple prominent peaks in the convolved histogram due to the special style of Chinese text.
Figure 3.26 The sample A03I from UW-I (labeled in Figure 3.15). Fiducial lines are drawn on the original image along the angles of 0.90±0.02° (top, detected) and -0.65±0.02° (bottom, ground truth). The images are rotated 90° counter-clockwise. The detected skew angle is for the dominant left page, while the ground truth may be for the right page.
Figure 3.27 The sample A03J from UW-I (labeled in Figure 3.15). Fiducial lines are drawn on the original image along the angles of -0.54±0.02° (top, detected) and 0.81±0.02° (bottom, ground truth). The images are rotated 90° counter-clockwise. The detected skew angle is for the dominant right page, while the ground truth may be for the left page.
Figure 3.28 The sample A05G from UW-I (labeled in Figure 3.15). Fiducial lines are drawn on the original image along the angles of 0.48±0.02° (top, detected) and -2.12±0.02° (bottom, ground truth). The images are rotated 90° counter-clockwise. The detected skew angle is for the dominant right page, while the ground truth is doubtful.
Figure 3.29 The sample N03I from UW-I (labeled in Figure 3.15). Fiducial lines are drawn on the original image along the angles of -0.64±0.02° (top, detected) and 0.25±0.02° (bottom, ground truth). The images are rotated 90° counter-clockwise. This is a false detection case caused by the cross-column correlation when the text lines in different columns are not collinearly aligned. Many skew detectors suffer from this kind of sample.
Figure 3.30 The sample S021 from UW-I (labeled in Figure 3.15). Fiducial lines are drawn on the original image along the angles of 0.58±0.02° (top, detected) and -1.00±0.02° (bottom, ground truth). The images are rotated 90° counter-clockwise. This is another false detection case caused by the cross-column correlation when the text lines in different columns are not collinearly aligned.
Figure 4.1 The convex hulls of the components with their vertices and centroids marked. This is a clip of the image A00O from UW-I.
Figure 4.2 The convex hulls of the component groups with their vertices marked. This is a clip of the image A00O from UW-I.
Figure 4.3 A reference implementation of the components grouping algorithm in pseudo code. In principle, this is a partition algorithm with a binary predicate.
Figure 4.4 The area distributions of the components (top) and their convex hulls (bottom) of the image A00O from UW-I. The background is the original image represented by its components (top) and convex hulls (bottom) of the corresponding half.
Figure 4.5 The area distributions of the components (top) and their convex hulls (bottom) of a Chinese newspaper clip. The background is the original image represented by its components (top) and convex hulls (bottom) of the corresponding half.
Figure 4.6 Various components grouping stages for the image A00O: (foreground) the weighted density of the groups versus the k value; (background) the component groups and their convex hulls at k = 6, 12 and 20.
Figure 4.7 The frequency distribution of the smallest k at which the formation of the paragraphs stabilizes for the 979 real images and the 168 synthetic images from UW-I. Using the convex hulls of the components (bottom) is superior to using the components directly (top).
Figure 4.8 The flowchart of the convex hull based skew estimation model.
Figure 4.9 The accumulated percentage of samples versus the absolute error on the 168 synthetic document images in UW-I. Each of the images in the database is randomly rotated three times in the range of [-45°, 45°), resulting in 504 rotated images.
Figure 4.10 The accumulated percentage of samples versus the absolute error on the 979 real document images in UW-I.
Figure 4.11 Regression analysis using the 979 real document images in UW-I. The linear correlation coefficient is 92.1%.
Figure 4.12 The sample A002 from UW-I (labeled in Figure 4.11). The ground truth is 0.4°, and the detected skew angle is -2.54° (highest peak) for the left half and 0.28° (second highest peak) for the right half of the image. The components in gray are those filtered out by the size filter or the aspect-ratio filter, while the components in black are those grouped by the grouping function. The edges and vertices of their convex hulls are drawn in gray.
Figure 4.13 The sample A03I from UW-I (labeled in Figure 4.11). The ground truth is -0.65°, and the detected skew angle is 1.06°.
Figure 4.14 The sample A05G from UW-I (labeled in Figure 4.11). The ground truth is -2.12°, and the detected skew angle is 0.14°. The ground truth is doubtful in this case.
Figure 4.15 The sample J00B from UW-I (labeled in Figure 4.11). The ground truth is -0.48°, and the detected skew angle is 0.95° for the left page and -0.52° for the right page. The most prominent peak is for the left page. The value of the parameter k has been increased from 16 to 35.
Figure 4.16 The sample N042 from UW-I (labeled in Figure 4.11). The ground truth is 0.79°, and the detected skew angle is 0.0°. This sample reveals the limitation of angular resolution at short distances, which is true for any skew estimation method.
Figure 4.17 The sample H04I from UW-I. The ground truth is -0.10°, and the detected skew angle is -0.19°. This is one of the samples that demonstrate the robustness of the proposed skew estimation method in the presence of excessive noise.
Figure 4.18 The sample I047 from UW-I. The ground truth is 0.50°, and the detected skew angle is 0.25°. This is one of the samples that demonstrate the robustness and versatility of the convex hull based model in selecting hints for skew estimation.
Figure 4.19 The sample A06M from UW-I. The ground truth is -3.00°, and the detected skew angle is -2.75°. The warping along the spine of the original document does not impede the correct detection of the skew angle.
Figure 5.1 The scanned pages with black bars and table dividers (left), photographic inserts (center), and field dividers (right).
Figure 5.2 The Wallace parameterization [54], where w and h are the width and height of the image, respectively. The line from S1 to S2 is a probe-line. Note that S2 is always greater than S1 in this configuration in order to achieve unique probe-lines.
Figure 5.3 The flowchart of the straight edge based skew estimation model.
Figure 5.4 The possible range of the unique (S1, S2) pairs in the Muff space, shown in the shaded areas, where w and h are the width and height of the image. The total size of the shaded areas is 6wh + (h-w)². An area marked "edge" lies on one of the four edges of the image and is thus not useful. An area marked "dup" is a duplicate of the symmetric area across the diagonal.
Figure 5.5 The perpendicular criterion for any two probe-lines.
Figure 5.6 Detection result for the image A00G (ground truth: 0.95°, detected: 1.46°).
Figure 5.7 Detection result for the image A002 (ground truth: 0.40°, detected: 0.35°).
Figure 5.8 Detection result for the image H04I (ground truth: -0.10°, detected: -0.13°).
Figure 5.9 Detection result for the image D053 (ground truth: 0.02°, detected: 0.09°).
Figure 6.1 Some of the "DOE samples" used in the tests by Bagdanov et al., which were intentionally down-sampled by the original authors due to the sensitivity of their contents.
Figure 6.2 Charts of the cumulative probability versus the absolute skew angle estimation error (in degrees) from Chen et al. at the University of Washington using the 168 synthetic and the 979 real images from UW-I: (a) with the hand-tuned optimal parameters and the 2×3 box; (b) with the hand-tuned optimal parameters and the 2×2 box; (c) in automatic mode with the 2×3 box; and (d) in automatic mode with the 2×2 box. (Source: figures 1 and 3 in ref. [38])
Figure 6.3 Charts of the cumulative percent of samples versus the absolute error from Bloomberg, Kopec and Dasari at Xerox Palo Alto Research Center using the images from UW-I: (a) on the 11 selected synthetic images; (b) on the 979 real images. (Source: figures 3 and 6 in ref. [13])
Figure 6.4 The 3-party performance evaluation using the real samples from UW-I. See Table 6.1 for the numerical values and the uncertainty of the data acquisition process.
Figure 6.5 Regression analysis on Postl's method [6] using 460 images selected from the internal DOE samples. The linear correlation coefficient is 82.9%. (Source: Figure 4 in ref. [103])
Figure 6.6 Regression analysis on Baird's method [7] using 460 images selected from the internal DOE samples. The linear correlation coefficient is 93.7%. (Source: Figure 4 in ref. [103])
Figure 6.7 Regression analysis on Nakano's method [10] ("sharpness" measure #3) using 460 images selected from the internal DOE samples. The linear correlation coefficient is 89.0%. (Source: Figure 4 in ref. [103])
Figure 6.8 Regression analysis on Nakano's method [10] ("sharpness" measure #4) using 460 images selected from the internal DOE samples. The linear correlation coefficient is 90.6%. (Source: Figure 4 in ref. [103])
Figure 6.9 The convolved slope histogram produced from using the convex hull based fiducial lines on the image in Figure 2.1.
Figure 6.10 Convex hull based fiducial lines drawn on the image in Figure 2.1 along the angles of 1.74±0.02°.


List of Tables

Table 6.1 Performance comparison using the 979 real document images in UW-I. Shaded rows are the best performers from Chen, Bloomberg and Yuan. (Chart digitization uncertainty: ±0.5%)
Table 6.2 Feature comparison among the three skew estimation models in this thesis and the three popular approaches.

Chapter 1 Introduction

Skew estimation is the main theme of this thesis. However, skew estimation has never been an isolated study. It is one of the important constituents of the preprocessing stages for document image analysis and understanding. Therefore, this thesis describes our research on not only skew estimation but also several closely related topics that include page segmentation, convex hull analysis, text and graphics separation, patterned line detection, and others that involve computer graphics and geometry. There are also new techniques that have been developed and applied to the research work of this thesis but are not specifically described, such as multi-level image thresholding, recursive connected component analysis, and connection-map based shape identification and retrieval.

1.1 Digital and Analog Publications

Digital publications revolutionize the publishing industry and our society in many ways. One of the most desired features is their automated content indexing and searching capability. New publications are almost all produced on computers, and the quality of output is unprecedented. This is in sharp contrast to the immense difficulty involved in typesetting, printing, cataloging, content searching, and preserving analog publications such as paper prints or microfilm archives. Therefore, many efforts have been put into the research and development of new technologies to convert the legacy analog publications into the new electronic form [1]. The ultimate goal is to integrate the analog publications with the new digital publications to form a unified computerized archiving system.

This technology shift faces many obstacles on both the machine side and the human side. On the machine side, the major challenge is the computer recognition of textual and graphical components, together with positional information or layout, in the legacy analog publications. The involved tasks are categorized in Figure 1.1 according to their processing levels.

Figure 1.1 The different levels of tasks in document image analysis.

The first step in computer recognition of the analog publications is the digitization process. The analog publications are optically converted from continuous image intensity into discrete binary values so that computers can store and process them. Even this primary step can offer some useful advantages over the analog form, such as a theoretically unlimited document lifetime, perfect duplication, ease of browsing documents in large quantities, remote access and distribution, etc. In fact, there are working systems that store high-quality scanned document images (with compression of some sort) in huge CD-ROM arrays controlled by computers. Users can find particular pages and issues of the publications, view them on a local screen, or transmit them over the computer network. Systems of this caliber, however, have some severe shortcomings. The most prominent one is their inability to do automated content search and indexing, even though they do provide the page and volume serial numbers presented in the original analog documents.

The next step is feature identification and extraction. The textual features are extracted from the scanned document images by means of optical character recognition. The graphical features are extracted by means of feature detection. Pixel-level processing on the scanned images is generally required before the component extraction can yield correct results. With the positional information of both textual and graphical components identified and registered, the results are saved in open-standard or proprietary document formats or databases. At this stage, legacy analog publications and modern digital publications are truly unified and can be processed in exactly the same way. Thus the much desired feature of automated context-sensitive searching for analog documents is implemented. Users specify the text to be searched or the graphical element to be matched or both, and set the search scope; the archiving system then identifies the matching textual and graphical context and refers the users to the locations of these findings, and the users can manually narrow down the list to what they intend to find.

The searching and indexing capabilities of the archiving systems can be further improved by incorporating knowledge-domain based search engines. For example, thesaurus-like databases actively collect or passively cache closely related keywords and graphical features so that the accuracy and efficiency of search can be greatly improved.


1.2 Motivations and Contributions

Skew estimation, as an important preprocessing stage for the conversion from analog publications to their digital counterparts, has matured since the early works in the mid-1980s. The research publications and patent documents are very rich, representing the large number of different approaches. However, there are still some difficult problems left unsolved in the research and applications of skew estimation and correction for scanned pages, such as the existence of excessive noise, multiple skews on the same scanned page, sparse text or short text blocks, scanning artifacts, etc. Figure 1.2 shows some of the scanned document samples with such problems. These problems severely affect the smooth operation of current document image processing software systems.

Figure 1.2 Some of the scanned document images that contain excessive noise, multiple skews, sparse text or short text blocks, scanning artifacts, and so on. They are still serious challenges to any document analysis system.

Today, most OCR software can handle skewed documents before doing OCR. However, if there are multiple skews, the OCR engines will only work on textual areas with the estimated predominant skew. For document images that contain excessive noise, the performance of the existing methods that rely on the predominant alignment of the text lines for skew estimation will deteriorate, or the methods may even fail. Faxed documents are often badly skewed, with much isolated noise. To prevent the noise from interfering with the normal operation of the skew estimation procedure, the current methods typically try to filter out graphics before working on the text itself. However, designing general-purpose graphics/text filters is difficult, if not impossible. For documents that are predominantly or entirely graphical, the current filters will remove most of the graphics, leaving little or almost no text for finding the skew. This leads to the problem that sparse text or short text blocks are difficult for many existing skew estimation methods, because they need a sufficient amount of text as clues for the skew detection. With all these problems, we now face the challenge of developing a set of techniques to handle them.

The major contributions of this thesis are the three skew estimation models and the associated techniques to address the problems discussed above.

We coin the term "fiducial line" as a generalization of the term "fiducial point" as proposed by Spitz [11]. This term represents the unique use of the slope histogram to detect collinear components in a document image. When applied to document images with excessive noise, this model overcomes the difficulties on which other methods fail, such as the projection-profile based methods or those using linear regression with median estimators. This model has been published in Refs. [107] and [108].

We coin the term "convex hull analysis" after the term "connected component analysis", which has already been widely used in the pattern recognition literature. The convex hull is one of the fundamental concepts in computational geometry. This term represents the unique use of the convex hulls of components and component groups for extracting geometric properties from textual document images. By using the convex hulls of components at various levels, this model provides a viable way of detecting multiple skews, together with their locations, in a single image. This model has been published in Refs. [109] to [112].

The Muff transform was proposed by Wallace [54] as a variation of the Hough transform. We use Wallace's parameterization scheme only for straight edge or line detection. This model has the advantages of being able to detect constrained or patterned edges or lines, with implementation flexibility and efficient memory use. Its application to scanned documents with the ubiquitous black bars at the borders turns this undesirable artifact into a beneficial feature. This solution in software is particularly attractive for use with commodity scanners and copiers. Since this model works in graphics mode, it also solves the problems of sparse text or heavy graphical noise. This model has been published in Ref. [113].

Even though the number of methods on skew estimation in the research literature is large, there are only a limited number of comparative evaluations or group tests using large-scale test sets that are well accepted by the research community. We compile some quantitative test results that involve two of our models in this thesis, a method from Chen et al. [37], a method from Bloomberg et al. [13], a method from Postl [6], a method from Baird [7], and a method from Nakano et al. [10]. The test sets used are the well-established databases from the University of Washington at Seattle and the University of Nevada at Las Vegas. The tests were carried out either by the original authors, including us, or by Bagdanov et al. [103]. These compiled quantitative comparisons offer some insights into the designs of skew estimation methods, the testing methodologies, and the test databases. For the purpose of future open public group tests by any interested parties, we publish the complete numerical results in print form as well as in distributable digital forms.

1.3 Organization of This Thesis

This thesis is organized in chapters as follows.

Chapter 1 gives an overview of the current research focus on the development of digital documents, the significance of the conversion from legacy analog documents to the current digital form, and the importance of the research on skew estimation in this thesis as part of the analog-to-digital conversion process. A brief summary of our contributions to some of the major persistent problems of skew estimation and the organization of this thesis are also given.

Chapter 2 reviews most of the major research works in skew estimation and page segmentation, especially those that are closely related to this thesis.

Chapter 3, Chapter 4, and Chapter 5 present the three major contributions of this thesis, namely the skew estimation models based on fiducial lines, convex hulls, and straight edges for document images. A brief discussion is given in each chapter to highlight the principles and the implementation of the specific model. The detailed discussions and comparisons among the different methods, either from the literature or from this thesis, are given in Chapter 6.

Chapter 6 gives the quantitative analysis and feature-by-feature comparisons among our models and several established methods in the research literature. Also shown are the complementary features of our models, where one model can improve the performance of another. Some possible future work for the current models and ideas is sketched in this chapter.

The "Annotated Bibliography" compiles the publications that are referenced in this thesis. Many of the entries in this list are among the most influential research works and patents in the fields related to this thesis. We give brief comments on each entry for bookkeeping purposes. Our publications in international refereed journals and conference proceedings are also listed at the end.

Finally, the "Appendix" publishes the complete seven pages of numerical experimental results produced by two of the models in this thesis. The purpose is to create an opportunity for possible open tests by any interested researchers using the UW-I database. Digital forms are provided elsewhere as deemed appropriate.

Chapter 2 Related Work

Printed documents are customarily rectangular. Ideally, text lines in the documents are either horizontal or vertical relative to the edges of the document pages. Due to the imprecision or difficulty in the placement of the original documents on the surfaces of scanners or copiers, the edges of the captured pages may not always align precisely with those of the image. This amount of misalignment or offset is usually referred to as the skew angle of a document image. Skew estimation is the process of detecting the skew angles and their specific locations in an image for the subsequent correction.

Skew estimation is one of the important constituents of document image analysis. Document Image Analysis (DIA), as defined by Nagy [2], "is the theory and practice of recovering the symbol structure of digital images scanned from paper or produced by computer". DIA is a task that consists of pre-processing, central processing, and post-processing stages. Different researchers classify the processing stages differently, however.

The pre-processing stage may include pixel-level noise filtering, which intends to eliminate pixels that are unlikely to represent useful information, based mainly on their sizes; binarization, which defines the foreground pixels (information) and the background pixels (separator) of the input grayscale images; connected component analysis, which extracts spatially related foreground pixels as separate entities (components) based on their connectivity (usually 4 or 8); skew estimation, which detects the offsets of the edges of a page from the horizontal/vertical directions (the main interest of this thesis); edge extraction, which locates the foreground pixels of a component that have neighboring background pixels; textured background analysis, which separates the foreground pixels from background pixels that collectively show certain geometric patterns; text/graphics separation, which identifies the textual or graphical attributes of groups of components based on geometric or statistical properties of their pixels; and page segmentation, which partitions an image into regions or blocks of components that may be treated separately based on their similarities in shape, space, or other properties.

The central processing stage may include Optical Character Recognition (OCR) and special symbol recognition, such as that of engineering drawings. This is the stage where the pixels or components are translated into predefined symbols that correspond to the characters of a certain natural language. Character segmentation, character scaling, and font/style recognition can also be considered parts of the OCR process. This stage converts an analog document on paper, for humans to read, into a digital document in codes, for computers to process.

The post-processing stage may contain higher levels of processing, such as functional or linguistic analysis of the identified symbols in the two-dimensional arrangement from the page segmentation process. It is important for natural language processing to be applied to interpret the meanings that have been carried over from the original analog documents by these symbols and their positions.

Here, we can see that skew estimation is very important to the success of the other processes involved in a DIA system. In many cases, skew estimation is integrated with other processes to achieve better results or to solve specially intended problems. This fusion point is where many new research works on skew estimation join in with new approaches.

2.1 Related Work on Skew Estimation

Due to the widespread use of digital scanners and copiers, skew estimation and correction for scanned document images has triggered extensive studies, and a large array of techniques has consequently been developed. Different skew estimation methods compete on the aspects of detection accuracy, time and space efficiency, the ability to detect the existence of multiple skews in the same image, and robustness against noisy environments and scan-introduced distortions.

There are several in-depth reviews and surveys [2]-[5], as well as a comparative performance evaluation of some selected skew estimation methods [14], available in the research literature. Besides the journal publications and conference proceedings, patent documents are another rich source of information that in many cases is even more comprehensive in terms of detail and completeness.

In a textual document image, there are various hints of skew. The most explored hint is the existence of straight text lines. Skew estimation methods first try to locate the portion of the document where the hints of skew reside; then they use various strategies to approximate these text lines in order to derive the possible skew angles.

Based on the deployed strategies, the most popular skew estimation methods can be grouped into the following six distinctive classes: the projection-profile based, the Hough-transform based, the nearest-neighbor clustering based, the morphological operation based, the spatial frequency based, and other approaches.

2.1.1 Projection-profile based skew estimation class

Projection profiles are probably the earliest method used for skew estimation, and they continue to be the most popular choice for document analysis systems.

Typical projection-profile based skew estimation methods use a single point, called a "fiducial point" by Spitz [11], to represent each feature in an image, such as the bottom center of the bounding box of a connected component as used by Baird [7]. The fiducial points of all the features are projected onto a 1-D accumulator array along an angle, and a function, called the "premium function" by Baird [7], which is able to reveal the alignments of the fiducial points, is evaluated on the accumulator array. If the projection direction is rotated successively through a number of predefined angles, a series of projection profiles is obtained. The premium function should reach its extremes when the projection direction is along the text lines. To speed up the detection, the projections can be done first at large intervals of angles, and then at smaller intervals around the most likely directions detected in the first round.

Another choice of fiducial points is the foreground pixels, as used by Postl [6] and many people after him. Along a given direction of projection, all the foreground pixels are added into the accumulator array as the profile for this angle. It is obvious that the projection profile has the lowest counts along the lines that correspond to the white spaces separating text lines. The premium function can be used in a similar way as Baird's.
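The following sketch illustrates the projection-profile idea described above. It is a minimal illustration rather than any cited author's actual implementation: the rotate-and-project approach, the default angle sweep, and the differential-energy premium function are our assumptions here.

```python
import numpy as np
from scipy.ndimage import rotate

def projection_profile_skew(binary_img, angles=np.arange(-5, 5.1, 0.1)):
    """Estimate skew by sweeping projection angles over a binary image.

    binary_img: 2-D array with foreground pixels as 1.
    Returns the angle whose horizontal projection profile maximizes the
    premium function (here: sum of squared differences of adjacent rows).
    """
    best_angle, best_score = 0.0, -np.inf
    for angle in angles:
        # Rotate the page, then count foreground pixels per row.
        rotated = rotate(binary_img, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)
        # Sharp alternation between text lines and inter-line gaps gives
        # a large differential energy when rows align with the text.
        score = np.sum(np.diff(profile) ** 2)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

In practice, a coarse-to-fine sweep (large angle steps first, then smaller steps around the best candidates) replaces the single flat sweep shown here, exactly as the text above suggests.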

Bloomberg [13] uses the same approach as Postl. He first sub-samples the full-resolution image in the two dimensions by a factor of 2, 4, or 8 for the purpose of reducing the processing time. He uses a premium function that is the sum of the squares of the differential counts on adjacent lines. Note that the counts are the total numbers of foreground pixels along a projection line, raised to a power of 2. He uses both the sweep and bisection methods to locate the maximum in the profile. He then suggests that multiple passes be used, from coarse to fine resolutions. In his paper, he also presents a method for page orientation determination, because the portrait or landscape orientation cannot be determined from the projection profiles.

What interests us the most in his paper is that he uses the full set of images in UW-I to evaluate his skew estimation method. His paper also provides some details of his method in dealing with certain more difficult images in UW-I. In Section 6.1.1, "Suite tests using UW-I", we will make quantitative comparisons among the participants, including his method, Chen's [38], and two models from our research team.

2.1.2 Hough-transform based skew estimation class

Typical Hough-transform [18] based skew estimation methods select a set of fiducial points {x, y} to represent the components and then map them to the parameter space (Hough space) with a certain parameterization. If the normal parameterization is used (ρ = x cos θ + y sin θ), a single fiducial point in the x-y image space is mapped to a sinusoidal curve in the quantized ρ-θ parameter space by scanning the whole range of the parameter θ. If the mapped curves are accumulated in the 2-D parameter space, the global maxima {ρ_max, θ_max} correspond to the prominent text line orientations of the image. Since the parameter-mapping process is very time-consuming, all the Hough-transform based methods make aggressive computation reductions in order to achieve acceptable performance for real-world documents, usually with compromises in accuracy and other properties. The most used computation reduction strategy is to sub-sample the input images [22], usually more than once in a hierarchical manner.
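To make the normal parameterization ρ = x cos θ + y sin θ concrete, the sketch below accumulates fiducial points into a quantized ρ-θ array and reads the skew off the strongest cell. It is a minimal sketch; the quantization steps and the choice of centroids as fiducial points are assumptions for illustration.

```python
import numpy as np

def hough_skew(points, img_diag, theta_res=0.1, rho_res=1.0):
    """Accumulate fiducial points (x, y) under rho = x*cos(t) + y*sin(t).

    points: iterable of (x, y) fiducial points (e.g. component centroids).
    img_diag: image diagonal length, which bounds |rho|.
    Returns the dominant text-line angle in degrees.
    """
    thetas = np.deg2rad(np.arange(-90.0, 90.0, theta_res))
    n_rho = int(2 * img_diag / rho_res) + 1
    acc = np.zeros((n_rho, len(thetas)), dtype=np.int32)
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    for x, y in points:
        rho = x * cos_t + y * sin_t            # one sinusoid per point
        idx = ((rho + img_diag) / rho_res).astype(int)
        acc[idx, np.arange(len(thetas))] += 1
    # The strongest cell corresponds to the most collinear fiducial points.
    rho_i, theta_i = np.unravel_index(acc.argmax(), acc.shape)
    return np.rad2deg(thetas[theta_i])
```

The cost of the loop over all points and all θ values is exactly why, as noted above, practical methods sub-sample the image or restrict the θ range.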

The choice of fiducial points is either the foreground image pixels or certain feature points in the components or runs. Different choices reflect different strategies in the straight-line feature representation and computational cost reduction. Srihari et al. [19] directly use the foreground pixels. Hinds et al. [20] use the bottom pixel of a vertical foreground run. Le et al. [21] use the pixels of the last row of a textual block. Ham et al. [23] use the extracted edges of the blocks formed by constrained run-length encoding. Yu et al. [24] use the centroids of connected components. Min et al. [25] use a divided horizontal histogram method that improves on that of Hinds et al. Pal et al. [26] first extract the bounding boxes of the connected components, then use the leftmost pixels of the uppermost runs and the rightmost pixels of the lowermost runs of the bounding boxes of those that are most likely to be characters. There are other works [29][30] that use even more variations of the above-mentioned choices.

2.1.3 Nearest-neighbor clustering based skew estimation class

Typical nearest-neighbor clustering (k-NN) based skew estimation methods explore spatial clues to establish component pairs that are further linked into component groups, which are supposed to belong to a text line. In each group, the fiducial points of the components are used to approximate the orientation of the text line to which they are supposed to belong. For the skew angle of the whole page, the estimated skew angles of the individual groups are usually subject to a certain weighted-average evaluation.

There are several algorithms for the nearest-neighbor searching. Hashizume et al. [31] take only 1-NN pairs from a proximity tree (a minimum spanning tree) that contains all the components. O'Gorman [32] takes 5-NN pairs directly from the general k-NN algorithm, with the well-known optimization of first sorting the components by the x-coordinates of their fiducial points. Other works [33][34][35] that use similar techniques can also be found.


There are several algorithms for deriving the orientations of the text lines that the component groups represent. Hashizume et al. [31] use a histogram to accumulate the directions of the 1-NN pairs, then directly search for the peak to locate the dominant skew angle. O'Gorman [32] takes a similar approach for the 5-NN pairs, using a smoothed histogram. Smith [33] uses line fitting with a range estimator to avoid the influence of outliers.

Since the nearest-neighbor clustering based skew estimation is carried out in local groups, where the distances among the fiducial points of the components are short, the precision of the estimated skew angle is usually not as high as that of methods that work on a global scale, where the distances among the fiducial points are longer.
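A minimal sketch in the spirit of the 1-NN pair histogram attributed above to Hashizume et al.: pair each centroid with its nearest neighbor, histogram the pair directions, and take the peak. The use of scipy's KD-tree, the bin width, and the angle folding are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def nn_pair_skew(centroids, bin_deg=0.5):
    """Estimate the dominant skew from nearest-neighbor pair directions.

    centroids: (N, 2) array of component centroids (x, y).
    Returns the center of the most populated angle bin, in degrees.
    """
    tree = cKDTree(centroids)
    # k=2: for each point, its nearest neighbor other than itself.
    _, idx = tree.query(centroids, k=2)
    vec = centroids[idx[:, 1]] - centroids
    angles = np.degrees(np.arctan2(vec[:, 1], vec[:, 0]))
    angles = (angles + 90.0) % 180.0 - 90.0     # fold into [-90, 90)
    bins = np.arange(-90.0, 90.0 + bin_deg, bin_deg)
    hist, edges = np.histogram(angles, bins=bins)
    peak = hist.argmax()
    return (edges[peak] + edges[peak + 1]) / 2.0
```

Because the pair distances here are short (neighboring characters), the angular quantization is coarse, which is precisely the precision limitation noted in the paragraph above.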

2.1.4 Morphological operation based skew estimation class

Typical morphological operation based skew estimation methods use pixel-level morphological operations to group or erase neighboring foreground pixels in order to form structures that represent the text lines.

In the method of Chen et al. [37], an image is first down-sampled (from 300 dpi to 100 dpi for the images in UW-I) to reduce the computational cost and at the same time achieve a feature-blurring effect. Then, a special morphological closing operation with a structuring element is applied to the down-sampled image to fill the gaps between neighboring characters. By controlling the closing threshold value, most of the inter-character gaps are closed while most of the inter-line gaps still remain. In the next step, a special morphological opening operation is applied to the closed image to remove the ascenders and the descenders of the characters, as well as the "over-fills". By controlling the opening threshold value, the resultant image has only long, smooth black stripes left in place of the original text lines. After that, connected component analysis is applied to the image to extract the black stripes for the subsequent least-squares line fitting. Since the black stripes are irregular in shape, the least-squares fitting finds best-fit lines that usually deviate from the true text line directions. They assume that the fitted lines have a prior normal distribution. In the scatter plot given in their paper (Figure 1 in ref. [37]), the distribution spreads over a range as wide as 8° (and does not closely resemble a normal distribution, in our opinion). They use the median of this plot plus a small variance (another parameter) to select the "good" lines for estimating the "optimal estimate of the test skew angle" in a Bayesian framework. The test images total 12617 = 11 × (168 + 979) from UW-I. The factor 11 represents the original images plus additional images created by rotating each image at 10 intervals. The estimated skew angles and the ground truths of all the test images are subjected to a training process using a regression tree model. This gives the optimal values for the 3 parameters: the thresholds for the closing and opening operations, and their structuring elements.
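The closing-then-opening pipeline described above can be sketched as follows. This is an illustrative approximation using plain rectangular structuring elements from scipy, not the thresholded operations or trained parameters of Chen et al.; the element sizes are guesses that would need tuning.

```python
import numpy as np
from scipy import ndimage

def text_line_stripes(binary_img):
    """Turn text into line-shaped stripes via morphological closing/opening.

    binary_img: 2-D boolean array, foreground (ink) = True, e.g. a
    down-sampled page. Structuring element sizes below are assumptions
    tied to the typical inter-character and inter-line gaps.
    """
    # Wide, short element: bridge inter-character gaps, not inter-line gaps.
    closed = ndimage.binary_closing(binary_img, structure=np.ones((1, 15)))
    # Opening removes ascenders/descenders and small over-fills,
    # leaving smooth black stripes where the text lines were.
    stripes = ndimage.binary_opening(closed, structure=np.ones((3, 15)))
    # Label the stripes; each label can then be line-fitted for its slope.
    labels, n = ndimage.label(stripes)
    return labels, n
```

Each labeled stripe can then be fed to a least-squares line fit, whose slopes are pooled to estimate the page skew, as the text above describes.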

We have several concerns regarding their training methodology. In machine learning, the training set and the test set should not be the same. Therefore, it is unclear how suitable these tuned optimal parameters are for samples other than those of UW-I. These concerns will affect the interpretation of the experimental results involving those from Chen, Bloomberg [13], and two of our models in Section 6.1.1, "Suite tests using UW-I".

2.1.5 Spatial frequency based skew estimation class

Typical spatial frequency based skew estimation methods treat the text lines in a textual document image as textures or patterns. They use the Fourier transform, or other waveforms such as the distributions in Cohen's class, to reveal such a global trend. This class of methods usually depends on the availability of dominant text lines. They cannot provide the local information that is needed for multiple-skew estimation.

Sauvola et al. [43] treat the text lines on a page as textures that repeat over the page. They first shrink a textual document image for the purposes of reducing the processing cost and blurring the image, forming a texture-like surface with more-or-less homogeneous grayscale areas ("image texturizing", as they call it). Then, the image is convolved with 2 masks to get a gradient map, from which the texture direction is analyzed. The unambiguous direction is taken as the skew angle of the image. Another gradient-direction based method is by Sun et al. [44]. They not only estimate the skew angle of the text lines, but also predict the slant angle of the characters along a text line.

A more interesting spatial frequency based skew estimation method is from Aghajan et al. [45]. Their novel approach is based on the analogy between a straight line in an image and a planar propagating wave front impinging on an array of sensors. A good theoretical analysis is given by Sheinvald et al. [46].

Kavallieratou et al. [47] use Cohen's class distributions to estimate the skew angle of a document. They first project the page horizontally to form a page profile. Then, they apply 7 time-frequency distributions from the Cohen's class used in signal processing to the horizontal profile. The idea is to detect how the pronounced peaks and dips in the horizontal profile correspond to the change in the frequency distribution with the skew angle.

2.1.6 Other approaches of skew estimation methods

There are still many other varieties of skew estimation methods published in the literature, such as the machine learning based [50], line fitting based [48][49], interline correlation based [51], text line accumulation based [52], and so on. They are included here to show the different angles from which researchers treat the skew estimation problem.

2.2 Related Work on Page Segmentation

The contents of a machine-produced document for human readers are usually arranged in some specific way. This arrangement is called the page layout. Document image analysis systems need to segment the contents into isolated blocks in a hierarchical manner, such as paragraphs or columns, in order to provide local content processing or global functionality classification. As an integral processing stage of any document analysis system, page segmentation is even more important when more than one page of a document is scanned into a single image, such as in the so-called 2-Up style. In the context of skew estimation, a certain kind of page layout analysis is inevitable. This is the motivation for the proposed component grouping function in Chapter 4 that can process multi-page images with possible multiple skews.

Page segmentation (also known as geometric layout analysis) is one of the two parts of page layout analysis; the other part is block classification (also known as logical layout analysis). We discuss the page segmentation methods in the context of skew estimation in order to emphasize the close relationship between the two. Therefore, we mainly categorize the page segmentation methods based on their approaches to the grouping of the textual regions. Several references are discussed in both the previous section on skew detection and this section on page segmentation. This again shows the close relationship between skew estimation and page segmentation.

2.2.1 Connected component analysis based

Typical connected component based page segmentation methods take a bottom-up approach. They start from the lowest component level and move up to higher component group levels by identifying the lines, paragraphs, and columns. Since non-textual components do not have reliable features on which to make grouping decisions, these methods are only interested in components that are likely to be text. Because of the bottom-up approach, this class of segmentation methods can deal with textual documents with multiple skews.

O'Gorman's Docstrum [32] takes this approach. His Docstrum, which is an angle-distance scatter plot for all the nearest-neighbor pairs of components, not only shows the clustering of the component pairs that have similar directions (for skew estimation), but also shows the inter-component and the inter-line spacing (for component grouping). Therefore, by following both the skew direction and its perpendicular, text lines and paragraphs can be extracted. This method is interesting because it can achieve multiple functions on its own.

Hönes et al. [70] also use nearest neighboring components as the starting point for text grouping. Their nearest-neighbor finding algorithm is an effective heuristic that sets a search distance of 3 times the average component size. The components found within this range are further merged to form lines and paragraphs. An optimization function is used to enforce the parallelism and homogeneity of the component groups.

Déforges et al. [71][73] use a multi-resolution pyramid to join components into text lines and discard non-textual groups using the projection-match information from various levels. For the component groups that may represent text lines, the slopes, component sizes, line alignments, and homogeneity are tested to form the final blocks. This method can be used for postal processing.

Chen et al. [76] build statistical models for textual structures such as words, text lines, and paragraphs. The idea is to use the correlations and likelihoods of textual structures to assist the identification of the textual structures in a page for the ground truth generation of their databases. They use connected component analysis to extract the bounding boxes as the basic units of their hierarchical representation of a page. Then the components are classified as text/non-text by machine learning.

2.2.2 Projection profile based

Typical projection profile based page segmentation methods are based on the idea that, since the peaks and valleys in the projection profiles can be used to detect the skew angles of a page, they should naturally extend to identifying the Manhattan-styled (also referred to as isothetic) paragraphs and columns. Since this class of methods uses the horizontal/vertical projections for cutting the page into rectangular blocks, it has little tolerance for the existence of significant skew. A preprocessing step of skew correction is needed. There is no report of this class of methods segmenting documents with multiple skews.

Baird [56][57] proposes a parametric model for the representation of generic textual structures in a document image. Under the conditions that the skew angles of the images are within ±5°, the text lines are parallel within ±0.05°, and character touching is minimal, he uses his well-known projection profile based method to estimate the skew angle, the vertical columns, and the horizontal text lines. Then, the process moves from "global to local" to segment characters and words.

The well-known X-Y cut page segmentation method and its numerous variations [55][61][74][75][89][92] also belong to this class. The documents to be processed should have minimal noise, skew, and touching characters. The projection profiles are used to guide the decomposition of the page. The projections may take different forms; for example, Ha et al. [74][75] project the bounding boxes of the connected components horizontally and vertically to get the needed profiles.
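A minimal recursive X-Y cut sketch, to make the idea above concrete: alternate horizontal and vertical projections and split at sufficiently wide projection valleys. The gap threshold and the cut-at-first-qualifying-valley policy are arbitrary assumptions for illustration.

```python
import numpy as np

def xy_cut(img, x0=0, y0=0, min_gap=10, blocks=None):
    """Recursively split a binary page image at wide projection valleys.

    img: 2-D array, foreground = 1. Returns a list of (x, y, w, h)
    blocks in the original page coordinates.
    """
    if blocks is None:
        blocks = []
    h, w = img.shape
    for axis in (1, 0):  # row profile first (horizontal cut), then columns
        profile = img.sum(axis=axis)
        empty = np.flatnonzero(profile == 0)
        # Group the empty indices into consecutive runs (valleys).
        runs = np.split(empty, np.flatnonzero(np.diff(empty) > 1) + 1)
        for run in runs:
            if len(run) >= min_gap and run[0] > 0 and run[-1] < len(profile) - 1:
                a, b = run[0], run[-1] + 1
                if axis == 1:   # valley in the row profile: cut horizontally
                    xy_cut(img[:a, :], x0, y0, min_gap, blocks)
                    xy_cut(img[b:, :], x0, y0 + b, min_gap, blocks)
                else:           # valley in the column profile: cut vertically
                    xy_cut(img[:, :a], x0, y0, min_gap, blocks)
                    xy_cut(img[:, b:], x0 + b, y0, min_gap, blocks)
                return blocks
    blocks.append((x0, y0, w, h))  # no cut found: this is a leaf block
    return blocks
```

Because the cuts are strictly axis-aligned, any significant skew smears the valleys shut, which is exactly the limitation of this class noted above.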


2.2.3 Morphological operation based

Typical morphological operation based page segmentation methods apply the Run Length Smoothing Algorithm (RLSA) to an image to fill the small gaps among neighboring foreground pixels, so that long stripes are formed that represent the text lines in the original image.

Wong et al [94] use their run-length smearing algorithm on a noise- and skew-free image along the x- and the y-coordinates separately. The two resultant images are logically AND’ed and horizontally smoothed to produce a new image. The small foreground blocks in the new image represent the text lines in the original image.
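A minimal sketch of the scheme (the run-length thresholds here are illustrative assumptions; Wong et al derive theirs from the document resolution):

```python
import numpy as np

def rlsa_1d(row, threshold):
    # fill runs of 0s no longer than threshold that end at a 1
    out = row.copy()
    run_start = None
    for i, v in enumerate(row):
        if v == 0:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start <= threshold:
                out[run_start:i] = 1
            run_start = None
    return out

def rlsa(binary, h_thresh=30, v_thresh=50):
    """Run-length smoothing in the style of Wong et al [94]: smear
    along x and y separately, AND the two results, then smooth
    horizontally once more."""
    horiz = np.array([rlsa_1d(r, h_thresh) for r in binary])
    vert = np.array([rlsa_1d(c, v_thresh) for c in binary.T]).T
    combined = horiz & vert
    return np.array([rlsa_1d(r, h_thresh) for r in combined])
```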

There are many other systems that use this approach for their text block extraction stage [8][20][64].

2.2.4 Background based

While the majority of page segmentation methods focus on the foreground features in the images, the background or white space is, as Baird points out [67], “a generic layout delimiter” adopted as part of “nearly universal conventions of legibility”. Therefore, this class of segmentation methods does not require prior knowledge of the layout and is insensitive to the page orientation.

In Baird’s approach [67][68], the images are first skew corrected and non-textual components are removed. Then the small “white rectangles” in the background are merged, but only if they form a larger rectangle. This process continues until no further expansion is possible, which leaves the text regions completely isolated. By analyzing the distribution of the white covering rectangles, he is able to identify the columns, paragraphs and text lines.
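This is not Baird’s algorithm itself, but a building block such methods commonly rely on (a sketch; the inclusive coordinate convention is an assumption): an integral image makes the “is this rectangle all background?” test O(1), which keeps the repeated merge-and-expand queries cheap.

```python
import numpy as np

def integral_image(binary):
    # binary: 2-D array, 1 = foreground ink, 0 = background
    return binary.cumsum(axis=0).cumsum(axis=1)

def is_white(ii, top, left, bottom, right):
    # O(1) check that rows top..bottom, cols left..right hold no ink
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total == 0
```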

Normand et al [72] use a 2-D isotropic structuring element that is a regular octagon approximating a disc. These structuring elements grow recursively on the background to identify the background streams. By applying an isotropic multi-scale smearing, the image is completely segmented. Because an octagon rather than a rectangle is used as the shape of the structuring element, their method requires neither a Manhattan styled nor a skew corrected page.
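A sketch of the isotropic ingredient only (the radius and the use of a plain morphological closing are assumptions; Normand et al’s recursive multi-scale procedure is more elaborate): an octagonal footprint can be built as the intersection of a square and a diamond.

```python
import numpy as np
from scipy import ndimage

def octagon(radius):
    # regular octagon = intersection of a square (Chebyshev ball)
    # and a diamond (L1 ball) of suitable sizes
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    return (np.maximum(np.abs(x), np.abs(y)) <= radius) & \
           (np.abs(x) + np.abs(y) <= 1.5 * radius)

def isotropic_smear(binary, radius=8):
    # closing fills gaps narrower than the element in any direction,
    # so the page needs neither deskewing nor a Manhattan layout
    return ndimage.binary_closing(binary, structure=octagon(radius))
```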

2.2.5 Other approaches

There are still some other approaches, such as the method from Jain et al using Gabor filters [97] and the method from Tang et al using fractal geometry [80][83]. They all have special merits in their algorithm design and applications.

Chapter 3 Skew Estimation from Fiducial Lines

Skew estimation for textual document images needs to overcome some major obstacles at present. One of them is the presence of excessive interfering noise originating from the non-textual objects in the document images. Many existing methods require proper separation of the textual objects, which are well aligned, from the non-textual objects, which are mostly non-aligned. Some comparative evaluation work on the existing methods chooses only the text zones of the test image database. The object filtering or zoning stage is therefore crucial to the skew detection stage. However, it is difficult, if not impossible, to design general-purpose filters that are able to discriminate noise from textual components.

This chapter presents a robust, general-purpose skew estimation model that does not need any filtering or zoning preprocessing. In fact, this model does apply filtering, but not on the input components at the beginning of the process; rather, it applies filtering on the output histogram at the end of the process. The problem of finding an applicable textual component filter is thereby transformed into finding an equivalent convolution filter on the output accumulator array.

This model consists of the following steps: (i) the calculation of the slopes of the virtual lines that pass through the centroids of all the unique pairs of connected components in an image, and the quantization of the arctangents of the slopes into a 1-D accumulator array that covers the range from -90° to +90°; (ii) the identification of prominent peaks by applying a convolution with a special kernel to the resultant histogram, where the prominent peaks in the convolved histogram correspond to the possible skew angles of the image; (iii) the validation of the detection result by evaluating a signal/noise ratio in order to rule out unreliable results.
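A minimal sketch of steps (ii) and (iii), assuming a 1-D Laplacian-of-Gaussian kernel (the kernel width, the SNR definition and its threshold below are illustrative assumptions; the model's own choices are described later in this chapter):

```python
import numpy as np

def log_kernel(sigma=4.0, radius=12):
    # 1-D Laplacian-of-Gaussian; sigma and radius are assumptions
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    log = (x ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * g
    return log - log.mean()                 # zero-mean: flat input -> 0

def detect_skew(hist, sigma=4.0, radius=12, snr_min=3.0):
    """Steps (ii) and (iii): convolve the slope histogram with a
    (negated) LoG kernel, take the strongest response as the skew
    candidate, and keep it only if it stands out from the noise."""
    response = np.convolve(hist.astype(float), -log_kernel(sigma, radius),
                           mode="same")
    peak = int(np.argmax(response))
    snr = response[peak] / (np.median(np.abs(response)) + 1e-9)
    angle = -90.0 + peak * (180.0 / len(hist))
    return (angle, snr) if snr >= snr_min else (None, snr)
```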

Its computational complexity and detection precision are uncoupled, unlike projection-profile based or Hough-transform based methods, whose speed drops when higher precision is demanded. Speedup measures on the baseline implementation are also presented. UW-I contains a large number of samples with a significant amount of noise; it is therefore a good test suite for evaluating the noise immunity of the proposed model.

3.1 Skew Estimation

This model works on the components extracted from a binary image. It uses the centroids of the components as the representation of the components (fiducial points), as centroids are rotation-invariant and thus a proper choice for skew detection in the range of ±90°. It traverses all the unique pairs of components to calculate the slopes of the virtual lines passing through their centroids (fiducial lines), as illustrated in Figure 3.1. The arctangent values of the slopes of the fiducial lines are computed and quantized into an accumulator array to form an angle histogram, as shown in Figure 3.4. The prominent peaks in the histogram are the candidates for the detected skew angles. Figure 3.2 superimposes on the input image of Figure 3.1 the fiducial lines drawn along the angle at the peak position of the histogram in Figure 3.4.

Figure 3.1 An enlarged portion of a document image superposed with the fiducial lines drawn among the centroids of components

Figure 3.2 The fiducial lines are drawn on the image in Figure 3.1 along the angles of 1.72±0.02°

3.1.1 Histogram generation

The complete skew estimation process is shown in the flowchart in Figure 3.3. Given a document image for skew estimation, the slope histogram is obtained in the following steps:

Image binarization: This model works at the component level, so color documents are first converted to grayscale, then to bi-level by an appropriate thresholding method (global or moving window);

Connected-component analysis: An 8-neighbor connectivity analysis is done on the pixels of the bi-level image to extract components, which are stored in a data structure together with their calculated centroids and other important properties;

Component filtering (optional): The proposed model does not require the exclusion of non-textual components. However, filtering can improve, to a certain extent, the accuracy, reliability and efficiency of this model;

Histogram generation: The slope histogram h[i] has a total of N_bin = 9000 bins representing a range from -90° inclusive to +90° exclusive, yielding an angle resolution of 0.02° per bin.

[Flowchart boxes: Input image → Connected-component analysis → Components filtering (optional) → Histogram generation of the slopes of all fiducial lines → Histogram convolution with a LoG kernel → Peak searching for skew angle determination → Results verification → Skew]

Figure 3.3 The flowchart of the fiducial line based skew estimation model
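A minimal sketch of the histogram generation step under the stated parameters (9000 bins over [-90°, +90°)); SciPy's labeling and centroid routines stand in here for the model's own connected-component code, and the O(n²) pair loop is written plainly, without speedup measures:

```python
import numpy as np
from scipy import ndimage

N_BIN = 9000                  # 180 degrees / 9000 bins = 0.02 degrees/bin

def slope_histogram(binary):
    """Label connected components, take their centroids as fiducial
    points, and quantize the arctangents of the slopes of all unique
    centroid pairs into a histogram covering [-90, +90) degrees."""
    labels, n = ndimage.label(binary, structure=np.ones((3, 3)))  # 8-neighbor
    cents = np.array(ndimage.center_of_mass(binary, labels, range(1, n + 1)))
    hist = np.zeros(N_BIN, dtype=np.int64)
    for i in range(n - 1):
        # centroids are (row, col); image rows grow downward, so the
        # sign convention of the angle may need flipping for display
        dy = cents[i + 1:, 0] - cents[i, 0]
        dx = cents[i + 1:, 1] - cents[i, 1]
        angles = np.degrees(np.arctan2(dy, dx))        # in (-180, 180]
        angles = np.where(angles >= 90.0, angles - 180.0, angles)
        angles = np.where(angles < -90.0, angles + 180.0, angles)
        bins = ((angles + 90.0) * (N_BIN / 180.0)).astype(int)
        np.add.at(hist, np.clip(bins, 0, N_BIN - 1), 1)
    return hist
```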
