Luận văn optical character recognition using neural networks

These methods cannot be used in software that scans paper documents, so accurals recognition of hand-prinled documenls is sill largcly an apon problem Accuracy ratzs of 80% to 90% on nea

Trang 1

_BO GIAO DUC VA DAO TAO |

TRUONG DAI HOC BACH KHOA HÀ NỘI

THEODOR CONSTANTINESCU OPTICAL CHARATER RECONGNITION USING NEURAL

NETWORKS LUẬN VĂN THẠC SĨ KHOA HOC CHUYEN NGANH: XU LY THONG TIN VA TRUYEN

THONG

NGƯỜI HƯỚNG DẪN KHOA Hoc:

Hà Nội - 2009

Trang 2

BO GIAO DUC VA DAO TAO

TRUONG DAI IIQOC BACII KIIOA IIA NOI

RRAARAREA g RARERERER

‘THEODOR CONSTANTINESCL

OPTICAL CIIARATER RECONGNITION USING NEURAL

NETWORKS

LUẬN VĂN THAC Si KHOA HOC

CHUYÊN NGÀNH: XỬ 1Ý THÔNG TIN VÀ TRUYEN THONG

NGƯỜI HƯỚNG DẪN KIIOA HỌC: NGUYÊN LINII GIANG

HA NOI 2009

Trang 4

Which malhad will be mors cffective depends on the image being scared A bilevel scan of a shopwom page may yicld morc legible text But if the image to be scanned has text ina range of colors, as in a brochure, text in lighter colors may drop out

On-line systems for recognizing hand-printed text on the fly have become well- known as commercial products in recent years, Among these are the inpul devices for personal digilal assislanis such as those tuning Palm OS The algorithms uscd in these devices take advantage of the fact that the order, speed, and direction of individual lines segments at input are known, Also, the user can be retrained to use only specific letter shapes These methods cannot be used in software that scans paper documents, so accurals recognition of hand-prinled documenls is sill largcly an apon problem Accuracy ratzs of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications

Whereas commercial and even open source OCR software performs well for, lel's say, usual images, a patticularly difficult problem for computers and humans is that of the old religious registers of baptisms and marriages, which contain mainly the names, where the pages can be damaged by weather, water or fire, and the names can be obsolete or wnitten by former spellings

Character recognition is an active area of research for computer science since the late 1950s, Initially, it was thought to be an easy problem, but it appeared that this was a

much more interesting It will take many decades to computers to read any document

with the same precision as human beings

All the commercial software is quite complex My aim was to create a simple and reliable program to perform the same tasks

Trang 5

differ from standard methods known by the syslcmatic application of formal rules of transformation of probabilitics Before procecding to the description of these rulcs, let's

Teview the notations nsed

‘The rules of probability

‘There are only Iwo rules for combining probabilities, and ơn them the theory of Rayesian analysis is buill, These rules are the addition and rudtiplication miles

The addition rule PAU RIC) = plA[C] + (BC) — plan BIC)

‘the multiplication rule PCAN B) = pl Al Bip(B) = (| Ajp(A)

‘The Bayes theorem can be derived simply by taking advantage of the symmetry of the multiplication rule

p(Bl Aip( A)

PB)

This means that if one knows the consequences of a case, the observation of effects

allows you to trace the causes

#(-1|8) =

Evidence notation

In practice, when probability is very close to 0 of 1, elements considered

themselves as very improbable should be observed to see the probability change

Evidence is defined as:

PB Eu(p) = log an luge — leg (1 — )

for clarity purposes, we often work in decibels (dB) with the following equivalence:

P

Ev(p) — 10 logig ———~- ®) a

An evidences of -40 dB corresponds lo a probability of 104, ele Rv stands for weight af

evidence,

Comparison with classical statistics

The difference between the Bayesian inference and classical statistics is that:

an methods use impersonal meliods lo updilz personal probability, known as

subjective (probability is always subjective, when analysing its fundamentals),

* statistical methods use personal methods in order to treat impersonal frequencies

“The Bayesian and exact conditional approaches to the analysts of binary data are very different, both in philosophy and implementation Bayesian inference is based on the posterior distributions of quantifies of interest such as probabililics or parameters of

logistic models Exact conditional inference is based on the discrete distributions of

estimators or test statistics, conditional on certain other statistics taking their observed

Trang 6

prion’ an arbitrary method and assumption and don't teat the data until aor that Baycsian methods, because they do not require fixed prior hypothesis, have paved the way for the automatic data mining, there is indeed no more need to use Prior human

intuition to generate hypotheses before we can start working

When should we use one or the other? The two approaches are comuptementary: the statistic is generally better when information is abundant and Tow cost of collection, Bayesian where it is poor and /or costly to collect, In case of abundance data, the results are asymptotically the same for each method, the Bayesian calculation being simply more costly, In contrast, the Bayesian can handle cases where statistics would not have énough dala to apply the fimail thaorerns,

Actually, Altham in 1969 discovered a remarkable result, relating the two forms

of inference for the analysis of a2 x2 contingency table, this result is hard to generalise tomore complex examples

‘The Bayesian psy-test (which is used to determine the plausibility of a distribution compared to the observations) asymptotically converges to the x in classical statistics as the number of observations becomes large The seemingly arbitrary choice ofa Euclidean distance in the 7 is perfectly justified a posteriori by the Bayesian reasoning

Example: From which bow! is the cookie?

‘To illustrate, suppose there are two full bowls of cookies Bowl #1 has 10 chocolate chip and 30 plain cookies, while bow! #2 has 20 of each, Our fiend l'red picks

a bowl at randorn, and then picks ä cookie al random We tay assume there is ne reason

to believe Fred treats one bow! differently fiom another, likewise for the cookies The cookie turns out to be a plain one, How probable is it that Fred picked it out of bowt #1? Tntuitively, if seems clear thal the answer should be more than a half, since there are more plain cookies im bowl #1 The precise answer is given by Bayes's theorem, Let Ai correspond to bowl #f1, and H+ to bowl 12 It is given that the bowls are identical from Fred's point of view, thus PG) — PCH), and the two must add up to 1, so both are equal

to 0.5 The event 7 is the observation af a plain conkie From the contents of the howls,

we know that P(E | Hi) = 30 / 40 = 0.75 and P(E | H2) = 20 / 40 = 0.5 Bayos's formula then vields

Before we observed the cookie, the probability we assigned for Fred having chosen bowl

441 was the prior probability, PU), which was 0.5 Aller abscrving the cookie, we aust

revise the probability to P(H1 £), which is 0.6,

HIDDEN MARKOV MODEL

Hidden Markov models are a promising approach in different application areas

Trang 7

software thon pro these scans to differentiate belween images and loxt and detcrmine what Ictters are represented in the light and dark arcas

“The approach in older OCR programs was still animal It was simply to compare the characters to be recognized with the sample characters stored in a data base Imagine the numbers of comparisons, considering how many different fonts exisl Modern OCR sofware use complex nenral-rictwork-bascd sysicmns lo obiain betier resus — much more exact identification — actually close to100%

‘Today's OCR engines add the multiple algorithms of neural network technology

to analyze the stroke edge, the line of discontinuity betwesn the text characters, and the background Atlowing for irregularities of printed ink on paper, cach algorithm averages the light and dark along the side of a stroke, matclies it to known characters and makes a best guess as to which character it is, I'he OCR software then averages ot polls the results from all the algorithms to obtain a single reading

Advances have made OCR more teliable; expect a minimum of 90% accuracy for average-quality documents Despite vendor claims of one-button scanning, achieving, 99% or greater accuracy takes clean copy and practice setting scanmer parameters and

requires yon to “train” the OCR software with your documents

‘The first step toward better recognition begins with the scanner The quality of its charge-coupled device light arrays will affect OCR results The more tightly packed these aurays, the finer the image and the more distinct colors the scanner can detect

Smmadges or background color can fool the recognition software Adjusting the scan's resolution can help refine the image and improve the recognition rate, but there are trade- offs

For example, in an image scanned at 24-bit color with 1,200 dots per inch (dpi), each of the 1,200 pixels has 24 bits’ worth of color information This scan will take longer Than ä lower-ssolulion scan and produce a larger file, bul OCR accuracy will likely be thigh A sear at 72 dpi will be faster anid produce # smaller file—good for posting an image of the text to the Web but the lower resolution will likely degrade OCR accuracy Most scanners are optimized for 300 dpi, but scanning at a higher mumber

of dots per inch will increase accuracy for type under 6 points in sizs

Bilovel (black and white onty) seans arc thie Tull for text documents Bilevel scans are faster and produce smaller files, because unlike 24-bit color scans, they require only one bit per pixel, Some scanners can also let you detenmine how subtle to make the color differentiation

The accurale recognition of Latin-based typewsillen (ext is now considered largdly a solved problem Typical accuracy ralcs zxoccd 99%, although ccrlain applications demanding even higher accuracy require human review for errors Other areas - inclading recognition of cursive handwriting, and printed text in other scripts (especially those with a very large number of characters) - are still the subject of active rosoarah,

Today, OCR software can recognize a wide variety of fouls, bul handwriting and script fonts that mimic handwriting are still problematic, Developers are taking different

approaches to improve script and handwriting recognition OCR software from

ExperVision Inc first identifies the font and then runs ils character-recogrition algorithms,

Trang 8

Trang 9

Ls Introduction

The difficulty of the dialogue between man and machine comes on onc hand

from the flexibility and variety of modes of interaction that we are able to use: gesture, speech, writing, etc and also the rigidity of those classically offered by compulsr systems Part of the current research in TT is therefore a

design of applications best suited to different forms of communication

commonly used by man This is to provide the computer systems with features for handling the information that humans manipulate themselves currently every day

In general the information to process is very sich It can be

text, tables, images, words, sounds, writing, and gestures, In this paper I treat the case of writing, to be more precise, printed character recognition By the application and personal contexts the way to sepresent this information and transmit it is very variable Just

consider for example the variety of styles of writing that it is between different languages

and even for the same language Morcever because of the sensitivity of the scnsors and

the media used to acquire and transmit, the information to be processed is often different

from the originals It is therefore characterized by either intrinsic to the phenomena to which they are cither related to them transmission ways inaccuracies Their treatment requires le implzrnentation of ~— complex smalysis

and decision systems This complexity is a major limiting factor in the context of the

dissemination of the informational means This remains true despite the growth of

calculation power and the improvement of processing systems since the

research is al the same time directed towards the resolution of more and more difficult

tasks and to the integration of thesc applications in cheaper and therefore low capacity

any other computer generated document In its modern form, it is a form of artificial

imelligence paler recognition

OCR is the most effective method available for transferring information from a

classical medium (usually, paper) to an electronic one The altemative would be a kuman

reading the characters in the image and typing them into a text editor, which is obviously

aslupid, Neanderthal approach when we possess the compulers with enough power lo do

dis mind-mumbing lask The only thing we need is the Tight OCR software

Before OCR can be used, the source material rust be scanned using an optical

scanner (and sometimes a specialized cixcnit board in the PC) to read in the page as a bitmap (a pattem of dots) Software to recognize the images is also required The OCR

Trang 10

where iL intends lo deal with quantified data thal can be partially wrang for example - recognition of images (charactors, fingcxprints, scarch for patterns and scquenecs in the

genes, etc.)

‘The data production model

A hidden Markov chain is a machine with states that we will nols

When the aufornalon passes through (he stale m if enits a piece of information yt that can

take N values The probability that the automaton emits a signal n when if is in this state,

Trang 11

HIDDEN MARKOV MODEL

Trang 12

Trang 13

p(Bl Aip( A)

PB)

#(-1|8) =

Evidence notation

P

Ev(p) — 10 logig ———~- ®) a

evidence,

Trang 14

Ls Introduction

Trang 15

Trang 16

HIDDEN MARKOV MODEL

Trang 17

HIDDEN MARKOV MODEL

Trang 18

Trang 19

Ls Introduction

Trang 20

IL Pattern recognition

Pattcrn recognition is a major arca of computing in which scarches arc particularly active There are a very Large number of applications that may require a recognition module in processing systems designed to automate certain tasks for humans Among those handwriting recogrition systems are # difficult issue to handle as they are grouped alone much of the difficulties encountered in pattern recognition, In this chapter

I give a general presentation of the main pattern recognition techniques

Pattern recognition is the set of the methods and techniques with which we can achieve a classification in a set of objects, processes or phenamena This is accomplished

by comparison with models, In memory of the computer a set of models (prototypes), one for each class is stored The new, unknown input (not classified yet) is compared in tum with cach prototype, classifying them into one of the classes being based on a selection criterion’ if the unknown best suits well with the "s" then it will belong to class "2" The difficultics that arise arc rclated to the sclection of a representative model, which best characterizes a form class, as well as detining an appropriate selection criterion, able to univocally classify each unknown form

Pattern recognition techniques can be divided into two main groups: generative

and discriminant, There havz been tong slanding debvaics on goncrative vs, discrimimalive methods, The disctiminative methods aim to minimize a utility function (e.g classification error) and it does not need to model, represent, or “understand” the pattern explicitly For example, nowadays we have very effective discriminative methods They can detect 99.99% ficus in Teal images wilh low false alarms, and such detectors do ot

“know” explicitly that a face has two cycs, Discriminative methods often need large tiaining data, say 100,000 labeled examples, and can hardly be generalized We should use them if we know for sure that the recognition is ali we need in an application, i.e, we don’t expec! lo generalize the algorithm to mach broader scope or utility fimctions Tn comparison, generative methods try to build models for the underlying patterns, and can

be learned, adapted, and generalized with small data,

BAYESIAN INFERENCE

‘The logical approach for calculating or revising the probability ofa hypothesis is called Bayesian inference This is governed by the classic rules of probability combination, from which the Bayes theorem derives Iir the Bayesian perspective probabilily is nol interproted as (he transition Lo the limit of a froqueney, bul ralher as the digital tianslation of a state of knowledge (the degree of confidence in a hypothesis)

“the Bayesian inference is based on the handling of probabilistic statements ‘The Bayesian inference is particularly useful in the problems of induction Bayesian methods

Trang 21

Ls Introduction

Trang 22

p(Bl Aip( A)

PB)

#(-1|8) =

Evidence notation

P

Ev(p) — 10 logig ———~- ®) a

evidence,

Trang 23

BAYESIAN INFERENCE

Trang 24

p(Bl Aip( A)

PB)

#(-1|8) =

Evidence notation

P

Ev(p) — 10 logig ———~- ®) a

evidence,

Trang 25

genes, etc.)

Trang 26

p(Bl Aip( A)

PB)

#(-1|8) =

Evidence notation

P

Ev(p) — 10 logig ———~- ®) a

evidence,

Trang 27

Trang 28

Trang 29

Trang 30

Trang 31

BAYESIAN INFERENCE

Trang 32

Ls Introduction

Trang 33

HIDDEN MARKOV MODEL

Trang 34

genes, etc.)

Trang 35

Trang 36

BAYESIAN INFERENCE

Trang 37

Tiêu đề	Optical Character Recognition Using Neural Networks
Tác giả	Theodor Constantinescu
Người hướng dẫn	Nguyễn Linh Giang
Trường học	Trường Đại Học Bách Khoa Hà Nội
Chuyên ngành	Xử Lý Thông Tin Và Truyền Thông
Thể loại	Luận văn
Năm xuất bản	2009
Thành phố	Hà Nội

Định dạng
Số trang	75
Dung lượng	377,11 KB