MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY (TRUONG DAI HOC BACH KHOA HA NOI)

THEODOR CONSTANTINESCU

OPTICAL CHARACTER RECOGNITION USING NEURAL NETWORKS

Master of Science thesis
Major: Information Processing and Communication
Scientific supervisor: NGUYEN LINH GIANG

Hanoi, 2009
I. Introduction

The difficulty of the dialogue between man and machine comes on one hand from the flexibility and variety of the modes of interaction that we are able to use: gesture, speech, writing, etc., and on the other hand from the rigidity of those classically offered by computer systems. Part of the current research in IT is therefore the design of applications best suited to the different forms of communication commonly used by man. The aim is to provide computer systems with features for handling the information that humans themselves manipulate every day.
In general the information to process is very rich. It can be text, tables, images, words, sounds, writing, and gestures. In this paper I treat the case of writing, more precisely printed character recognition. Depending on the application and personal context, the way this information is represented and transmitted is very variable. Just consider, for example, the variety of styles of writing that exists between different languages and even within the same language. Moreover, because of the sensitivity of the sensors and of the media used to acquire and transmit it, the information to be processed often differs from the original. It is therefore marred by inaccuracies, either intrinsic to the phenomena it describes or introduced along the way by transmission. Its treatment requires the implementation of complex analysis and decision systems. This complexity is a major limiting factor in the dissemination of such systems. This remains true despite the growth of computing power and the improvement of processing systems, since research is at the same time directed towards the resolution of more and more difficult tasks and towards the integration of these applications into cheaper and therefore lower-capacity devices.
Optical character recognition (OCR) translates images of typewritten or printed text, captured from a scanned page or any other computer generated document, into machine-editable text. In its modern form, it is a form of artificial intelligence pattern recognition.

OCR is the most effective method available for transferring information from a classical medium (usually paper) to an electronic one. The alternative would be a human reading the characters in the image and typing them into a text editor, which is obviously a stupid, Neanderthal approach when we possess computers with enough power to do this mind-numbing task. The only thing we need is the right OCR software.
Before OCR can be used, the source material must be scanned using an optical scanner (and sometimes a specialized circuit board in the PC) to read in the page as a bitmap (a pattern of dots). Software to recognize the images is also required.
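To make this first step concrete, here is a minimal sketch that loads a scanned page and reduces it to a bilevel bitmap. It assumes the Pillow imaging library and a hypothetical file name page.png; it illustrates only the acquisition step, not the recognizer itself.

```python
from PIL import Image  # Pillow, assumed installed

# Read a scanned page (hypothetical file name) as 8-bit grayscale.
page = Image.open("page.png").convert("L")

# Bilevel threshold: pixels darker than T become ink, the rest background.
T = 128
bilevel = page.point(lambda v: 0 if v < T else 255).convert("1")

bilevel.save("page_bilevel.png")  # stored at one bit per pixel
```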
The OCR software then processes these scans to differentiate between images and text and to determine what letters are represented in the light and dark areas.

The approach of older OCR programs was still primitive: it was simply to compare the characters to be recognized with sample characters stored in a database. Imagine the number of comparisons, considering how many different fonts exist. Modern OCR software uses complex neural-network-based systems to obtain better results, much more exact identification, actually close to 100%.

Today's OCR engines apply multiple algorithms of neural network technology to analyze the stroke edge, the line of discontinuity between the text characters and the background. Allowing for irregularities of printed ink on paper, each algorithm averages the light and dark along the side of a stroke, matches it to known characters and makes a best guess as to which character it is. The OCR software then averages or polls the results from all the algorithms to obtain a single reading.
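This polling step can be pictured with a small sketch: three stand-in "algorithms" each return scores for candidate characters, and the reading with the best average wins. The recognizers and scores are invented for illustration, not taken from any particular OCR engine.

```python
from collections import defaultdict

def poll(candidate_scores):
    """Average the per-character scores returned by several recognizers
    and return the single best reading."""
    totals = defaultdict(float)
    for scores in candidate_scores:          # one score table per algorithm
        for char, score in scores.items():
            totals[char] += score
    best = max(totals, key=totals.get)
    return best, totals[best] / len(candidate_scores)

# Three hypothetical algorithms scoring the same stroke pattern:
votes = [
    {"e": 0.70, "c": 0.20},   # edge-analysis guess
    {"e": 0.55, "o": 0.30},   # stroke-averaging guess
    {"c": 0.50, "e": 0.45},   # template-matching guess
]
print(poll(votes))  # -> ('e', 0.5666...)
```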
Advances have made OCR more reliable; expect a minimum of 90% accuracy for average-quality documents. Despite vendor claims of one-button scanning, achieving 99% or greater accuracy takes clean copy, practice setting scanner parameters, and requires you to "train" the OCR software with your documents.

The first step toward better recognition begins with the scanner. The quality of its charge-coupled device light arrays will affect OCR results: the more tightly packed these arrays, the finer the image and the more distinct colors the scanner can detect.

Smudges or background color can fool the recognition software. Adjusting the scan's resolution can help refine the image and improve the recognition rate, but there are trade-offs.

For example, in an image scanned at 24-bit color with 1,200 dots per inch (dpi), each of the 1,200 pixels per inch has 24 bits' worth of color information. This scan will take longer than a lower-resolution scan and produce a larger file, but OCR accuracy will likely be high. A scan at 72 dpi will be faster and produce a smaller file, good for posting an image of the text to the Web, but the lower resolution will likely degrade OCR accuracy. Most scanners are optimized for 300 dpi, but scanning at a higher number of dots per inch will increase accuracy for type under 6 points in size.
we know that P(E | Hi) = 30 / 40 = 0.75 and P(E | H2) = 20 / 40 = 0.5 Bayos's formula then vields
Before we observed the cookie, the probability we assigned for Fred having chosen bowl
441 was the prior probability, PU), which was 0.5 Aller abscrving the cookie, we aust
revise the probability to P(H1 £), which is 0.6,
HIDDEN MARKOV MODEL
Hidden Markov models are a promising approach in different application areas
Which method will be more effective depends on the image being scanned. A bilevel scan of a shopworn page may yield more legible text. But if the image to be scanned has text in a range of colors, as in a brochure, text in lighter colors may drop out.

On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years. Among these are the input devices for personal digital assistants such as those running Palm OS. The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.

Whereas commercial and even open source OCR software performs well on, let's say, usual images, a particularly difficult problem for both computers and humans is that of the old religious registers of baptisms and marriages, which contain mainly names, where the pages can be damaged by weather, water or fire, and the names can be obsolete or written in former spellings.

Character recognition has been an active area of research in computer science since the late 1950s. Initially it was thought to be an easy problem, but it turned out to be much more interesting. It will take many decades for computers to read any document with the same precision as human beings.

All the commercial software is quite complex. My aim was to create a simple and reliable program to perform the same tasks.
II. Pattern recognition

Pattern recognition is a major area of computing in which research is particularly active. A very large number of applications may require a recognition module in processing systems designed to automate certain tasks for humans. Among those, handwriting recognition systems are a difficult issue to handle, since they concentrate in one place many of the difficulties encountered in pattern recognition. In this chapter I give a general presentation of the main pattern recognition techniques.
Pattern recognition is the set of methods and techniques with which we can achieve a classification within a set of objects, processes or phenomena. This is accomplished by comparison with models: in the memory of the computer a set of models (prototypes), one for each class, is stored. The new, unknown input (not yet classified) is compared in turn with each prototype and assigned to one of the classes on the basis of a selection criterion: if the unknown form best matches the prototype of class "x", then it belongs to class "x". The difficulties that arise are related to the selection of a representative model, one which best characterizes a class of forms, as well as to the definition of an appropriate selection criterion, able to classify each unknown form univocally. A minimal sketch of this scheme follows.
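The sketch below uses invented 3x3 bitmaps as the stored models; a real recognizer would use richer features than raw pixels, but the selection criterion (smallest distance to a prototype) is the one just described.

```python
import numpy as np

# One stored model (prototype) per class: invented 3x3 bitmaps.
prototypes = {
    "I": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "L": np.array([[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
}

def classify(unknown):
    """Compare the unknown form with each prototype in turn and return
    the class whose model it matches best (smallest pixel distance)."""
    return min(prototypes, key=lambda c: np.abs(prototypes[c] - unknown).sum())

noisy_l = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 1]])  # an "L" with one noisy pixel
print(classify(noisy_l))  # -> L
```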
Pattern recognition techniques can be divided into two main groups: generative and discriminative. There have been long-standing debates on generative versus discriminative methods. Discriminative methods aim to minimize a utility function (e.g. the classification error); they do not need to model, represent, or "understand" the pattern explicitly. For example, nowadays we have very effective discriminative methods that can detect 99.99% of faces in real images with few false alarms, and such detectors do not "know" explicitly that a face has two eyes. Discriminative methods often need large training data, say 100,000 labeled examples, and can hardly be generalized; we should use them if we know for sure that recognition is all we need in an application, i.e. we do not expect to generalize the algorithm to a much broader scope or to other utility functions. In comparison, generative methods try to build models of the underlying patterns, and can be learned, adapted, and generalized with small data. The toy sketch below makes the contrast concrete.
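The following sketch fits a generative classifier (one Gaussian per class, classification by higher likelihood) and a discriminative perceptron (which learns only the decision boundary) on the same invented 1-D data; all numbers are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(-1.0, 0.5, 100)   # invented class-0 samples
x1 = rng.normal(+1.0, 0.5, 100)   # invented class-1 samples

# Generative: model each class by a Gaussian, classify by higher likelihood.
def generative(x):
    loglik = lambda x, m, s: -0.5 * ((x - m) / s) ** 2 - np.log(s)
    return int(loglik(x, x1.mean(), x1.std()) > loglik(x, x0.mean(), x0.std()))

# Discriminative: a perceptron learns only the boundary w*x + b > 0.
w = b = 0.0
for _ in range(20):
    for x, y in zip(np.r_[x0, x1], np.r_[np.zeros(100), np.ones(100)]):
        err = y - int(w * x + b > 0)
        w, b = w + err * x, b + err

print(generative(0.3), int(w * 0.3 + b > 0))  # typically both print 1
```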
BAYESIAN INFERENCE

The logical approach for calculating or revising the probability of a hypothesis is called Bayesian inference. It is governed by the classic rules of probability combination, from which Bayes' theorem derives. In the Bayesian perspective, a probability is not interpreted as the limit of a frequency, but rather as the numerical translation of a state of knowledge (the degree of confidence in a hypothesis).

Bayesian inference is based on the handling of probabilistic statements and is particularly useful in problems of induction. Bayesian methods
differ from standard methods by the systematic application of formal rules for the transformation of probabilities. Before proceeding to the description of these rules, let us review the notation used.
The rules of probability

There are only two rules for combining probabilities, and on them the theory of Bayesian analysis is built. These rules are the addition and multiplication rules.

The addition rule:
$$p(A \cup B \mid C) = p(A \mid C) + p(B \mid C) - p(A \cap B \mid C)$$

The multiplication rule:
$$p(A \cap B) = p(A \mid B)\,p(B) = p(B \mid A)\,p(A)$$
Bayes' theorem can be derived simply by taking advantage of the symmetry of the multiplication rule: since $p(A \mid B)\,p(B) = p(B \mid A)\,p(A)$, dividing both sides by $p(B)$ gives
$$p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}$$
This means that if one knows the consequences of a cause, the observation of the effects allows one to trace back to the causes.
Evidence notation

In practice, when a probability is very close to 0 or 1, events considered in themselves very improbable would have to be observed before the probability changes appreciably.
Evidence is defined as:
$$\mathrm{Ev}(p) = \log \frac{p}{1 - p} = \log p - \log(1 - p)$$

For clarity purposes, we often work in decibels (dB) with the following equivalence:
$$\mathrm{Ev}(p) = 10 \log_{10} \frac{p}{1 - p}$$
An evidence of -40 dB corresponds to a probability of $10^{-4}$, etc. Ev stands for weight of evidence.
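A quick numerical check of this correspondence, as a small sketch using base-10 logarithms as in the definition above:

```python
import math

def ev_db(p):
    """Weight of evidence Ev(p) in decibels."""
    return 10 * math.log10(p / (1 - p))

def prob(ev):
    """Inverse transform: the probability whose evidence is ev dB."""
    odds = 10 ** (ev / 10)
    return odds / (1 + odds)

print(round(ev_db(0.9999), 1))  # -> 40.0
print(prob(-40))                # -> 9.999e-05, i.e. about 10**-4
```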
Comparison with classical statistics

The difference between Bayesian inference and classical statistics is that:

* Bayesian methods use impersonal methods to update personal probabilities, known as subjective (probability is always subjective when one analyses its foundations);
* statistical methods use personal methods in order to treat impersonal frequencies.
The Bayesian and exact conditional approaches to the analysis of binary data are very different, both in philosophy and in implementation. Bayesian inference is based on the posterior distributions of quantities of interest, such as probabilities or parameters of logistic models. Exact conditional inference is based on the discrete distributions of estimators or test statistics, conditional on certain other statistics taking their observed
values. Classical statistics fix a priori an arbitrary method and set of assumptions and do not treat the data until after that. Bayesian methods, because they do not require a fixed prior hypothesis, have paved the way for automatic data mining: there is indeed no more need for prior human intuition to generate hypotheses before we can start working.

When should we use one or the other? The two approaches are complementary. Classical statistics is generally better when information is abundant and cheap to collect, Bayesian methods where it is poor and/or costly to collect. In case of abundant data, the results are asymptotically the same for each method, the Bayesian calculation being simply more costly. In contrast, Bayesian methods can handle cases where classical statistics would not have enough data to apply the limit theorems.

Actually, Altham in 1969 discovered a remarkable result relating the two forms of inference for the analysis of a 2 x 2 contingency table; this result is hard to generalise to more complex examples.

The Bayesian psi-test (which is used to determine the plausibility of a distribution compared to the observations) asymptotically converges to the $\chi^2$ test of classical statistics as the number of observations becomes large. The seemingly arbitrary choice of a Euclidean distance in the $\chi^2$ is perfectly justified a posteriori by the Bayesian reasoning.

Example: From which bowl is the cookie?

To illustrate, suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, and likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?

Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let $H_1$ correspond to bowl #1 and $H_2$ to bowl #2. It is given that the bowls are identical from Fred's point of view, thus $P(H_1) = P(H_2)$, and the two must add up to 1, so both are equal to 0.5. The event $E$ is the observation of a plain cookie. From the contents of the bowls we know that $P(E \mid H_1) = 30/40 = 0.75$ and $P(E \mid H_2) = 20/40 = 0.5$. Bayes' formula then yields
$$P(H_1 \mid E) = \frac{P(E \mid H_1)\,P(H_1)}{P(E \mid H_1)\,P(H_1) + P(E \mid H_2)\,P(H_2)} = \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} = 0.6$$

Before we observed the cookie, the probability we assigned to Fred having chosen bowl #1 was the prior probability, $P(H_1)$, which was 0.5. After observing the cookie, we must revise the probability to $P(H_1 \mid E)$, which is 0.6.
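The same computation, as a small sketch:

```python
def posterior_h1(prior_h1, lik_h1, lik_h2):
    """P(H1 | E) by Bayes' theorem, for two exhaustive hypotheses."""
    num = lik_h1 * prior_h1
    return num / (num + lik_h2 * (1 - prior_h1))

# P(H1) = 0.5, P(E | H1) = 30/40, P(E | H2) = 20/40:
print(posterior_h1(0.5, 30 / 40, 20 / 40))  # -> 0.6
```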
HIDDEN MARKOV MODEL

Hidden Markov models are a promising approach in different application areas where one intends to deal with quantified data that can be partially wrong, for example the recognition of images (characters, fingerprints, the search for patterns and sequences in genes, etc.).

The data production model
A hidden Markov chain is an automaton with states that we will note $m$. When the automaton passes through the state $m$, it emits a piece of information $y_t$ that can take $N$ values. The probability that the automaton emits the signal $n$ when it is in this state,
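To make the production model concrete, here is a minimal sketch of such an automaton with two hidden states and $N = 3$ emission symbols; the transition and emission probabilities are arbitrary illustration values, not parameters from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1],          # A[m, k]: probability of moving from state m to k
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],     # B[m, n]: probability that state m emits symbol n
              [0.1, 0.3, 0.6]])

def generate(T, m=0):
    """Run the hidden automaton for T steps; only the emissions are observed."""
    ys = []
    for _ in range(T):
        ys.append(int(rng.choice(3, p=B[m])))  # emit y_t with probability B[m, n]
        m = int(rng.choice(2, p=A[m]))         # hidden transition to the next state
    return ys

print(generate(10))  # e.g. [0, 0, 1, ...]; the state sequence itself stays hidden
```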
Trang 26differ from standard methods known by the syslcmatic application of formal rules of transformation of probabilitics Before procecding to the description of these rulcs, let's
Teview the notations nsed
‘The rules of probability
‘There are only Iwo rules for combining probabilities, and ơn them the theory of Rayesian analysis is buill, These rules are the addition and rudtiplication miles
The addition rule PAU RIC) = plA[C] + (BC) — plan BIC)
‘the multiplication rule PCAN B) = pl Al Bip(B) = (| Ajp(A)
‘The Bayes theorem can be derived simply by taking advantage of the symmetry of the multiplication rule
p(Bl Aip( A)
PB)
This means that if one knows the consequences of a case, the observation of effects
allows you to trace the causes
#(-1|8) =
Evidence notation
In practice, when probability is very close to 0 of 1, elements considered
themselves as very improbable should be observed to see the probability change
Evidence is defined as:
PB Eu(p) = log an luge — leg (1 — )
for clarity purposes, we often work in decibels (dB) with the following equivalence:
P
Ev(p) — 10 logig ———~- ®) a
An evidences of -40 dB corresponds lo a probability of 104, ele Rv stands for weight af
evidence,
Comparison with classical statistics
The difference between the Bayesian inference and classical statistics is that:
an methods use impersonal meliods lo updilz personal probability, known as
subjective (probability is always subjective, when analysing its fundamentals),
* statistical methods use personal methods in order to treat impersonal frequencies
“The Bayesian and exact conditional approaches to the analysts of binary data are very different, both in philosophy and implementation Bayesian inference is based on the posterior distributions of quantifies of interest such as probabililics or parameters of
logistic models Exact conditional inference is based on the discrete distributions of
estimators or test statistics, conditional on certain other statistics taking their observed
Trang 27software thon pro these scans to differentiate belween images and loxt and detcrmine what Ictters are represented in the light and dark arcas
“The approach in older OCR programs was still animal It was simply to compare the characters to be recognized with the sample characters stored in a data base Imagine the numbers of comparisons, considering how many different fonts exisl Modern OCR sofware use complex nenral-rictwork-bascd sysicmns lo obiain betier resus — much more exact identification — actually close to100%
‘Today's OCR engines add the multiple algorithms of neural network technology
to analyze the stroke edge, the line of discontinuity betwesn the text characters, and the background Atlowing for irregularities of printed ink on paper, cach algorithm averages the light and dark along the side of a stroke, matclies it to known characters and makes a best guess as to which character it is, I'he OCR software then averages ot polls the results from all the algorithms to obtain a single reading
Advances have made OCR more teliable; expect a minimum of 90% accuracy for average-quality documents Despite vendor claims of one-button scanning, achieving, 99% or greater accuracy takes clean copy and practice setting scanmer parameters and
requires yon to “train” the OCR software with your documents
‘The first step toward better recognition begins with the scanner The quality of its charge-coupled device light arrays will affect OCR results The more tightly packed these aurays, the finer the image and the more distinct colors the scanner can detect
Smmadges or background color can fool the recognition software Adjusting the scan's resolution can help refine the image and improve the recognition rate, but there are trade- offs
For example, in an image scanned at 24-bit color with 1,200 dots per inch (dpi), each of the 1,200 pixels has 24 bits’ worth of color information This scan will take longer Than ä lower-ssolulion scan and produce a larger file, bul OCR accuracy will likely be thigh A sear at 72 dpi will be faster anid produce # smaller file—good for posting an image of the text to the Web but the lower resolution will likely degrade OCR accuracy Most scanners are optimized for 300 dpi, but scanning at a higher mumber
of dots per inch will increase accuracy for type under 6 points in sizs
Bilovel (black and white onty) seans arc thie Tull for text documents Bilevel scans are faster and produce smaller files, because unlike 24-bit color scans, they require only one bit per pixel, Some scanners can also let you detenmine how subtle to make the color differentiation
The accurale recognition of Latin-based typewsillen (ext is now considered largdly a solved problem Typical accuracy ralcs zxoccd 99%, although ccrlain applications demanding even higher accuracy require human review for errors Other areas - inclading recognition of cursive handwriting, and printed text in other scripts (especially those with a very large number of characters) - are still the subject of active rosoarah,
Today, OCR software can recognize a wide variety of fouls, bul handwriting and script fonts that mimic handwriting are still problematic, Developers are taking different
approaches to improve script and handwriting recognition OCR software from
ExperVision Inc first identifies the font and then runs ils character-recogrition algorithms,
Trang 28software thon pro these scans to differentiate belween images and loxt and detcrmine what Ictters are represented in the light and dark arcas
“The approach in older OCR programs was still animal It was simply to compare the characters to be recognized with the sample characters stored in a data base Imagine the numbers of comparisons, considering how many different fonts exisl Modern OCR sofware use complex nenral-rictwork-bascd sysicmns lo obiain betier resus — much more exact identification — actually close to100%
‘Today's OCR engines add the multiple algorithms of neural network technology
to analyze the stroke edge, the line of discontinuity betwesn the text characters, and the background Atlowing for irregularities of printed ink on paper, cach algorithm averages the light and dark along the side of a stroke, matclies it to known characters and makes a best guess as to which character it is, I'he OCR software then averages ot polls the results from all the algorithms to obtain a single reading
Advances have made OCR more teliable; expect a minimum of 90% accuracy for average-quality documents Despite vendor claims of one-button scanning, achieving, 99% or greater accuracy takes clean copy and practice setting scanmer parameters and
requires yon to “train” the OCR software with your documents
‘The first step toward better recognition begins with the scanner The quality of its charge-coupled device light arrays will affect OCR results The more tightly packed these aurays, the finer the image and the more distinct colors the scanner can detect
Smmadges or background color can fool the recognition software Adjusting the scan's resolution can help refine the image and improve the recognition rate, but there are trade- offs
For example, in an image scanned at 24-bit color with 1,200 dots per inch (dpi), each of the 1,200 pixels has 24 bits’ worth of color information This scan will take longer Than ä lower-ssolulion scan and produce a larger file, bul OCR accuracy will likely be thigh A sear at 72 dpi will be faster anid produce # smaller file—good for posting an image of the text to the Web but the lower resolution will likely degrade OCR accuracy Most scanners are optimized for 300 dpi, but scanning at a higher mumber
of dots per inch will increase accuracy for type under 6 points in sizs
Which method will be more effective depends on the image being scanned. A bilevel scan of a shopworn page may yield more legible text. But if the image to be scanned has text in a range of colors, as in a brochure, text in lighter colors may drop out.
On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years. Among these are the input devices for personal digital assistants such as those running Palm OS. The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.
Whereas commercial and even open-source OCR software performs well on, let's say, ordinary images, a particularly difficult problem for computers and humans alike is that of the old religious registers of baptisms and marriages, which contain mainly names; their pages can be damaged by weather, water, or fire, and the names can be obsolete or written in archaic spellings.
Character recognition has been an active area of research in computer science since the late 1950s. Initially it was thought to be an easy problem, but it turned out to be much more interesting. It will take many decades before computers can read any document with the same precision as human beings.
All the commercial software is quite complex. My aim was to create a simple and reliable program to perform the same tasks.
II. Pattern recognition
Pattern recognition is a major area of computing in which research is particularly active. A very large number of applications may require a recognition module in processing systems designed to automate certain tasks for humans. Among those, handwriting recognition systems are a difficult issue to handle, as they bring together many of the difficulties encountered in pattern recognition. In this chapter I give a general presentation of the main pattern recognition techniques.
Pattern recognition is the set of methods and techniques with which we can achieve a classification in a set of objects, processes, or phenomena. This is accomplished by comparison with models: in the memory of the computer a set of models (prototypes) is stored, one for each class. The new, unknown input (not yet classified) is compared in turn with each prototype and assigned to one of the classes based on a selection criterion: if the unknown fits best with prototype "s", then it will belong to class "s". The difficulties that arise are related to the selection of a representative model, one that best characterizes a form class, as well as to defining an appropriate selection criterion able to classify each unknown form univocally.
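A minimal sketch of this prototype-matching scheme, assuming Euclidean distance as the selection criterion and tiny 3 x 3 binary glyphs as stand-ins for real feature vectors:

```python
import numpy as np

class PrototypeClassifier:
    """One stored prototype per class; an unknown form is assigned to
    the class whose prototype it matches best (smallest distance)."""

    def __init__(self, prototypes):
        # prototypes: {class_label: feature_vector}
        self.labels = list(prototypes)
        self.matrix = np.stack([prototypes[c] for c in self.labels])

    def classify(self, unknown):
        distances = np.linalg.norm(self.matrix - unknown, axis=1)
        return self.labels[int(np.argmin(distances))]

# Toy example: 3x3 binary "glyphs" flattened to 9-element vectors
protos = {
    "I": np.array([0,1,0, 0,1,0, 0,1,0], dtype=float),
    "L": np.array([1,0,0, 1,0,0, 1,1,1], dtype=float),
}
clf = PrototypeClassifier(protos)
noisy_I = np.array([0,1,0, 1,1,0, 0,1,0], dtype=float)  # one flipped pixel
print(clf.classify(noisy_I))  # -> 'I'
```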
Pattern recognition techniques can be divided into two main groups: generative and discriminative. There have been long-standing debates on generative vs. discriminative methods. Discriminative methods aim to minimize a utility function (e.g., classification error) and do not need to model, represent, or "understand" the pattern explicitly. For example, nowadays we have very effective discriminative methods that can detect 99.99% of faces in real images with few false alarms, yet such detectors do not "know" explicitly that a face has two eyes. Discriminative methods often need large training data, say 100,000 labeled examples, and can hardly be generalized. We should use them if we know for sure that recognition is all we need in an application, i.e., we don't expect to generalize the algorithm to a much broader scope or to other utility functions. In comparison, generative methods try to build models of the underlying patterns, and can be learned, adapted, and generalized with small data.
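To make the distinction concrete, the toy sketch below contrasts the two approaches on the same one-dimensional data: the generative side fits a Gaussian per class and classifies through Bayes' rule, while the discriminative side skips the modeling and directly searches for the threshold with the lowest training error. The data and models are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(-1.0, 1.0, 200)   # class 0 samples
x1 = rng.normal(+1.5, 1.0, 200)   # class 1 samples

# Generative: model p(x | class) as a Gaussian per class, then use
# Bayes' rule with equal priors to classify a new point.
mu0, s0 = x0.mean(), x0.std()
mu1, s1 = x1.mean(), x1.std()
def loglik(x, mu, s):
    return -0.5 * ((x - mu) / s) ** 2 - np.log(s)
def generative_predict(x):
    return int(loglik(x, mu1, s1) > loglik(x, mu0, s0))

# Discriminative: skip the modeling and directly pick the threshold
# that minimizes classification error on the training data.
xs = np.sort(np.concatenate([x0, x1]))
errors = [np.mean(np.r_[x0 >= t, x1 < t]) for t in xs]
threshold = xs[int(np.argmin(errors))]
def discriminative_predict(x):
    return int(x >= threshold)

print(generative_predict(0.4), discriminative_predict(0.4))
```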
BAYESIAN INFERENCE
The logical approach for calculating or revising the probability of a hypothesis is called Bayesian inference. It is governed by the classic rules of probability combination, from which the Bayes theorem derives. In the Bayesian perspective, a probability is not interpreted as the limit of a frequency, but rather as the numerical translation of a state of knowledge (the degree of confidence in a hypothesis).
Bayesian inference is based on the handling of probabilistic statements, and it is particularly useful in problems of induction.
I. Introduction
The difficulty of the dialogue between man and machine comes on the one hand from the flexibility and variety of the modes of interaction that we are able to use: gesture, speech, writing, and so on, and on the other hand from the rigidity of the modes classically offered by computer systems. Part of the current research in IT is therefore the design of applications best suited to the different forms of communication commonly used by man. The aim is to provide computer systems with features for handling the information that humans themselves manipulate every day.

In general the information to process is very rich. It can be text, tables, images, words, sounds, writing, and gestures. In this paper I treat the case of writing, to be more precise, printed character recognition. Depending on the application and the personal context, the way this information is represented and transmitted is very variable; just consider, for example, the variety of styles of writing that exists between different languages, and even within the same language. Moreover, because of the sensitivity of the sensors and of the media used to acquire and transmit it, the information to be processed is often different from the original. It is therefore affected by inaccuracies that are either intrinsic to the phenomena or introduced by the transmission channels, and its treatment requires the implementation of complex analysis and decision systems. This complexity is a major limiting factor in the dissemination of such systems. This remains true despite the growth of computing power and the improvement of processing systems, since research is at the same time directed toward the resolution of more and more difficult tasks and toward the integration of these applications into cheaper, and therefore lower-capacity, devices.
Optical character recognition (OCR) is the electronic translation of images of typewritten or printed text into machine-editable text, whether the image comes from a scanned page, a photograph, or any other computer-generated document. In its modern form, it is a form of artificial-intelligence pattern recognition.
OCR is the most effective method available for transferring information from a classical medium (usually paper) to an electronic one. The alternative would be a human reading the characters in the image and typing them into a text editor, which is obviously a stupid, Neanderthal approach when we possess computers with enough power to do this mind-numbing task. The only thing we need is the right OCR software.
Before OCR can be used, the source material must be scanned using an optical scanner (and sometimes a specialized circuit board in the PC) to read in the page as a bitmap (a pattern of dots). Software to recognize the images is also required.
prion’ an arbitrary method and assumption and don't teat the data until aor that Baycsian methods, because they do not require fixed prior hypothesis, have paved the way for the automatic data mining, there is indeed no more need to use Prior human
intuition to generate hypotheses before we can start working
When should we use one or the other? The two approaches are complementary: the statistical approach is generally better when information is abundant and cheap to collect, the Bayesian one where information is scarce and/or costly to collect. In the case of abundant data, the results are asymptotically the same for each method, the Bayesian calculation being simply more costly. In contrast, the Bayesian approach can handle cases where statistics would not have enough data to apply the limit theorems.
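The asymptotic agreement is easy to check numerically. With a uniform prior, the Bayesian posterior mean for a coin's bias is (k + 1) / (n + 2) (Laplace's rule of succession), while the frequentist estimate is k / n; the counts below are invented for illustration:

```python
# Posterior mean under a uniform Beta(1, 1) prior vs. the MLE k/n
for n, k in [(10, 7), (1000, 700), (100000, 70000)]:
    bayes = (k + 1) / (n + 2)   # Laplace's rule of succession
    mle = k / n
    print(f"n={n:6d}: Bayesian {bayes:.5f}  frequentist {mle:.5f}")
# The two estimates converge as n grows; with little data they differ.
```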
Actually, Altham discovered in 1969 a remarkable result relating the two forms of inference for the analysis of a 2 x 2 contingency table, but this result is hard to generalize to more complex examples.
The Bayesian psi-test (which is used to determine the plausibility of a distribution compared to the observations) asymptotically converges to the χ² test of classical statistics as the number of observations becomes large. The seemingly arbitrary choice of a Euclidean distance in the χ² statistic is perfectly justified a posteriori by the Bayesian reasoning.
Example: from which bowl is the cookie?
To illustrate, suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, and likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1? Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes's theorem. Let H1 correspond to bowl #1 and H2 to bowl #2. It is given that the bowls are identical from Fred's point of view, thus P(H1) = P(H2), and the two must add up to 1, so both are equal to 0.5. The event E is the observation of a plain cookie. From the contents of the bowls, we know that P(E | H1) = 30 / 40 = 0.75 and P(E | H2) = 20 / 40 = 0.5. Bayes's formula then yields

P(H1 | E) = P(E | H1) P(H1) / [P(E | H1) P(H1) + P(E | H2) P(H2)] = (0.75 × 0.5) / (0.75 × 0.5 + 0.5 × 0.5) = 0.375 / 0.625 = 0.6
Before we observed the cookie, the probability we assigned to Fred having chosen bowl #1 was the prior probability, P(H1), which was 0.5. After observing the cookie, we must revise the probability to P(H1 | E), which is 0.6.
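The same computation, spelled out in a few lines of Python (a sketch that generalizes to any number of hypotheses; the function name is mine):

```python
def posterior(priors, likelihoods):
    """Bayes' theorem: P(Hi | E) = P(E | Hi) P(Hi) / sum_j P(E | Hj) P(Hj)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Bowl #1: 30 plain of 40 cookies; bowl #2: 20 plain of 40.
print(posterior(priors=[0.5, 0.5], likelihoods=[0.75, 0.5]))
# -> [0.6, 0.4]: after seeing a plain cookie, bowl #1 goes from 0.5 to 0.6
```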
HIDDEN MARKOV MODEL
Hidden Markov models are a promising approach in various application areas where one intends to deal with quantified data that can be partially wrong, for example the recognition of images (characters, fingerprints, the search for patterns and sequences in genes, etc.).
The data production model
A hidden Markov chain is an automaton with states that we will note m. When the automaton passes through the state m, it emits a piece of information y_t that can take N values. The probability that the automaton emits a signal n when it is in this state,
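Although the page is cut off at this point, the data production model just described can be sketched as follows: a transition matrix drives the hidden state, and each state has its own emission distribution over the N observable values. The matrices below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hidden states, three observable symbols (N = 3)
A = np.array([[0.9, 0.1],          # A[m, m'] = P(next state m' | state m)
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],     # B[m, n] = P(emit symbol n | state m)
              [0.1, 0.3, 0.6]])
pi = np.array([0.5, 0.5])          # initial state distribution

def sample(T):
    """Generate T observations from the hidden Markov chain."""
    states, obs = [], []
    m = rng.choice(2, p=pi)
    for _ in range(T):
        obs.append(rng.choice(3, p=B[m]))   # state m emits symbol y_t
        states.append(m)
        m = rng.choice(2, p=A[m])           # move to the next hidden state
    return states, obs

states, obs = sample(10)
print(states)  # hidden: what the model "did"
print(obs)     # visible: what we actually observe
```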