
Efficiency, Robustness and Accuracy in Picky Chart Parsing*

David M. Magerman
Stanford University
Stanford, CA 94305
magerman@cs.stanford.edu

Carl Weir
Paramax Systems
Paoli, PA 19301
weir@prc.unisys.com

ABSTRACT

This paper describes Picky, a probabilistic agenda-based chart parsing algorithm which uses a technique called probabilistic prediction to predict which grammar rules are likely to lead to an acceptable parse of the input. Using a suboptimal search method, Picky significantly reduces the number of edges produced by CKY-like chart parsing algorithms, while maintaining the robustness of pure bottom-up parsers and the accuracy of existing probabilistic parsers. Experiments using Picky demonstrate how probabilistic modelling can impact upon the efficiency, robustness and accuracy of a parser.

1 Introduction

This paper addresses the question: Why should we use probabilistic models in natural language understanding? There are many answers to this question, only a few of which are regularly addressed in the literature.

The first and most common answer concerns ambiguity resolution. A probabilistic model provides a clearly defined preference rule for selecting among grammatical alternatives (i.e. the highest probability interpretation is selected). However, this use of probabilistic models assumes that we already have efficient methods for generating the alternatives in the first place. While we have $O(n^3)$ algorithms for determining the grammaticality of a sentence, parsing, as a component of a natural language understanding tool, involves more than simply determining all of the grammatical interpretations of an input. In order for a natural language system to process input efficiently and robustly, it must process all intelligible sentences, grammatical or not, while not significantly reducing the system's efficiency.

This observation suggests two other answers to the central question of this paper. Probabilistic models offer a convenient scoring method for partial interpretations in a well-formed substring table. High probability constituents in the parser's chart can be used to interpret ungrammatical sentences. Probabilistic models can also be used for efficiency by providing a best-first search heuristic to order the parsing agenda.

This paper proposes an agenda-based probabilistic chart parsing algorithm which is both robust and efficient. The algorithm, Picky,¹ is considered robust because it will potentially generate all constituents produced by a pure bottom-up parser and rank these constituents by likelihood. The efficiency of the algorithm is achieved through probabilistic prediction, which helps the algorithm avoid worst-case behavior. Probabilistic prediction is a trainable technique for modelling where edges are likely to occur in the chart-parsing process.² Once the predicted edges are added to the chart using probabilistic prediction, they are processed in a style similar to agenda-based chart parsing algorithms. By limiting the edges in the chart to those which are predicted by this model, the parser can process a sentence while generating only the most likely constituents given the input.

In this paper, we will present the Picky parsing algorithm, describing both the original features of the parser and those adapted from previous work. Then, we will compare the implementation of Picky with existing probabilistic and non-probabilistic parsers. Finally, we will report the results of experiments exploring how Picky's algorithm copes with the tradeoffs of efficiency, robustness, and accuracy.³

* Special thanks to Jerry Hobbs and Bob Moore at SRI for providing access to their computers, and to Salim Roukos, Peter Brown, and Vincent and Steven Della Pietra at IBM for their instructive lessons on probabilistic modelling of natural language.

¹ Pearl = probabilistic Earley-style parser (P-Earl). Picky = probabilistic CKY-like parser (P-CKY).

² Some familiarity with chart parsing terminology is assumed in this paper. For terminological definitions, see [9], [10], [11], or [17].

³ Sections 2 and 3, the descriptions of the probabilistic models used in Picky and the Picky algorithm, are similar in content to the corresponding sections of Magerman and Weir [13]. The experimental results and discussions which follow in sections 4-6 are original.

2 Probabilistic Models in Picky

The probabilistic models used in the implementation of Picky are independent of the algorithm. To facilitate the comparison between the performance of Picky and its predecessor, Pearl, the probabilistic model implemented for Picky is similar to Pearl's scoring model, the context-free grammar with context-sensitive probability (CFG with CSP) model. This probabilistic model estimates the probability of each parse T given the words in the sentence S, $P(T \mid S)$, by assuming that each non-terminal and its immediate children are dependent on the non-terminal's siblings and parent and on the part-of-speech trigram centered at the beginning of that rule:

$$P(T \mid S) \approx \prod_{A \in T} P(A \rightarrow \alpha \mid C \rightarrow \beta A \gamma,\ a_0 a_1 a_2) \tag{1}$$

where C is the non-terminal node which immediately dominates A, $a_1$ is the part-of-speech associated with the leftmost word of constituent A, and $a_0$ and $a_2$ are the parts-of-speech of the words to the left and to the right of $a_1$, respectively. See Magerman and Marcus 1991 [12] for a more detailed description of the CFG with CSP model.
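As a rough illustration of equation (1), the score of a parse tree can be sketched as a product of table lookups, one per constituent. Everything in the sketch below is an assumption made for illustration: the table `RULE_PROBS`, its key layout, the boundary symbols `<s>` and `</s>`, and the probability values are all hypothetical and do not come from the paper's trained model.

```python
from math import prod

# Hypothetical probability table. Each key is
# (rule, parent context, part-of-speech trigram a0 a1 a2).
# The values are invented for illustration only.
RULE_PROBS = {
    ("S -> NP VP", "TOP -> S", ("<s>", "det", "n")): 0.9,
    ("NP -> det n", "S -> NP VP", ("<s>", "det", "n")): 0.6,
    ("VP -> v", "S -> NP VP", ("n", "v", "</s>")): 0.3,
}


def parse_probability(constituents, table=RULE_PROBS):
    """Approximate P(T|S) as the product, over the constituents of T,
    of P(rule | parent context, trigram). Unseen events score 0.0."""
    return prod(table.get(key, 0.0) for key in constituents)


# A toy tree for a three-word sentence tagged "det n v":
# one table key per constituent in the tree.
tree = list(RULE_PROBS)
score = parse_probability(tree)
```

A tree containing any constituent that the table has never seen receives a score of zero in this sketch; a practical system would need some form of smoothing, which is outside what the text describes.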

A probabilistic language model, such as the aforementioned CFG with CSP model, provides a metric for evaluating the likelihood of a parse tree. However, while it may suggest a method for evaluating partial parse trees, a language model alone does not dictate the search strategy for determining the most likely analysis of an input. Since exhaustive search of the space of parse trees produced by a natural language grammar is generally not feasible, a parsing model can best take advantage of a probabilistic language model by incorporating it into a parser which probabilistically models the parsing process. Picky attempts to model the chart parsing process for context-free grammars using probabilistic prediction.

The algorithm proceeds in three phases: covered left-corner phase (I), covered bidirectional phase (II), and tree completion phase (III). Each phase uses a different method for proposing edges to be introduced to the parse chart. The first phase, covered left-corner, uses probabilistic prediction based on the left-corner word of the left-most daughter of a constituent to propose edges. The covered bidirectional phase also uses probabilistic prediction, but it allows prediction to occur from the left-corner word of any daughter of a constituent, and parses that constituent outward (bidirectionally) from that daughter. These phases are referred to as "covered" because, during these phases, the parsing mechanism proposes only edges that have non-zero probability according to the prediction model, i.e. that have been covered by the training process. The final phase, tree completion, is essentially an exhaustive search of all interpretations of the input, according to the grammar. However, the search proceeds in best-first order, according to the measures provided by the language model. This phase is used only when the probabilistic prediction model fails to propose the edges necessary to complete a parse of the sentence.

The following sections will present and motivate the prediction techniques used by the algorithm, and will then describe how they are implemented in each phase.

3.1 Probabilistic Prediction

Probabilistic prediction is a general method for using probabilistic information extracted from a parsed corpus to estimate the likelihood that predicting an edge at a certain point in the chart will lead to a correct analysis of the sentence. The Picky algorithm is not dependent on the specific probabilistic prediction model used. The model used in the implementation, which is similar to the probabilistic language model, will be described.⁴

The prediction model used in the implementation of Picky estimates the probability that an edge proposed at a point in the chart will lead to a correct parse to be:

$$P(A \rightarrow \alpha B \beta \mid a_0 a_1 a_2), \tag{2}$$

where $a_1$ is the part-of-speech of the left-corner word of B, $a_0$ is the part-of-speech of the word to the left of $a_1$, and $a_2$ is the part-of-speech of the word to the right of $a_1$.

To illustrate how this model is used, consider the sentence [...]

The word "cow" in the word sequence "the cow raced" predicts NP → det n, but not NP → det n PP, since PP is unlikely to generate a verb, based on training material.⁵ Assuming the prediction model is well [...] as the beginning of a participial phrase modifying "the cow," as in [...] participle will receive a low probability estimate relative to the verb interpretation, since the prediction model only considers local context.

⁴ It is not necessary for the prediction model to be the same as the language model used to evaluate complete analyses. However, it is helpful if this is the case, so that the probability estimates of incomplete edges will be consistent with the probability estimates of completed constituents.

⁵ Throughout this discussion, we will describe the prediction process using words as the predictors of edges. In the implementation, due to sparse data concerns, only parts-of-speech are used to predict edges. Given more robust estimation techniques, a probabilistic prediction model conditioned on word sequences is likely to perform as well or better.


The process of probabilistic prediction is analogous to that of a human parser recognizing predictive lexical items or sequences in a sentence and using these hints to restrict the search for the correct analysis of the sentence. For instance, a sentence beginning with a wh-word and auxiliary inversion is very likely to be a question, and trying to interpret it as an assertion is wasteful. If a verb is generally ditransitive, one should look for two objects to that verb instead of one or none. Using probabilistic prediction, sentences whose interpretations are highly predictable based on the trained parsing model can be analyzed with little wasted effort, generating sometimes no more than ten spurious constituents for sentences which contain between 30 and 40 constituents! Also, in some of these cases every predicted rule results in a completed [...] predictions and was led astray only by genuine ambiguities in parts of the sentence.
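The paper does not say how the prediction probabilities of equation (2) are estimated. One simple possibility, shown here purely as an assumption, is a relative-frequency estimate over (rule, trigram) pairs read off a parsed corpus. The function name `train_prediction_model` and the toy observations are hypothetical.

```python
from collections import Counter


def train_prediction_model(observations):
    """Build a relative-frequency estimate of P(rule | trigram).

    `observations` is an iterable of (rule, trigram) pairs, where the
    trigram is the part-of-speech context (a0, a1, a2) in which the rule
    was seen in the parsed training corpus.
    """
    observations = list(observations)
    pair_counts = Counter(observations)
    trigram_counts = Counter(trigram for _, trigram in observations)

    def prob(rule, trigram):
        total = trigram_counts[trigram]
        # Trigrams never seen in training give probability 0.0.
        return pair_counts[(rule, trigram)] / total if total else 0.0

    return prob


# Toy training data (invented): in the context "det n v" only the short
# NP rule was observed; the PP-bearing rule was seen before a preposition.
toy = (
    [("NP -> det n", ("det", "n", "v"))] * 3
    + [("NP -> det n PP", ("det", "n", "p"))] * 2
)
predict = train_prediction_model(toy)
```

With this toy data, the context "det n v" gives the rule NP → det n PP zero probability, so the covered phases would never propose it there, which mirrors the "the cow raced" discussion above.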

3.2 Exhaustive Prediction

When probabilistic prediction fails to generate the edges necessary to complete a parse of the sentence, exhaustive prediction uses the edges which have been generated in earlier phases to predict new edges which might combine with them to produce a complete parse. Exhaustive prediction is a combination of two existing types of prediction, "over-the-top" prediction [11] and top-down filtering.

Over-the-top prediction is applied to complete edges. A completed edge $A \rightarrow \alpha$ will predict all edges of the form $B \rightarrow \beta A \gamma$.⁶

Top-down filtering is used to predict edges in order to complete incomplete edges. An edge of the form $A \rightarrow \alpha B_0 B_1 B_2 \beta$, where a $B_1$ has been recognized, will predict edges of the form $B_0 \rightarrow \gamma$ before $B_1$ and edges of the form $B_2 \rightarrow \delta$ after $B_1$.
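The two kinds of exhaustive prediction can be sketched over a toy grammar as follows. The grammar, the function names, and the representation of rules as (left-hand side, right-hand side) pairs are assumptions made for this sketch; it ignores chart positions and scoring, and it implements the general form of over-the-top prediction rather than any restricted variant an implementation might use.

```python
# Toy context-free grammar (hypothetical): (lhs, rhs) pairs.
GRAMMAR = [
    ("S", ("NP", "VP")),
    ("NP", ("DET", "N")),
    ("NP", ("NP", "PP")),
    ("VP", ("V", "NP")),
    ("PP", ("P", "NP")),
]


def over_the_top(category, grammar=GRAMMAR):
    """Given a completed constituent of type `category` (A), return every
    rule B -> beta A gamma whose right-hand side contains A."""
    return [(lhs, rhs) for lhs, rhs in grammar if category in rhs]


def top_down_filter(rhs, recognized_index, grammar=GRAMMAR):
    """Given an incomplete edge with right-hand side `rhs` in which the
    daughter at `recognized_index` has been recognized, return the rules
    that expand the daughters immediately to its left and right."""
    neighbours = [
        i for i in (recognized_index - 1, recognized_index + 1)
        if 0 <= i < len(rhs)
    ]
    return [
        (lhs, expansion)
        for i in neighbours
        for lhs, expansion in grammar
        if lhs == rhs[i]
    ]


np_parents = over_the_top("NP")
vp_needs = top_down_filter(("V", "NP"), 0)
```

Here a completed NP proposes every rule that could contain it, while an incomplete VP edge with its V recognized proposes only the rules that could build the missing NP to its right.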

⁶ In the implementation of Picky, over-the-top prediction for $A \rightarrow \alpha$ will only predict edges of the form $B \rightarrow A \gamma$. This limitation on over-the-top prediction is due to the expensive bookkeeping involved in bidirectional parsing. See the section on bidirectional parsing for more details.

3.3 Bidirectional Parsing

The only difference between phases I and II is that phase II allows bidirectional parsing. Bidirectional parsing is a technique for initiating the parsing of a constituent from any point in that constituent. Chart parsing algorithms generally process constituents from left-to-right. For instance, given a grammar rule

$$A \rightarrow B_1 B_2 \cdots B_n,$$

a parser generally would attempt to recognize a $B_1$, then search for a $B_2$ following it, and so on. Bidirectional parsing recognizes an A by looking for any $B_i$. Once a $B_i$ has been parsed, a bidirectional parser looks for a $B_{i-1}$ to the left of the $B_i$, a $B_{i+1}$ to the right, and so on.

Bidirectional parsing is generally an inefficient technique, since it allows duplicate edges to be introduced into the chart. As an example, consider a context-free rule NP → DET N, and assume that there is a determiner followed by a noun in the sentence being parsed. Using bidirectional parsing, this NP rule can be predicted both by the determiner and by the noun. The edge predicted by the determiner will look to the right for a noun, find one, and introduce a new edge consisting of a completed NP. The edge predicted by the noun will look to the left for a determiner, find one, and also introduce a new edge consisting of a completed NP. Both of these NPs represent identical parse trees, and are thus redundant. If the algorithm permits both edges to be inserted into the chart, then an edge $XP \rightarrow \alpha\, NP\, \beta$ will be advanced by both NPs, creating two copies of every XP edge. These duplicate XP edges can themselves be used in other rules, and so on.

To avoid this propagation of redundant edges, the parser must ensure that no duplicate edges are introduced into the chart. Picky does this simply by verifying every time an edge is added that the edge is not already in the chart.

Although eliminating redundant edges prevents excessive inefficiency, bidirectional parsing may still perform more work than traditional left-to-right parsing. In the previous example, three edges are introduced into the chart to parse the NP → DET N edge. A left-to-right parser would only introduce two edges, one when the determiner is recognized, and another when the noun is recognized.
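The duplicate check described above can be sketched with a hash set. The `Edge` and `Chart` types below are hypothetical simplifications introduced for this example (a real chart edge would also record which daughters have been recognized and a score); the point is only that two completed NPs over the same span compare equal and the second insertion is refused.

```python
from typing import NamedTuple, Tuple


class Edge(NamedTuple):
    """A simplified completed edge: rule plus the span it covers."""
    lhs: str
    rhs: Tuple[str, ...]
    start: int  # index of the first word covered
    end: int    # index one past the last word covered


class Chart:
    def __init__(self):
        self._edges = set()

    def add(self, edge):
        """Insert `edge` unless an identical edge is already present.
        Returns True if the edge was inserted, False if it was a duplicate."""
        if edge in self._edges:
            return False
        self._edges.add(edge)
        return True

    def __len__(self):
        return len(self._edges)


chart = Chart()
# The same NP is reached once from the determiner and once from the noun.
added_from_det = chart.add(Edge("NP", ("DET", "N"), 0, 2))
added_from_noun = chart.add(Edge("NP", ("DET", "N"), 0, 2))
```

Because the second NP is rejected, no duplicate XP edges can be built on top of it.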

The benefit of bidirectional parsing can be seen when probabilistic prediction is introduced into the parser. Frequently, the syntactic structure of a constituent is not determined by its left-corner word. For instance, in the sequence V NP PP, the prepositional phrase PP can modify either the noun phrase NP or the entire verb phrase V NP. These two interpretations require different VP rules to be predicted, but the decision about which rule to use depends on more than just the verb. The correct rule may best be predicted by knowing the preposition used in the PP. Using probabilistic prediction, the decision is made by pursuing the rule which has the highest probability according to the prediction model. This rule is then parsed bidirectionally. If this rule is in fact the correct rule to analyze the constituent, then no other predictions will be made for that constituent, and there will be no more edges produced than in left-to-right parsing. Thus, the only case where bidirectional parsing is less efficient than left-to-right parsing is when the prediction model fails to capture the elements of context of the sentence which determine its correct interpretation.

Covered Left-Corner. The first phase uses probabilistic prediction based on the part-of-speech sequences from the input sentence to predict all grammar rules which have a non-zero probability of being dominated by that trigram (based on the training corpus), i.e.

$$P(A \rightarrow B \delta \mid a_0 a_1 a_2) > 0 \tag{6}$$

where $a_1$ is the part-of-speech of the left-corner word of B. In this phase, the only exception to the probabilistic prediction is that any rule which can immediately dominate the preterminal category of any word in the sentence is also predicted, regardless of its probability [...] prediction. All of the predicted rules are processed using a standard best-first agenda processing algorithm, where the highest scoring edge in the chart is advanced.
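Best-first agenda processing of this kind is commonly built on a priority queue. The sketch below uses Python's `heapq` module; the `Agenda` class, the string stand-ins for edges, and the scores are hypothetical, and the sketch shows only the ordering discipline, not how Picky scores or advances edges.

```python
import heapq
import itertools


class Agenda:
    """Priority queue of scored items; pop() returns the highest score first."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # insertion order breaks score ties

    def push(self, score, item):
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, next(self._tie), item))

    def pop(self):
        neg_score, _, item = heapq.heappop(self._heap)
        return -neg_score, item

    def __bool__(self):
        return bool(self._heap)


agenda = Agenda()
agenda.push(0.20, "VP edge")
agenda.push(0.70, "NP edge")
agenda.push(0.05, "PP edge")

order = []
while agenda:
    _, edge = agenda.pop()
    order.append(edge)  # a real parser would advance the edge here
```

The tie-breaking counter keeps the queue from ever comparing two edge objects directly, which matters once edges are richer structures than strings.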

Covered Bidirectional. If an S spanning the entire word string is not recognized by the end of the first phase, the covered bidirectional phase continues the parsing process. Using the chart generated by the first phase, rules are predicted not only by the trigram centered at the left-corner word of the rule, but by the trigram centered at the left-corner word of any of the children of that rule, i.e.

$$P(A \rightarrow \alpha B \beta \mid b_0 b_1 b_2) > 0 \tag{7}$$

where $b_1$ is the part-of-speech associated with the leftmost word of constituent B. This phase introduces incomplete theories into the chart which need to be expanded to the left and to the right, as described in the bidirectional parsing section above.

Tree Completion. If the bidirectional processing fails to produce a successful parse, then it is assumed that there is some part of the input sentence which is not covered well by the training material. In the final phase, exhaustive prediction is performed on all complete theories which were introduced in the previous phases but which are not predicted by the trigrams beneath them (i.e. $P(\text{rule} \mid \text{trigram}) = 0$).

In this phase, edges are only predicted by their left-corner word, since bidirectional parsing can be inefficient when the prediction model is inaccurate. Since all edges which the prediction model assigns non-zero probability have already been predicted, the model can no longer provide any information for future predictions. Thus, bidirectional parsing in this phase is very likely to be inefficient. Edges already in the chart will be parsed bidirectionally, since they were predicted by the model, but all new edges will be predicted by the left-corner word only.

Since it is already known that the prediction model will assign a zero probability to these rules, these predictions are instead scored based on the number of words spanned by the subtree which predicted them. Thus, this phase processes longer theories by introducing rules which can advance them. Each new theory which is proposed by the parsing process is exhaustively predicted for, using the length-based scoring model.

The final phase is used only when a sentence is so far outside of the scope of the training material that none of the previous phases are able to process it. This phase of the algorithm exhibits the worst-case exponential behavior that is found in chart parsers which do not use node packing. Since the probabilistic model is no longer useful in this phase, the parser is forced to propose an enormous number of theories. The expectation (or hope) is that one of the theories which spans most of the sentence will be completed by this final process. Depending on the size of the grammar used, it may be infeasible to allow the parser to exhaust all possible predictions before deciding an input is ungrammatical. The question of when the parser should give up is an empirical issue which will not be explored here.

Once the final phase has exhausted all predictions made by the grammar, or more likely, once the probability of all edges in the chart falls below a certain threshold, Picky determines the sentence to be ungrammatical. However, since the chart produced by Picky contains all recognized constituents, sorted by probability, the chart can be used to extract partial parses. As implemented, Picky prints out the most probable completed S constituent.

Previous research efforts have produced a wide variety of parsing algorithms for probabilistic and non-probabilistic grammars. One might question the need for a new algorithm to deal with context-sensitive probabilistic models. However, these previous efforts have generally failed to address both efficiency and robustness effectively.

For non-probabilistic grammar models, the CKY algorithm [9] [17] provides efficiency and robustness in polynomial time, $O(Gn^3)$. CKY can be modified to handle simple P-CFGs [2] without loss of efficiency. However, with the introduction of context-sensitive probability models, such as the history-based grammar [1] and the CFG with CSP models [12], CKY cannot be modified to accommodate these models without exhibiting exponential behavior in the grammar size G. The linear behavior of CKY with respect to grammar size is dependent upon being able to collapse the distinctions among constituents of the same type which span the same part of the sentence. However, when using a context-sensitive probabilistic model, these distinctions are necessary. For instance, in the CFG with CSP model, the part-of-speech sequence generated by a constituent affects the probability of constituents that dominate it. Thus, two constituents which generate different part-of-speech sequences must be considered individually and cannot be collapsed.

Earley's algorithm [6] is even more attractive than CKY in terms of efficiency, but it suffers from the same exponential behavior when applied to context-sensitive probabilistic models. Still, Earley-style prediction improves the average case performance of an exponential chart-parsing algorithm by reducing the size of the search space, as was shown in [12]. However, Earley-style prediction has serious impacts on robust processing of ungrammatical sentences. Once a sentence has been determined to be ungrammatical, Earley-style prediction prevents any new edges from being added to the parse chart. This behavior seriously degrades the robustness of a natural language system using this type of parser.

A few recent works on probabilistic parsing have proposed algorithms and devices for efficient, robust chart-based probabilistic parsing algorithms, although [...] algorithms use a strictly best-first search. As both Chitrao and Magerman [12] observe, a best-first search penalizes longer and more complex constituents (i.e. constituents which are composed of more edges), resulting in thrashing and loss of efficiency. Chitrao proposes a heuristic penalty based on constituent length to deal with this problem. Magerman avoids thrashing by calculating the score of a parse tree using the geometric mean of the probabilities of the constituents contained in the tree.
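The effect of the geometric mean can be seen in a small numerical sketch. The two functions and the probability values below are illustrative assumptions, not the scoring code of any of the cited parsers: a plain product shrinks with every additional constituent, whereas the geometric mean normalizes for the number of constituents.

```python
from math import prod


def product_score(probs):
    """Plain product of constituent probabilities."""
    return prod(probs)


def geometric_mean_score(probs):
    """Geometric mean of constituent probabilities (n-th root of the product)."""
    return prod(probs) ** (1.0 / len(probs))


# Two hypothetical theories whose constituents are all equally likely,
# one built from 2 constituents and one from 8.
short_theory = [0.5] * 2
long_theory = [0.5] * 8

short_product = product_score(short_theory)   # 0.25
long_product = product_score(long_theory)     # 0.00390625
short_geo = geometric_mean_score(short_theory)
long_geo = geometric_mean_score(long_theory)
```

Under the product, the longer theory falls far down the agenda even though each of its constituents is as likely as those of the shorter one; under the geometric mean both theories score about 0.5 and compete on equal terms.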

Moore [14] discusses techniques for improving the efficiency and robustness of chart parsers for unification grammars, but the ideas are applicable to probabilistic grammars as well. Some of the techniques proposed are well-known ideas, such as compiling ε-transitions (null gaps) out of the grammar and heuristically controlling the introduction of predictions.

The Picky parser incorporates what we deem to be the most effective techniques of these previous works into one parsing algorithm. New techniques, such as probabilistic prediction and the multi-phase approach, are introduced where the literature does not provide adequate solutions. Picky combines the standard chart parsing data structures with existing bottom-up and top-down parsing operations, and includes a probabilistic version of top-down filtering and over-the-top prediction. Picky also incorporates a limited form of bidirectional parsing in a way which avoids its computationally expensive side-effects. It uses an agenda processing control mechanism with the scoring heuristics of Pearl.

With the exception of probabilistic prediction, most of the ideas in this work individually are not original to the parsing technology literature. However, the combination of these ideas provides robustness without sacrificing efficiency, and efficiency without losing accuracy.

5 Results of Experiments

The Picky parser was tested on 3 sets of 100 sentences which were held out from the rest of the corpus during training. The training corpus consisted of 982 sentences which were parsed using the same grammar that Picky used. The training and test corpora are samples from MIT's Voyager direction-finding system.⁷ Using Picky's grammar, these test sentences generate, on average, over 100 parses per sentence, with some sentences generating over 1,000 parses.

The purpose of these experiments is to explore the impact of varying Picky's parsing algorithm on parsing accuracy, efficiency, and robustness. For these experiments, we varied three attributes of the parser: the phases used by the parser, the maximum number of edges the parser can produce before failure, and the minimum probability parse acceptable.

In the following analysis, the accuracy rate represents the percentage of the test sentences for which the highest probability parse generated by the parser is identical to the "correct" parse tree indicated in the parsed test corpus.⁸

Efficiency is measured by two ratios, the prediction ratio and the completion ratio. The prediction ratio is defined as the ratio of the number of predictions made by the parser during the parse of a sentence to the number of constituents [...]. The completion ratio is the ratio of the number of completed edges to the number of predictions during the parse of a sentence.

⁷ Special thanks to Victor Zue at MIT for the use of the speech data from MIT's Voyager system.

⁸ There are two exceptions to this accuracy measure. If the parser generates a plausible parse for a sentence which has multiple plausible interpretations, the parse is considered correct. Also, if the parser generates a correct parse, but the parsed test corpus contains an incorrect parse (i.e. if there is an error in the answer key), the parse is considered correct.

Robustness cannot be measured directly by these experiments, since there are few ungrammatical sentences and there is no implemented method for interpreting the well-formed substring table when a parse fails. However, for each configuration of the parser, we will explore the expected behavior of the parser in the face of ungrammatical input.

Since Picky has the power of a pure bottom-up parser, it would be useful to compare its performance and efficiency to that of a probabilistic bottom-up parser. However, an implementation of a probabilistic bottom-up parser using the same grammar produces on average over 1000 constituents for each sentence, generating over 15,000 edges without generating a parse at all! This supports our claim that exhaustive CKY-like parsing algorithms are not feasible when probabilistic models are applied to them.

5.1 Control Configuration

The control for our experiments is the configuration of Picky with all three phases and with a maximum edge count of 15,000. Using this configuration, Picky parsed the 3 test sets with an 89.3% accuracy rate. This is a slight improvement over Pearl's 87.5% accuracy rate reported in [12].

Recall that we will measure the efficiency of a parser configuration by its prediction ratio and completion ratio on the test sentences. A perfect prediction ratio is 1:1, i.e. every edge predicted is used in the eventual parse. However, since there is ambiguity in the input sentences, a 1:1 prediction ratio is not likely to be achieved. Picky's prediction ratio is approximately 4.3:1, and its ratio of completed edges to predicted edges is nearly 1.3:1. Thus, although the prediction ratio is not perfect, on average for every edge that is predicted more than one completed constituent results.
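The two efficiency measures can be written down as a small helper. This is a sketch under an assumption: the prediction ratio is taken to compare the number of predictions with the number of constituents in the eventual parse, following the remark that a perfect ratio means every predicted edge is used in that parse. The function name and the counts in the example are hypothetical, chosen only so that the output resembles the ratios quoted for the control configuration.

```python
def efficiency_ratios(predictions, completed_edges, parse_constituents):
    """Return (prediction ratio, completion ratio) for one parse.

    prediction ratio = predictions / constituents in the eventual parse
    completion ratio = completed edges / predictions
    """
    prediction_ratio = predictions / parse_constituents
    completion_ratio = completed_edges / predictions
    return prediction_ratio, completion_ratio


# Hypothetical counts for a single sentence (not taken from the paper).
pred_ratio, comp_ratio = efficiency_ratios(
    predictions=43, completed_edges=56, parse_constituents=10
)
```

With these invented counts the helper gives a prediction ratio of 4.3 and a completion ratio of about 1.3, i.e. more than one completed edge per prediction.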

This is the most robust configuration of Picky which will be attempted in our experiments, since it includes bidirectional parsing (phase II) and allows so many edges to be created. Although there was not a sufficient number or variety of ungrammatical sentences to explore the robustness of this configuration further, one interesting example did occur in the test sets. The sentence

How do I how do I get to MIT?

is an ungrammatical but interpretable sentence which begins with a restart. The Pearl parser would have generated no analysis for the latter part of the sentence and the corresponding sections of the chart would be empty. Using bidirectional probabilistic prediction, Picky produced a correct partial interpretation of the last 6 words of the sentence, "how do I get to MIT?" One sentence does not make for conclusive evidence, but it represents the type of performance which is expected from the Picky algorithm.

5.2 Phases vs. Efficiency

Each of Picky's three phases has a distinct role in the parsing process. Phase I tries to parse the sentences which are most standard, i.e. most consistent with the training material. Phase II uses bidirectional parsing to try to complete the parses for sentences which are nearly completely parsed by Phase I. Phase III uses a simplistic heuristic to glue together constituents generated by phases I and II. Phase III is obviously inefficient, since it is by definition processing atypical sentences. Phase II is also inefficient because of the bidirectional predictions added in this phase. But phase II also amplifies the inefficiency of phase III, since the bidirectional predictions added in phase II are processed further in phase III.

Table 1: Prediction and completion ratios and accuracy statistics for Picky configured with different subsets of Picky's three phases.

In Table 1, we see the efficiency and accuracy of Picky using different subsets of the parser's phases. Using the control parser (phases I, II, and III), the parser has a 4.3:1 prediction ratio and a 1.3:1 completion ratio.

By omitting phase III, we eliminate nearly half of the predictions and half the completed edges, resulting in a 2.15:1 prediction ratio. But this efficiency comes at the cost of coverage, which will be discussed in the next section.

By omitting phase II, we observe a slight reduction in predictions, but an increase in completed edges. This behavior results from the elimination of the bidirectional predictions, which tend to generate duplicate edges. Note that this configuration, while slightly more efficient, is less robust in processing ungrammatical input.

5.3 Phases vs. Accuracy

For some natural language applications, such as a natural language interface to a nuclear reactor or to a computer operating system, it is imperative for the user to have confidence in the parses generated by the parser. Picky has a relatively high parsing accuracy rate of nearly 90%; however, 10% error is far too high for fault-intolerant applications.

Table 2: Picky's parsing accuracy, categorized by the phase which the parser reached in processing the test sentences.

Consider the data in Table 2. While the parser has an overall accuracy rate of 89.3%, it is far more accurate on sentences which are parsed by phases I and II, at 97%. Note that 238 of the 300 sentences, or 79%, of the test sentences are parsed in these two phases. Thus, by eliminating phase III, the percent error can be reduced to 3%, while maintaining 77% coverage. An alternative to eliminating phase III is to replace the length-based heuristic of this phase with a secondary probabilistic model of the difficult sentences in this domain. This secondary model might be trained on a set of sentences which cannot be parsed in phases I and II.
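The arithmetic behind these figures can be checked with a short sketch. The counts 238 and 300 and the 97% accuracy are taken from the text above. Reading the 77% "coverage" as the share of all test sentences that still receive a correct parse once phase III is dropped is an assumption, since the text does not define the term; the function names are illustrative only.

```python
# Figures quoted in the text for the 300-sentence test set.
TOTAL_SENTENCES = 300
PARSED_IN_PHASES_I_II = 238
ACCURACY_IN_PHASES_I_II = 0.97


def share_reaching_parse(parsed: int, total: int) -> float:
    """Fraction of test sentences that phases I and II manage to parse."""
    return parsed / total


def share_correctly_parsed(parsed: int, total: int, accuracy: float) -> float:
    """Fraction of all test sentences parsed correctly without phase III.

    Assumption: this is what the text means by "coverage".
    """
    return accuracy * parsed / total


if __name__ == "__main__":
    reached = share_reaching_parse(PARSED_IN_PHASES_I_II, TOTAL_SENTENCES)
    correct = share_correctly_parsed(
        PARSED_IN_PHASES_I_II, TOTAL_SENTENCES, ACCURACY_IN_PHASES_I_II
    )
    print(f"parsed by phases I and II: {reached:.1%}")
    print(f"parsed correctly by phases I and II: {correct:.1%}")
```

Under this reading, the 79% and 77% figures are consistent with each other: 79% of the sentences are parsed in the first two phases, and 97% of those parses are correct.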

In the original implementation of the Picky algorithm, we intended to allow the parser to generate edges until it found a complete interpretation or exhausted all possible predictions. However, for some ungrammatical sentences, the parser generates tens of thousands of edges without terminating. To limit the processing time for the experiments, we implemented a maximum edge count which was sufficiently large so that all grammatical sentences in the test corpus would be parsed. All of the grammatical test sentences generated a parse before producing 15,000 edges. However, some sentences produced thousands of edges only to generate an incorrect parse. In fact, it seemed likely that there might be a correlation between very high edge counts and incorrect parses. We tested this hypothesis by varying the maximum edge count.
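The cutoff described above can be sketched as a generic best-first agenda loop with an edge budget. This is a minimal illustration under stated assumptions, not Picky's implementation: the names `parse_with_edge_limit`, `expand`, and `is_complete`, the use of a priority queue keyed on a score, and the choice to stop as soon as the budget is hit are all introduced here for the example.

```python
import heapq
from typing import Callable, Hashable, Iterable, Optional, Tuple

Edge = Hashable
Scored = Tuple[float, Edge]


def parse_with_edge_limit(
    initial: Iterable[Scored],
    expand: Callable[[Edge], Iterable[Scored]],
    is_complete: Callable[[Edge], bool],
    max_edges: int,
) -> Optional[Edge]:
    """Best-first search that gives up once max_edges edges were generated.

    Returns the first complete edge taken off the agenda, or None if the
    agenda empties or the edge budget is exhausted first.
    """
    agenda = []      # min-heap of (-score, insertion order, edge)
    seen = set()     # edges already generated, to avoid duplicates
    generated = 0    # number of distinct edges generated so far

    def push(score: float, edge: Edge) -> bool:
        nonlocal generated
        if edge in seen:
            return True
        if generated >= max_edges:
            return False  # budget exhausted
        seen.add(edge)
        generated += 1
        # The insertion order breaks ties, so edges are never compared.
        heapq.heappush(agenda, (-score, generated, edge))
        return True

    for score, edge in initial:
        if not push(score, edge):
            return None

    while agenda:
        _, _, edge = heapq.heappop(agenda)
        if is_complete(edge):
            return edge
        for score, new_edge in expand(edge):
            if not push(score, new_edge):
                return None  # treat the sentence as unparsable
    return None
```

A caller would supply the grammar-specific parts through `expand` and `is_complete`; lowering `max_edges` trades coverage for a bound on the work done per sentence.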

Table 3: Prediction and completion ratios and accuracy statistics for Picky configured with different maximum edge counts.

In Table 3, we see an increase in efficiency and a decrease in accuracy as we reduce the maximum number of edges the parser will generate before declaring a sentence ungrammatical. By reducing the maximum edge count by a factor of 50, from 15,000 to 300, we can nearly cut in half the number of predictions and edges generated by the parser. And while this causes the accuracy rate to fall from 89.3% to 79.3%, it also results in a significant decrease in error rate, down to 2.7%. By decreasing the maximum edge count down to 150, the error rate can be reduced to 1.7%.
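Accuracy and error rate can fall together because a sentence that hits the edge limit is rejected rather than parsed incorrectly. The sketch below makes that bookkeeping explicit for the 300-edge setting. It assumes every test sentence ends in exactly one of three outcomes (correct parse, incorrect parse, or rejection), which the text implies but does not state; the function name is hypothetical.

```python
from typing import Tuple


def outcome_shares(accuracy_pct: float, error_pct: float) -> Tuple[float, float]:
    """Derive two quantities from an accuracy rate and an error rate.

    Assumption: every sentence is parsed correctly, parsed incorrectly,
    or rejected, so the three shares sum to 100%.

    Returns (share of sentences rejected in percent,
             fraction of returned parses that are correct).
    """
    rejected_pct = 100.0 - accuracy_pct - error_pct
    returned_pct = accuracy_pct + error_pct
    if returned_pct <= 0:
        raise ValueError("no sentence received a parse")
    return rejected_pct, accuracy_pct / returned_pct


if __name__ == "__main__":
    # Figures quoted in the text for a maximum edge count of 300.
    rejected, precision = outcome_shares(accuracy_pct=79.3, error_pct=2.7)
    print(f"rejected: {rejected:.1f}%")
    print(f"correct among returned parses: {precision:.1%}")
```

With the quoted figures, about 18% of the sentences are rejected, and roughly 96.7% of the parses that are returned are correct.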

5.5 Probability vs. Accuracy

Since a probability represents the likelihood of an interpretation, it is not unreasonable to expect the probability of a parse tree to be correlated with the accuracy of the parse. However, based on the probabilities associated with the "correct" parse trees of the test sentences, there appears to be no such correlation. Many of the test sentences had correct parses with very low probabilities (10^-10), while others had much higher probabilities (10^-2). And the probabilities associated with incorrect parses were not distinguishable from the probabilities of correct parses.

The failure to find a correlation between probability and accuracy in this experiment does not prove conclusively that no such correlation exists. Admittedly, the training corpus used for all of these experiments is far smaller than one would hope for when estimating the parameters of the CFG with CSP model. Thus, while the model is trained well enough to steer the parsing search, it may not be sufficiently trained to provide meaningful probability values.
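One way such a relationship could be checked, given per-sentence parse probabilities and correctness judgements, is a Pearson correlation between the log probability and a 0/1 correctness flag. The sketch below is an illustration only: the paper does not say which test, if any, was used, and the function names are assumptions. No data from the experiments is reproduced here.

```python
import math
from typing import Iterable, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def log_prob_vs_correctness(parses: Iterable[Tuple[float, bool]]) -> float:
    """Correlate log10(parse probability) with parse correctness.

    `parses` holds (probability, is_correct) pairs, one per sentence.
    Logs are used because parse probabilities span many orders of magnitude.
    """
    pairs = list(parses)
    log_probs = [math.log10(p) for p, _ in pairs]
    flags = [1.0 if ok else 0.0 for _, ok in pairs]
    return pearson(log_probs, flags)
```

A coefficient near zero would match the observation above that correct and incorrect parses are not distinguishable by probability, while a value near 1 would indicate that higher-probability parses tend to be correct.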

6 Conclusions

There are many different applications of natural language parsing, and each application has a different cost threshold for efficiency, robustness, and accuracy. The Picky algorithm introduces a framework for integrating these thresholds into the configuration of the parser in order to maximize the effectiveness of the parser for the task at hand. An application which requires a high degree of accuracy would omit the Tree Completion phase of the parser. A real-time application would limit the number of edges generated by the parser, likely at the cost of accuracy. An application which is robust to errors but requires efficient processing of input would omit the Covered Bidirectional phase.

The Picky parsing algorithm illustrates how probabilistic modelling of natural language can be used to improve the efficiency, robustness, and accuracy of natural language understanding tools.
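The three application profiles described in this section can be written down as parser configurations. The sketch below is hypothetical: `ParserConfig`, its field names, and the preset names are not part of Picky. It takes the Covered Bidirectional phase to be phase II and the Tree Completion phase to be phase III, matching the discussion in Sections 5.2 and 5.3, and it borrows the 300-edge limit from the experiments purely as an example value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ParserConfig:
    """Hypothetical switches for the trade-offs discussed above."""

    use_covered_bidirectional: bool = True  # phase II
    use_tree_completion: bool = True        # phase III
    max_edges: Optional[int] = None         # None means no edge limit


def enabled_phases(config: ParserConfig) -> List[str]:
    """List the phases a configuration would run (phase I always runs)."""
    phases = ["I"]
    if config.use_covered_bidirectional:
        phases.append("II")
    if config.use_tree_completion:
        phases.append("III")
    return phases


# High accuracy: skip the Tree Completion phase.
HIGH_ACCURACY = ParserConfig(use_tree_completion=False)

# Real-time use: cap the number of edges (300 is only an example value).
REAL_TIME = ParserConfig(max_edges=300)

# Error-tolerant but efficient: skip the Covered Bidirectional phase.
EFFICIENT_ERROR_TOLERANT = ParserConfig(use_covered_bidirectional=False)
```

Keeping the switches independent lets an application combine them, for example by dropping phase III and also setting an edge limit.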

