Optimizing Typed Feature Structure Grammar Parsing through Non-Statistical Indexing
Cosmin Munteanu and Gerald Penn
University of Toronto
10 King's College Rd
Toronto M5S 3G4 Canada
{mcosmin,gpenn}@cs.toronto.edu

Abstract
This paper introduces an indexing method based on static analysis of grammar rules and type signatures for typed feature structure grammars (TFSGs). The static analysis tries to predict at compile-time which feature paths will cause unification failure during parsing at run-time. To support the static analysis, we introduce a new classification of the instances of variables used in TFSGs, based on what type of structure sharing they create. The indexing actions that can be performed during parsing are also enumerated. Non-statistical indexing has the advantage of not requiring training, and, as the evaluation using large-scale HPSGs demonstrates, the improvements are comparable with those of statistical optimizations. Such statistical optimizations rely on data collected during training, and their performance does not always compensate for the training costs.
1 Introduction

Developing efficient all-paths parsers has been a long-standing goal of research in computational linguistics. One particular class still in need of parsing time improvements is that of TFSGs. While simpler formalisms such as context-free grammars (CFGs) also face slow all-paths parsing times when the size of the grammar increases significantly, TFSGs (which generally have fewer rules than large-scale CFGs) become slow as a result of the complex structures used to describe the grammatical categories. In HPSGs (Pollard and Sag, 1994), one category description could contain hundreds of feature values. This has been a barrier in transferring successful CFG techniques to TFSG parsing.

For TFSG chart parsers, one of the most time-consuming operations is the retrieval of categories from the chart during rule completion (closing of constituents in the chart under a grammar rule). Looking in the chart for a matching edge for a daughter is accomplished by attempting unifications with edges stored in the chart, resulting in many failed unifications. The large and complex structure of TFS descriptions (Carpenter, 1992) leads to slow unification times, affecting the parsing times. Thus, failing unifications must be avoided during retrieval from the chart.
To our knowledge, there have been only four methods proposed for improving the retrieval component of TFSG parsing. One (Penn and Munteanu, 2003) addresses only the cost of copying large categories, and was found to reduce parsing times by an average of 25% on a large-scale TFSG (MERGE). The second, a statistical method known as quick-check (Malouf et al., 2000), determines the paths that are likely to cause unification failure by profiling a large sequence of parses over representative input, and then filters unifications at run-time by first testing these paths for type consistency. This was measured as providing up to a 50% improvement in parse times on the English Resource Grammar (Flickinger, 1999, ERG). The third (Penn, 1999b) is a similar but more conservative approach that uses the profile to re-order sister feature values in the internal data structure. This was found to improve parse times on the ALE HPSG by up to 33%.

The problem with these statistical methods is that the improvements in parsing times may not justify the time spent on profiling, particularly during grammar development. The static analysis method introduced here does not use profiling, although it does not preclude it either. Indeed, an evaluation of statistical methods would be more relevant if measured on top of an adequate extent of non-statistical optimizations. Although quick-check is thought to produce parsing time improvements, its evaluation used a parser with only a superficial static analysis of chart indexing.

That analysis, rule filtering (Kiefer et al., 1999), reduces parse times by filtering out mother-daughter unifications that can be determined to fail at compile-time. True indexing organizes the data (in this case, chart edges) to avoid unnecessary retrievals altogether, does not require the operations that it performs to be repeated once full unification
is deemed necessary, and offers the support for easily adding information extracted from further static analysis of the grammar rules, while maintaining the same indexing strategy. Flexibility is one of the reasons for the successful employment of indexing in databases (Elmasri and Navathe, 2000) and automated reasoning (Ramakrishnan et al., 2001).

In this paper, we present a general scheme for indexing TFS categories during parsing (Section 3). We then present a specific method for statically analyzing TFSGs based on the type signature and the structure of category descriptions in the grammar rules, and prove its soundness and completeness (Section 4.2.1). We describe a specific indexing strategy based on this analysis (Section 4), and evaluate it on two large-scale TFSGs (Section 5). The result is a purely non-statistical method that is competitive with the improvements gained by statistical optimizations, and is still compatible with further statistical improvements.
2 TFSG Terminology

TFSs are used as formal representatives of rich grammatical categories. In this paper, the formalism from (Carpenter, 1992) will be used. A TFSG is defined relative to a fixed set of types and set of features, along with constraints, called appropriateness conditions. These are collectively known as the type signature (Figure 3). For each type, appropriateness specifies all and only the features that must have values defined in TFSs of that type. It also specifies the types of the values that those features can take. The set of types is partially ordered, and has a unique most general type (⊥, "bottom"). This order is called subsumption (⊑): more specific (higher) types inherit appropriate features from their more general (lower) supertypes. Two types t1 and t2 unify (t1 ⊔ t2 is defined) iff they have a least upper bound in the hierarchy. Besides a type signature, TFSGs contain a set of grammar (phrase) rules and lexical descriptions. A simple example of a lexical description is: john → (SYNSEM: (SYN: np, SEM: j)), while an example of a phrase rule is given in Figure 1.
(SYN: s, SEM: (VPSem, AGENT: NPSem)) →
  (SYN: np, AGR: Agr, SEM: NPSem),
  (SYN: vp, AGR: Agr, SEM: VPSem)

Figure 1: A phrase rule stating that the syntactic category s can be combined from np and vp if their values for AGR are the same. The semantics of s is that of the verb phrase, while the semantics of the noun phrase serves as agent.
2.1 Typed Feature Structures

A TFS (Figure 2) is like a recursively defined record in a programming language: it has a type and features with values that can be TFSs, all obeying the appropriateness conditions of the type signature. TFSs can also be seen as rooted graphs, where arcs correspond to features and nodes to substructures. A node typing function θ(q) associates a type to every node q in a TFS. Every TFS F has a unique starting or root node, q̄_F. For a given TFS, the feature value partial function δ(f, q) specifies the node reachable from q by feature f when one exists. The path value partial function δ(π, q) specifies the node reachable from q by following a path of features π when one exists. TFSs can be unified as well. The result represents the most general consistent combination of the information from two TFSs. That information includes typing (by unifying the types), feature values (by recursive unification), and structure sharing (by an equivalence closure taken over the nodes of the arguments). For large TFSs, unification is computationally expensive, since all the nodes of the two TFSs are visited. In this process, many nodes are collapsed into equivalence classes because of structure sharing. A node x in a TFS F with root q̄_F and a node x′ in a TFS F′ with root q̄_F′ are equivalent (≍) with respect to F ⊔ F′ iff x = q̄_F and x′ = q̄_F′, or if there is a path π such that δ_{F⊔F′}(π, q̄_F) = x and δ_{F⊔F′}(π, q̄_F′) = x′.
[Figure 2 shows a TFS of type throwing, with THROWER: index(PERSON: third, NUMBER: [1] singular, GENDER: masculine) and THROWN: index(PERSON: third, NUMBER: [1], GENDER: neuter).]
Figure 2: A TFS. Features are written in uppercase, while types are written in bold-face lowercase. Structure sharing is indicated by numerical tags, such as [1].
[Figure 3 shows a type signature in which throwing declares THROWER: index and THROWN: index; index declares PERSON: pers, NUMBER: num and GENDER: gend; pers has subtypes first, second and third; num has subtypes singular and plural; gend has subtypes masculine, feminine and neuter.]
Figure 3: A type signature. For each type, appropriateness declares the features that must be defined on TFSs of that type, along with the type restrictions applying to their values.
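To make the graph view concrete, the sketch below (Python, purely illustrative; the Node class and delta_path function are our own names, not part of ALE) builds the TFS of Figure 2 over the signature of Figure 3 and implements the path value function δ(π, q).

# Minimal sketch of a TFS as a rooted, typed feature graph (illustrative only).

class Node:
    """A TFS node: a type plus outgoing feature arcs to other nodes."""
    def __init__(self, type_, feats=None):
        self.type = type_            # theta(q): the node typing function
        self.feats = feats or {}     # delta(f, q): feature arcs

def delta_path(node, path):
    """Path value function delta(pi, q): follow a sequence of features, or None."""
    for feat in path:
        node = node.feats.get(feat)
        if node is None:
            return None
    return node

# The TFS of Figure 2: the NUMBER values of THROWER and THROWN are structure-shared
# (tag [1] in the figure), so both features point at the same 'singular' node.
shared_number = Node("singular")
tfs = Node("throwing", {
    "THROWER": Node("index", {"PERSON": Node("third"),
                              "NUMBER": shared_number,
                              "GENDER": Node("masculine")}),
    "THROWN":  Node("index", {"PERSON": Node("third"),
                              "NUMBER": shared_number,
                              "GENDER": Node("neuter")}),
})

assert delta_path(tfs, ["THROWER", "PERSON"]).type == "third"
# Structure sharing: the two NUMBER nodes are one and the same object.
assert delta_path(tfs, ["THROWER", "NUMBER"]) is delta_path(tfs, ["THROWN", "NUMBER"])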
2.2 Structure Sharing in Descriptions

TFSGs are typically specified using descriptions, which logically denote sets of TFSs. Descriptions can be more terse because they can assume all of the information about their TFSs that can be inferred from appropriateness. Each non-disjunctive description can be associated with a unique most general feature structure in its denotation called a most general satisfier (MGSat). While a formal presentation can be found in (Carpenter, 1992), we limit ourselves to an intuitive example: the TFS from Figure 2 is the MGSat of the description:
throwing, THROWER: (PERSON: third, NUMBER: (singular, Nr), GENDER: masculine), THROWN: (PERSON: third, NUMBER: Nr, GENDER: neuter).
Descriptions can also contain variables, such as Nr.
Structure sharing is enforced in descriptions through the use of variables. In TFSGs, the scope of a variable extends beyond a single description, resulting in structure sharing between different TFSs. In phrase structure rules (Figure 1), this sharing can occur between different daughter categories in a rule, or between a mother and a daughter. Unless the term description is explicitly used, we will use "mother" and "daughter" to refer to the MGSat of a mother or daughter description.
We can classify instances of variables based on what type of structure sharing they create. Internal variables are the variables that represent internal structure sharing (such as in Figure 2). The occurrences of such variables are limited to a single category in a phrase structure rule. External variables are the variables used to share structure between categories. If a variable is used for structure sharing both inside a category and across categories, then it is also considered an external variable. For a specific category, two kinds of external variable instances can be distinguished, depending on their occurrence relative to the parsing control strategy: active external variables and inactive external variables. Active external variables are instances of external variables that are shared between the description of a category D and one or more descriptions of categories in the same rule as D visited by the parser before D as the rule is extended (completed). Inactive external variables are the external variable instances that are not active. For example, in bottom-up left-to-right parsing, all of a mother's external variable instances would be active because, being external, they also occur in one of the daughter descriptions. Similarly, all of the leftmost daughter's external variable instances would be inactive because this is the first description used by the parser. In Figure 1, Agr is an active external variable in the second daughter, but it is inactive in the first daughter.

The active external variable instances are important for path indexing (Section 4.2), because they represent the points at which the parser must copy structure between TFSs. They are therefore substructures that must be provided to a rule by the parsing chart if these unifications could potentially fail. They also represent shared nodes in the MGSats of a rule's category descriptions. In our definitions, we assume without loss of generality that parsing proceeds bottom-up, with left-to-right extension of rule daughters. This is the ALE system's (Carpenter and Penn, 1996) parsing strategy.
Definition 1. If D_1, ..., D_n are daughter descriptions in a rule and the rules are extended from left to right, then Ext(MGSat(D_i)) is the set of nodes shared between MGSat(D_i) and MGSat(D_1), ..., MGSat(D_{i−1}). For a mother description M, Ext(MGSat(M)) is the set of nodes shared with any daughter in the same rule.

Because the completion of TFSG rules can cause the categories to change in structure (due to external variable sharing), we need some extra notation to refer to a phrase structure rule's categories at different times during a single application of that rule. By M̄ we symbolize the mother M after M's rule is completed (all of the rule's daughters are matched with edges in the chart). D̄ symbolizes the daughter D after all daughters to D's left in D's rule were unified with edges from the chart. An important relation exists between M and M̄: if q̄_M is M's root and q̄_M̄ is M̄'s root, then for every x ∈ M there is an x̄ ∈ M̄ such that, for every path π for which δ(π, q̄_M) = x and δ(π, q̄_M̄) = x̄, θ(x) ⊑ θ(x̄). In other words, extending the rule extends the information states of its categories monotonically. A similar relation exists between D and D̄. The set of all nodes x in M such that there is a path π for which δ(π, q̄_M) = x and δ(π, q̄_M̄) = x̄ will be denoted by x̄⁻¹ (and likewise for nodes in D). There may be more than one node in x̄⁻¹ because of unifications that occur during the extension of M to M̄.
3 The Indexing Timeline

Indexing can be applied at several moments during parsing. We introduce a general strategy for indexed parsing, with respect to what actions should be taken at each stage. Three main stages can be identified. The first one consists of indexing actions that can be taken off-line (along with other optimizations that can be performed at compile-time). The second and third stages refer to actions performed at run time.
Stage 1. In the off-line phase, a static analysis of grammar rules can be performed. The complete content of mothers and daughters may not be accessible, due to variables that will be instantiated during parsing, but various sources of information, such as the type signature, appropriateness specifications, and the types and features of mother and daughter descriptions, can be analyzed and an appropriate indexing scheme can be specified. This phase of indexing may include determining: (1a) which daughters in which rules will certainly not unify with a specific mother, and (1b) what information can be extracted from categories during parsing that can constitute indexing keys. It is desirable to perform as much analysis as possible off-line, since the cost of any action taken during run time prolongs the parsing time.

Stage 2. During parsing, after a rule has been completed, all variables in the mother have been extended as far as they can be before insertion into the chart. This offers the possibility of further investigating the mother's content and extracting supplemental information from the mother that contributes to the indexing keys. However, the choice of such investigative actions must be carefully studied, since it might burden the parsing process.

Stage 3. While completing a rule, for each daughter a matching edge is searched in the chart. At this moment, the daughter's active external variables have been extended as far as they can be before unification with a chart edge. The information identified in stage (1b) can be extracted and unified as a precursor to the remaining steps involved in category unification. These steps also take place at this stage.
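The three stages can be summarized as a small chart skeleton. The sketch below is illustrative only: compute_schemes and extract_key are trivial stand-ins for the static analysis and key extraction discussed in this paper, and none of the names are taken from an actual implementation.

# Skeleton of indexed chart parsing, annotated with the three indexing stages.

def compute_schemes(grammar):
    """Stage 1 (off-line): static analysis of the rules and type signature
    decides, for every mother, the keys under which its edges will be filed
    (here, trivially, just the mother's own identifier)."""
    return {mother: [mother] for mother in grammar}

def extract_key(daughter):
    """Stage 3 helper: the key a daughter uses to probe the chart
    (here, trivially, the identifier of the only mother it can match)."""
    return daughter

class IndexedChart:
    def __init__(self, grammar):
        self.schemes = compute_schemes(grammar)   # mother -> list of keys
        self.entries = {}                         # key -> list of edges

    def add_edge(self, mother, edge):
        # Stage 2 (run time, after a rule is completed): the fully extended
        # mother is filed under every key of its indexing scheme.
        for key in self.schemes[mother]:
            self.entries.setdefault(key, []).append(edge)

    def candidates(self, daughter):
        # Stage 3 (run time, daughter lookup): visit only the edges filed
        # under the daughter's key instead of traversing the whole chart.
        return self.entries.get(extract_key(daughter), [])

chart = IndexedChart(grammar=["s_rule_mother"])
chart.add_edge("s_rule_mother", edge="edge#1")
print(chart.candidates("s_rule_mother"))   # ['edge#1']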
4 TFSG Indexing

To reduce the time spent on failures when searching for an edge in the chart, each edge (edge's category) has an associated index key which uniquely identifies the set of daughter categories that can potentially match it. When completing a rule, edges unifying with a specific daughter are searched for in the chart. Instead of visiting all edges in the chart, the daughter's index key selects a restricted number of edges for traversal, thus reducing the number of unification attempts.

The passive edges added to the chart represent specializations of rules' mothers. When a rule is completed, its mother M is added to the chart according to M's indexing scheme, which is the set of index keys of daughters that might possibly unify with M. The index is implemented as a hash, where the hash function applied to a daughter yields the daughter's index key (a selection of chart edges). For a passive edge representing M, M's indexing scheme provides the collection of hash entries where it will be added.

Each daughter is associated with a unique index key. During parsing, a specific daughter is searched for in the chart by visiting only those edges that have a matching key, thus reducing the time needed for traversing the chart. The index keys can be computed off-line (when daughters are indexed by position), or during parsing.
4.1 Positional Indexing

In positional indexing, the index key for each daughter is represented by its position (rule number and daughter position in the rule). The structure of the index can be determined at compile-time (first stage). For each mother M in the grammar, a collection L(M) = {(R_i, D_j)} of daughters that can match M is created (M's indexing scheme), where each element of L(M) represents the rule number R_i and daughter position D_j inside rule R_i (1 ≤ j ≤ arity(R_i)) of a category that can match with M.

For TFSGs it is not possible to compute off-line the exact list of mother-daughter matching pairs, but it is possible to rule out certain non-unifiable pairs before parsing — a compromise that pays off with a very low index management time.

During parsing, each time an edge (representing a rule's mother M) is added to the chart, it is inserted into the hash entries associated with the positions (R_i, D_j) from the list L(M) (the number of entries where M is inserted is |L(M)|). The entry associated with the key (R_i, D_j) will contain only categories that can possibly unify with the daughter at position (R_i, D_j) in the grammar.

Because our parsing algorithm closes categories depth-first under leftmost daughter matching, only daughters D_i with i ≥ 2 are searched for in the chart (and consequently, indexed). We used the EFD-based modification of this algorithm (Penn and Munteanu, 2003), which needs no active edges, and requires a constant two copies per edge, rather than the standard one copy per retrieval found in Prolog parsers. Without this, the cost of copying TFS categories would have overwhelmed the benefit of the index.
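A minimal rendering of positional indexing under these assumptions (Python, illustrative only): may_match is a stand-in for the compile-time compatibility filter, and categories are reduced to bare type names rather than full descriptions.

# Positional indexing sketch (illustrative). The real filter is a static
# analysis of full descriptions; any conservative may_match() would do.

JOINS = {("np", "np"), ("vp", "vp"), ("s", "s")}   # toy signature: like types only

def may_match(mother_type, daughter_type):
    """Compile-time filter: False only when unification is certain to fail."""
    return (mother_type, daughter_type) in JOINS

def build_schemes(rules):
    """Off-line: L(M) = all (rule, position) pairs a mother M could fill.
    rules[r] = (mother_type, [daughter_type, ...])."""
    schemes = {mother: set() for mother, _ in rules}
    for r, (_, daughters) in enumerate(rules):
        # Only positions j >= 2 are retrieved from the chart (the leftmost
        # daughter is matched depth-first rather than looked up).
        for j, d in enumerate(daughters[1:], start=2):
            for mother in schemes:
                if may_match(mother, d):
                    schemes[mother].add((r, j))
    return schemes

rules = [("s", ["np", "vp"]), ("vp", ["vp", "np"])]
schemes = build_schemes(rules)

chart = {}                                   # (rule, position) -> edges
def add_edge(mother_type, edge):
    for key in schemes.get(mother_type, ()):     # run time: file under L(M)
        chart.setdefault(key, []).append(edge)

add_edge("vp", "edge: saw the dog")
print(chart.get((0, 2), []))   # candidates for daughter 2 of rule 0 (the vp)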
4.2 Path Indexing

Path indexing is an extension of positional indexing. Although it shares the same underlying principle as the path indexing used in automated reasoning (Ramakrishnan et al., 2001), its functionality is related to quick-check: extract a vector of types from a mother (which will become an edge) and a daughter, and test the unification of the two vectors before attempting to unify the edge and the daughter. Path indexing differs from quick-check in that it identifies these paths by a static analysis of grammar rules, performed off-line and with no training required. Path indexing is also built on top of positional indexing, therefore the vector of types can be different for each potentially unifiable mother-daughter pair.
4.2.1 Static Analysis of Grammar Rules

Similar to the abstract interpretation used in program verification (Cousot and Cousot, 1992), the static analysis tries to predict a run-time phenomenon (specifically, unification failures) at compile-time. It tries to identify nodes in a mother that carry no relevant information with respect to unification with a particular daughter. For a mother M unifiable with a daughter D, these nodes will be grouped in a set StaticCut(M, D). Intuitively, these nodes can be left out or ignored while computing the unification of M̄ and D̄. The StaticCut can be divided into two subsets:

StaticCut(M, D) = RigidCut(M, D) ∪ VariableCut(M, D)

The RigidCut represents nodes that can be left out because neither they, nor one of their δπ-ancestors, can have their type values changed by means of external variable sharing. The VariableCut represents nodes that are either externally shared, or have an externally shared ancestor, but still can be left out.
Definition 2. RigidCut(M, D) is the largest subset of nodes x ∈ M such that, for every y ∈ D for which x ≍ y:
1. x ∉ Ext(M), y ∉ Ext(D),
2. there is no x′ ∈ M such that there is a path π with δ(π, x′) = x and x′ ∈ Ext(M), and
3. there is no y′ ∈ D such that there is a path π with δ(π, y′) = y and y′ ∈ Ext(D).

Definition 3. VariableCut(M, D) is the largest subset of nodes x ∈ M such that:
1. x ∉ RigidCut(M, D), and
2. for every y ∈ D for which x ≍ y, for all s ⊒ θ(x) and all t ⊒ θ(y), s ⊔ t exists.
In words, a node can be left out even if it is externally shared (or has an externally shared ancestor) if all possible types this node can have unify with all possible types its corresponding nodes in D can have. Due to structure sharing, the types of nodes in M and D can change during parsing, by being specialized to one of their subtypes. Condition 2 ensures that the types of these nodes will remain compatible (have a least upper bound), even if they specialize during rule completion. An intuitive example (real-life examples cannot be reproduced here — a category in a typical TFSG can have hundreds of nodes) is presented in Figure 4.
[Figure 4 shows a type signature together with a mother M (nodes x1–x4) and a daughter D (nodes y1–y5); externally shared nodes are pointed to by dashed arrows, the types involved are t0–t8, and the features are F, G, H, I, J and K.]
Figure 4: Given the above type signature, mother M and daughter D (externally shared nodes are pointed to by dashed arrows), nodes x1, x2 and x3 from M can be left out when unifying M with D during parsing. x1 and x3 ∈ RigidCut(M, D), while x2 ∈ VariableCut(M, D) (θ(y2) can promote only to t7, thus x2 and y2 will always be compatible). x4 is not included in the StaticCut, because if θ(y5) promotes to t5, then θ(y4) will promote to t5 (not unifiable with t3).
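Condition 2 of Definition 3 depends only on the type signature, so it can be checked exhaustively off-line. The following sketch (our own toy signature, not the one in Figure 4) enumerates every type the two nodes could be promoted to and tests that each pair still has an upper bound.

# Checking Condition 2 of Definition 3 over a toy signature (illustrative).
# Convention as in the text: more specific types are "higher"; a node's type
# may only be promoted (specialized) during parsing.

SUBTYPES = {            # immediate specializations: type -> more specific types
    "bot": ["t0", "index"],
    "t0": ["t1", "t2"], "t1": ["t3"], "t2": ["t3"],
    "index": [],
    "t3": [],
}

def specializations(t):
    """All types a node of type t could be promoted to (t included)."""
    out, stack = set(), [t]
    while stack:
        u = stack.pop()
        if u not in out:
            out.add(u)
            stack.extend(SUBTYPES[u])
    return out

def unifiable(s, t):
    """s and t unify iff they have a least upper bound; in a bounded-complete
    signature it is enough to check for a common specialization."""
    return bool(specializations(s) & specializations(t))

def condition2(type_x, type_y):
    """Definition 3, Condition 2: every type x could specialize to must unify
    with every type its D-correspondent y could specialize to."""
    return all(unifiable(s, t)
               for s in specializations(type_x)
               for t in specializations(type_y))

print(condition2("t1", "t2"))      # True: both can only promote within {t1, t2, t3}
print(condition2("t0", "index"))   # False: e.g. t1 and index never unify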
When computing the unification between a mother and a daughter during parsing, the same outcome (success or failure) will be reached by using a reduced representation of the mother (denoted M̄_sD), with nodes in StaticCut(M, D) removed from M̄.
Proposition 1. For a mother M and a daughter D, if M ⊔ D ≠ ⊥ before parsing, and M̄ (as an edge in the chart) and D̄ exist, then during parsing: (1) M̄_sD ⊔ D̄ ≠ ⊥ ⟹ M̄ ⊔ D̄ ≠ ⊥, and (2) M̄_sD ⊔ D̄ = ⊥ ⟹ M̄ ⊔ D̄ = ⊥.

Proof. The second part (M̄_sD ⊔ D̄ = ⊥ ⟹ M̄ ⊔ D̄ = ⊥) of Proposition 1 has a straightforward proof: if M̄_sD ⊔ D̄ = ⊥, then there is a z̄ in M̄_sD ⊔ D̄ such that no type t exists for which t ⊒ θ(x̄) for all x̄ ∈ ⟨z̄⟩. Since M̄_sD ⊆ M̄, there is a z̄ in M̄ ⊔ D̄ such that no type t exists for which t ⊒ θ(x̄) for all x̄ ∈ ⟨z̄⟩, and therefore M̄ ⊔ D̄ = ⊥.

The first part of the proposition will be proven by showing that, for every z̄ in M̄ ⊔ D̄, a consistent type can be assigned to ⟨z̄⟩, where ⟨z̄⟩ is the set of nodes in M̄ and D̄ equivalent to z̄ with respect to the unification of M̄ and D̄.¹

¹ Because we do not assume inequated TFSs (Carpenter, 1992) here, unification failure must result from type inconsistency.
Three lemmata need to be formulated:

Lemma 1. If x̄ ∈ M̄ and x ∈ x̄⁻¹, then θ(x̄) ⊒ θ(x). Similarly, for ȳ ∈ D̄ and y ∈ ȳ⁻¹, θ(ȳ) ⊒ θ(y).
Lemma 2. If types t0, t1, ..., tn are such that t0 ⊔ ti exists for every i, then there is a type t ⊒ t0 such that t ⊒ ti for every i (1 ≤ i ≤ n).
Lemma 3. If x̄ ∈ M̄ and ȳ ∈ D̄ for which x̄ ≍ ȳ, then for every x ∈ x̄⁻¹ there is a y ∈ ȳ⁻¹ such that x ≍ y.
In proving the first part of Proposition 1, four cases are identified: Case A: |⟨z̄⟩ ∩ M̄| = 1 and |⟨z̄⟩ ∩ D̄| = 1; Case B: |⟨z̄⟩ ∩ M̄| = 1 and |⟨z̄⟩ ∩ D̄| > 1; Case C: |⟨z̄⟩ ∩ M̄| > 1 and |⟨z̄⟩ ∩ D̄| = 1; Case D: |⟨z̄⟩ ∩ M̄| > 1 and |⟨z̄⟩ ∩ D̄| > 1. Case A is trivial, and D is a generalization of B and C.
Case B. It will be shown that there is a type t such that, for every ȳ ∈ ⟨z̄⟩ ∩ D̄ and for the single x̄ ∈ ⟨z̄⟩ ∩ M̄, t ⊒ θ(ȳ) and t ⊒ θ(x̄).

Subcase B.i: x̄ ∈ M̄ but x̄ ∉ M̄_sD. For every ȳ ∈ ⟨z̄⟩ ∩ D̄, ȳ ≍ x̄. Therefore, according to Lemma 3, for every x ∈ x̄⁻¹ there is a y ∈ ȳ⁻¹ such that x ≍ y. Thus, according to Condition 2 of Definition 3, for all s ⊒ θ(y) and all t ⊒ θ(x), s ⊔ t exists. But according to Lemma 1, θ(ȳ) ⊒ θ(y) and θ(x̄) ⊒ θ(x). Therefore, for every ȳ ∈ ⟨z̄⟩ ∩ D̄, for all s ⊒ θ(ȳ) and all t ⊒ θ(x̄), s ⊔ t exists, and hence, for every ȳ ∈ ⟨z̄⟩ ∩ D̄, θ(x̄) ⊔ θ(ȳ) exists. Thus, according to Lemma 2, there is a type t ⊒ θ(x̄) such that, for every ȳ ∈ ⟨z̄⟩ ∩ D̄, t ⊒ θ(ȳ).

Subcase B.ii: x̄ ∈ M̄ and x̄ ∈ M̄_sD. Since M̄_sD ⊔ D̄ ≠ ⊥, there is a type t ⊒ θ(x̄) such that, for every ȳ ∈ ⟨z̄⟩ ∩ D̄, t ⊒ θ(ȳ).
Case C. It will be shown that there is a type t′ ⊒ θ(ȳ) such that, for every x̄ ∈ ⟨z̄⟩, t′ ⊒ θ(x̄). Let ȳ be the single element of ⟨z̄⟩ ∩ D̄. The set ⟨z̄⟩ ∩ M̄ can be divided into two subsets: S_ii = {x̄ ∈ ⟨z̄⟩ ∩ M̄ | x̄ ∈ M̄_sD} and S_i = {x̄ ∈ ⟨z̄⟩ ∩ M̄ | x̄ ∈ M̄ ∖ M̄_sD and x ∈ VariableCut(M, D)} (if x were in RigidCut(M, D), then necessarily |⟨z̄⟩ ∩ M̄| would be 1). Since S_ii ⊆ M̄_sD and M̄_sD ⊔ D̄ ≠ ⊥, there is a type t ⊒ θ(ȳ) such that, for every x̄ ∈ S_ii, t ⊒ θ(x̄) (*). However, for every x̄ ∈ S_i, x̄ ≍ ȳ. Therefore, according to Lemma 3, for every x̄ ∈ S_i and every x ∈ x̄⁻¹, there is a y ∈ ȳ⁻¹ such that x ≍ y. Thus, since x ∈ VariableCut(M, D), Condition 2 of Definition 3 holds, and therefore, according to Lemma 1, for all s1 ⊒ θ(x̄) and all s2 ⊒ θ(ȳ), s1 ⊔ s2 exists. Moreover, since t ⊒ θ(ȳ) (for the type t from (*)), for all s1 ⊒ θ(x̄) and all s2 ⊒ t, s1 ⊔ s2 exists; hence, for every x̄ ∈ S_i, t ⊔ θ(x̄) exists. Thus, according to Lemma 2 and to (*), there is a type t′ ⊒ t ⊒ θ(ȳ) such that, for every x̄ ∈ S_i, t′ ⊒ θ(x̄). Thus, there is a type t′ such that, for every x̄ ∈ ⟨z̄⟩, t′ ⊒ θ(x̄).

While Proposition 1 could possibly be used by
grammar developers to simplify TFSGs themselves at the source-code level, here we only exploit it for internally identifying index keys for more efficient chart parsing with the existing grammar. There may be better static analyses, and better uses of this static analysis. In particular, future work will focus on using static analysis to determine smaller representations (by cutting nodes in StaticCuts) of the chart edges themselves.
4.2.2 Building the Path Index

The indexing schemes used in path indexing are built on the same principles as those in positional indexing. The main difference is the content of the indexing keys, which now includes a third element. Each mother M has its indexing scheme defined as L(M) = {(R_i, D_j, V_ij)}. The pair (R_i, D_j) is the positional index key (as in positional indexing), while V_ij is the path index vector containing type values extracted from M. A different set of types is extracted for each mother-daughter pair. So, path indexing uses a two-layer indexing method: the positional key for daughters, and types extracted from the typed feature structure. Each daughter's index key is now given by L(D_j) = {(R_i, V_ij)}, where R_i is the rule number of a potentially matching mother, and V_ij is the path index vector containing types extracted from D_j.

The types extracted for the indexing vectors are those of nodes found at the end of indexing paths. A path π is an indexing path for a mother-daughter pair (M, D) iff: (1) π is defined for both M and D, (2) there is an x ∈ StaticCut(M, D) and a feature f such that δ(f, x) = δ(π, q̄_M) (where q̄_M is M's root), and (3) δ(π, q̄_M) ∉ StaticCut(M, D). Indexing paths are the "frontiers" of the non-statically-cut nodes of M.
A similar key extraction could be performed during Stage 2 of indexing (as outlined in Section 3), using M̄ rather than M. We have found that this on-line path discovery is generally too expensive to be performed during parsing, however.

As stated in Proposition 1, the nodes in StaticCut(M, D) do not affect the success or failure of M̄ ⊔ D̄. Therefore, the types of the first nodes not included in StaticCut(M, D) along each path π that stems from the root of M and D are included in the indexing key, since these nodes might contribute to the success or failure of the unification. It should be mentioned that the vectors V_ij are filled with values extracted from M̄ after M's rule is completed, and from D̄ after all daughters to the left of D are unified with edges in the chart. As an example, assuming that the indexing paths are THROWER:PERSON, THROWN, and THROWN:GENDER, the path index vector for the TFS shown in Figure 2 is ⟨third, index, neuter⟩.
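Building V_ij then amounts to one path traversal per indexing path. A sketch (illustrative only; TFSs are abbreviated as nested dicts with a "_type" entry, and the third path below, THROWN:GENDER, is the one inferred above for the Figure 2 example):

# Extracting a path index vector (illustrative sketch).

def delta_path(tfs, path):
    """Follow a feature path; return the sub-TFS or None if undefined."""
    for feat in path:
        tfs = tfs.get(feat)
        if tfs is None:
            return None
    return tfs

def index_vector(tfs, indexing_paths):
    """V: the types at the ends of the indexing paths (None where undefined)."""
    vector = []
    for path in indexing_paths:
        node = delta_path(tfs, path)
        vector.append(node["_type"] if node else None)
    return vector

# The TFS of Figure 2, with the shared NUMBER node written out twice for brevity.
fig2 = {"_type": "throwing",
        "THROWER": {"_type": "index", "PERSON": {"_type": "third"},
                    "NUMBER": {"_type": "singular"}, "GENDER": {"_type": "masculine"}},
        "THROWN":  {"_type": "index", "PERSON": {"_type": "third"},
                    "NUMBER": {"_type": "singular"}, "GENDER": {"_type": "neuter"}}}

paths = [("THROWER", "PERSON"), ("THROWN",), ("THROWN", "GENDER")]
print(index_vector(fig2, paths))   # ['third', 'index', 'neuter']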
4.2.3 Using the Path Index

Inserting and retrieving edges from the chart using path indexing is similar to the general method presented at the beginning of this section. The first layer of the index is used to insert a mother as an edge into appropriate chart entries, according to the positional keys for the daughters it can match. Along with the mother, its path index vector is inserted into the chart.

When searching for a matching edge for a daughter, the search is restricted by the first indexing layer to a single entry in the chart (labeled with the positional index key for the daughter). The second layer restricts searches to the edges that have a compatible path index vector. The compatibility is defined as type unification: the type pointed to by the element V_ij[n] of an edge's vector V_ij should unify with the type pointed to by the element V′_ij[n] of the path index vector V′_ij of the daughter on position D_j in a rule R_i.
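The retrieval itself is then a dictionary lookup followed by an element-wise compatibility test. In the sketch below (illustrative only), unify_types is a stand-in for type unification over the signature: only identical types, or an absent (None) value, count as compatible.

# Two-layer retrieval with path indexing (illustrative sketch).

def unify_types(s, t):
    """Stand-in for type unification: identical types, or an absent value
    (None, compatible with anything), unify."""
    return s is None or t is None or s == t

def compatible(edge_vector, daughter_vector):
    """Second layer: every position of the edge's vector must unify with the
    corresponding position of the daughter's vector."""
    return all(unify_types(s, t) for s, t in zip(edge_vector, daughter_vector))

def candidates(chart, positional_key, daughter_vector):
    """First layer: one chart entry per positional key; second layer: keep
    only edges whose stored vector is type-compatible with the daughter's."""
    return [edge for edge, edge_vector in chart.get(positional_key, [])
            if compatible(edge_vector, daughter_vector)]

chart = {(0, 2): [("edge#1", ["third", "index", "neuter"]),
                  ("edge#2", ["first", "index", "neuter"])]}
print(candidates(chart, (0, 2), ["third", "index", None]))   # ['edge#1']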
5 Experimental Evaluation

Two TFSGs were used to evaluate the performance of indexing: a pre-release version of the MERGE grammar, and the ALE port of the ERG (in its final form). MERGE is an adaptation of the ERG which uses types more conservatively in favour of relations, macros and complex-antecedent constraints. This pre-release version has 17 rules, 136 lexical items, 1157 types, and 144 introduced features. The ERG port has 45 rules, 1314 lexical entries, 4305 types and 155 features. MERGE was tested on 550 sentences of lengths between 6 and 16 words, extracted from the Wall Street Journal annotated parse trees (where phrases not covered by MERGE's vocabulary were replaced by lexical entries having the same parts of speech), and from MERGE's own test corpus. ERG was tested on 1030 sentences of lengths between 6 and 22 words, extracted from the Brown Corpus and from the Wall Street Journal annotated parse trees.
Rather than use the current version of ALE, TFSs were encoded as Prolog terms as prescribed in (Penn, 1999a), where the number of argument positions is the number of colours needed to colour the feature graph. This was extended to allow for the enforcement of type constraints during TFS unification. Types were encoded as attributed variables in SICStus Prolog (Swedish Institute of Computer Science, 2004).

5.1 Positional and path indexing evaluation

The average and best improvements in parsing times of positional and path indexing over the same EFD-based parser without indexing are presented in Table 1. The parsers were implemented in SICStus 3.10.1 for Solaris 8, running on a Sun Server with 16 GB of memory and 4 UltraSparc v.9 processors at 1281 MHz. For MERGE, parsing times range from 10 milliseconds to 1.3 seconds. For ERG, parsing times vary between 60 milliseconds and 29.2 seconds.
[Table 1; columns: Positional Index, Path Index.]
Table 1: Parsing time improvements of positional and path indexing over the non-indexed EFD parser.
5.2 Comparison with statistical optimizations

Non-statistical optimizations can be seen as a first step toward a highly efficient parser, while statistical optimization can be applied as a second step. However, one of the purposes of non-statistical indexing is to eliminate the burden of training while offering comparable improvements in parsing times. A quick-check parser was also built and evaluated, and the set-up times for the indexed parsers and the quick-check parser were compared (Table 2). Quick-check was trained on a 300-sentence training corpus, as prescribed in (Malouf et al., 2000). The training corpus included 150 sentences also used in testing. The number of paths in path indexing is different for each mother-daughter pair, ranging from 1 to 43 over the two grammars.

Table 2: The set-up times for non-statistically indexed parsers and statistically optimized parsers for MERGE.
As seen in Table 3, quick-check alone surpasses positional and path indexing for the ERG. However, it is outperformed by them on the MERGE, recording slower times than even the baseline. But the combination of quick-check and path indexing is faster than quick-check alone on both grammars. Path indexing at best provided no decrease in performance over positional indexing alone in these experiments, attesting to the difficulty of maintaining efficient index keys in an implementation.

Table 3: Comparison of average improvements over non-indexed parsing among all parsers.
The quick-check evaluation presented in (Malouf et al., 2000) uses only sentences with a length of at most 10 words, and the authors do not report the set-up times. Quick-check has an additional advantage in the present comparison, because half of the training sentences were included in the test corpus. While quick-check improvements on the ERG confirm other reports on this method, it must be noted that quick-check appears to be parochially very well-suited to the ERG (indeed, quick-check was developed alongside testing on the ERG). Although the recommended first 30 most probable failure-causing paths account for a large part of the failures recorded in training on both grammars (94% for ERG and 97% for MERGE), only 51 paths caused failures at all for MERGE during training, compared to 216 for the ERG. Further training with quick-check for determining a better vector length for MERGE did not improve its performance.

[Table 4; columns: Grammar, Successful unifications, Failed unifications, Failure rate reduction (vs. no index).]
Table 4: The number of successful and failed unifications for the non-indexed, positional indexing, path indexing, and quick-check parsers, over MERGE and ERG (collected on the slowest sentence in the corresponding test sets).
This discrepancy in the number of failure-causing paths could be resulting in an overfitted quick-check vector, or perhaps the 30 paths chosen for MERGE really are not the best 30 (quick-check uses a greedy approximation). In addition, as shown in Table 4, the improvements made by quick-check on the ERG are explained by the drastic reduction of (chart look-up) unification failures during parsing relative to the other methods. It appears that nothing short of a drastic reduction is necessary to justify the overhead of maintaining the index, which is the largest for quick-check because some of its paths must be traversed at run-time — path indexing only uses paths available at compile-time in the grammar source. Note that path indexing outperforms quick-check on MERGE in spite of its lower failure reduction rate, because of its smaller overhead.
6 Conclusions and Future Work

The indexing method proposed here is suitable for several classes of unification-based grammars. The index keys are determined statically and are based on an a priori analysis of grammar rules. A major advantage of such indexing methods is the elimination of the lengthy training processes needed by statistical methods. Our experimental evaluation demonstrates that indexing by static analysis is a promising alternative to optimizing parsing with TFSGs, although the time consumed by on-line maintenance of the index is a significant concern — echoes of an observation that has been made in applications of term indexing to databases and programming languages (Graf, 1996). Further work on efficient implementations and data structures is therefore required. Indexing by static analysis of grammar rules combined with statistical methods also can provide a higher aggregate benefit.

The current static analysis of grammar rules used as a basis for indexing does not consider the effect of the universally quantified constraints that typically augment the signature and grammar rules. Future work will investigate this extension as well.
References

B. Carpenter and G. Penn. 1996. Compiling typed attribute-value logic grammars. In H. Bunt and M. Tomita, editors, Recent Advances in Parsing Technologies, pages 145–168. Kluwer.

B. Carpenter. 1992. The Logic of Typed Feature Structures. Cambridge University Press.

P. Cousot and R. Cousot. 1992. Abstract interpretation and application to logic programs. Journal of Logic Programming, 13(2–3).

R. Elmasri and S. Navathe. 2000. Fundamentals of Database Systems. Addison-Wesley.

D. Flickinger. 1999. The English Resource Grammar. http://lingo.stanford.edu/erg.html.

P. Graf. 1996. Term Indexing. Springer.

B. Kiefer, H.U. Krieger, J. Carroll, and R. Malouf. 1999. A bag of useful techniques for efficient and robust parsing. In Proceedings of the 37th Annual Meeting of the ACL.

R. Malouf, J. Carroll, and A. Copestake. 2000. Efficient feature structure operations without compilation. Natural Language Engineering, 6(1).

G. Penn and C. Munteanu. 2003. A tabulation-based parsing method that reduces copying. In Proceedings of the 41st Annual Meeting of the ACL, Sapporo, Japan.

G. Penn. 1999a. An optimised Prolog encoding of typed feature structures. Technical Report 138, SFB 340, Tübingen.

G. Penn. 1999b. Optimising don't-care non-determinism with statistical information. Technical Report 140, SFB 340, Tübingen.

C. Pollard and I. Sag. 1994. Head-driven Phrase Structure Grammar. The University of Chicago Press.

I.V. Ramakrishnan, R. Sekar, and A. Voronkov. 2001. Term indexing. In Handbook of Automated Reasoning, volume II, chapter 26. Elsevier Science.

Swedish Institute of Computer Science. 2004. SICStus Prolog 3.11.0. http://www.sics.se/sicstus.