
THE REPRESENTATION OF CONSTITUENT STRUCTURES FOR FINITE-STATE PARSING


D. Terence Langendoen

Yedidyah Langsam

Departments of English and Computer & Information Science, Brooklyn College of the City University of New York

Brooklyn, New York 11210 U.S.A.

ABSTRACT

A mixed prefix-postfix notation for representations of the constituent structures of the expressions of natural languages is proposed, which are of limited degree of center embedding if the original expressions are noncenter-embedding. The method of constructing these representations is applicable to expressions with center embedding, and results in representations which seem to reflect the ways in which people actually parse those expressions. Both the representations and their interpretations can be computed from the expressions from left to right by finite-state devices.

The acceptable expressions of a natural language L all manifest no more than a small, fixed, finite degree n of center embedding. From this observation, it follows that the ability of human beings to parse the expressions of L can be modeled by a finite transducer that associates with the acceptable expressions of L representations of the structural descriptions of those expressions. This paper considers some initial steps in the construction of such a model.

The first step is to determine a method of representing the class of constituent structures of the expressions of L without center embedding in such a way that the members of that class themselves have no more than a small, fixed, finite degree of center embedding. Given a grammar that directly generates that class of constituent structures, it is not difficult to construct a deterministic finite-state transducer (parser) that assigns the appropriate members of that class to the noncenter-embedded expressions of L from left to right.

The second step is to extend the method so that it is capable of representing the class of constituent structures of expressions of L with no more than degree n of center embedding in a manner that reflects the way in which human beings actually parse those sentences. Given certain reasonable assumptions about the character of the rules of grammar of natural languages, we show how this step can also be taken.

*This work was partly supported by a grant from the PSC-CUNY Faculty Research Award Program.

Let G be a context-free phrase-structure grammar (CFPSG). First, suppose that the category A in G is right-recursive; i.e., that there are subderivations with respect to G such that A ==> X A, where X is a nonnull string of symbols (terminal, nonterminal, or mixed). We seek a new CFPSG G*, derived from G, that contains the category A* (corresponding to A), such that there are subderivations with respect to G* of the form A* ==> X* A*, where X* represents the constituent structure of X with respect to G.

Next, suppose that the category B in G is left-recursive; i.e., that there are subderivations with respect to G such that B ==> B Y, where Y is nonnull. We seek a new CFPSG G*, derived from G, that contains the category B* (corresponding to B), such that there are subderivations with respect to G* of the form B* ==> B* Y*, where Y* represents the constituent structure of Y with respect to G.

In other words, given a grammar G, we seek a grammar G* that directly generates strings that represent the constituent structures of the noncenter-embedded expressions generated by G, that is right-recursive wherever G is right-recursive and is left-recursive wherever G is left-recursive.

In order to find such a G*, we must first determine what kinds of strings are available that can represent constituent structures and at the same time can be directly generated by noncenter-embedding grammars. Full bracketing diagrams are not suitable, since grammars that generate them are center embedding whenever the original grammars are left- or right-recursive (Langendoen 1975). Suppose, however, that we leave off right brackets in right-recursive structures and left brackets in left-recursive structures. In right-recursive structures, the positions of the left brackets that remain indicate where each constituent begins; the position where each constituent ends can be determined by a simple counting procedure provided that the number of daughters of that constituent is known (e.g., when the original grammar is in Chomsky normal form). Similarly, in left-recursive structures, the positions of the right brackets that remain indicate where each constituent ends, and the position where each constituent begins can also be determined simply by counting.

Moreover, since brackets no longer occur in matched pairs, the brackets themselves can be omitted, leaving only the category labels. In left-recursive structures, these category symbols occur as postfixes; in right-recursive structures, as prefixes. Let us call a symbol which occurs as a prefix or a postfix in a string that represents the constituent structure of an expression an affix; the strings themselves affixed strings; and the grammars that generate those strings affix grammars.
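The counting procedure for right-recursive structures can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a procedure given in the paper: the grammar is taken to be in Chomsky normal form, so a prefix followed by a word has one daughter and a prefix followed by another prefix has two; upper-case tokens are taken to be category labels and lower-case tokens words; and the function name `constituent_spans` is hypothetical.

```python
def constituent_spans(tokens):
    """Recover where each prefixed constituent ends by counting daughters.

    Assumes a well-formed, prefix-only string over a Chomsky-normal-form
    grammar: upper-case tokens are category prefixes, lower-case tokens
    are words. Returns (label, start, end) triples, with end exclusive.
    """
    spans = []
    open_frames = []  # each frame: [label, start index, daughters still needed]
    for i, tok in enumerate(tokens):
        if tok.isupper():
            # One daughter if a word follows, two if another category follows.
            needed = 2 if tokens[i + 1].isupper() else 1
            open_frames.append([tok, i, needed])
            continue
        # A word completes a daughter of the innermost open constituent;
        # each constituent that closes completes a daughter of its mother.
        while open_frames:
            open_frames[-1][2] -= 1
            if open_frames[-1][2] > 0:
                break
            label, start, _ = open_frames.pop()
            spans.append((label, start, i + 1))
    return spans


# The right-recursive structure [A [B b] [A [B b] [A a]]] with its
# brackets dropped and only the prefixed category labels kept.
for span in constituent_spans("A B b A B b A a".split()):
    print(span)
```

No right bracket is needed: the end of every constituent follows from the number of daughters still outstanding when each word is read.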

To see how affix grammars may be constructed, consider the noncenter-embedding CFPSG G1, which generates the artificial language L1 = a(b*a)*b*a.

(G1) a. S → S A
     b. A → B A
     c. A → a
     d. B → b
     e. S → a

A noncenter-embedding affix grammar that generates the affixed strings that represent the constituent structures of the expressions of L1 with respect to G1 is given in G1*.

(G1*) a. S* → S* A* S
      b. A* → A B* A*
      c. A* → A a
      d. B* → B b
      e. S* → S a

Among the expressions generated by G1 is E1; the affixed string generated by G1* that represents its structural description is E1*.

(E1) a b b a b a

(E1*) S a A B b A B b A a S A B b A a S
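To make the left-to-right assignment concrete, here is a minimal sketch of a deterministic finite-state transducer for this one example. The state names and the transition table are assumptions of the sketch, worked out by hand from the rules of G1*; the paper does not give a transducer explicitly.

```python
# Transition table: (state, input word) -> (next state, affixes and word emitted).
TRANSITIONS = {
    ("start", "a"): ("done", ["S", "a"]),      # G1*e: S* -> S a
    ("done", "b"): ("in_A", ["A", "B", "b"]),  # G1*b, d: prefix A, then B b
    ("in_A", "b"): ("in_A", ["A", "B", "b"]),
    ("done", "a"): ("done", ["A", "a", "S"]),  # G1*c ends the A; postfix S of G1*a
    ("in_A", "a"): ("done", ["A", "a", "S"]),
}


def transduce(expression):
    """Map an expression generated by G1 to its affixed string, one word at a time."""
    state = "start"
    output = []
    for word in expression.split():
        if (state, word) not in TRANSITIONS:
            raise ValueError(f"unexpected word {word!r} in state {state!r}")
        state, emitted = TRANSITIONS[(state, word)]
        output.extend(emitted)
    if state != "done":
        raise ValueError("input is not a complete expression")
    return " ".join(output)


print(transduce("a b b a b a"))
# prints: S a A B b A B b A a S A B b A a S
```

The machine needs only three states and no stack, because every affix can be emitted as soon as the word that licenses it is read.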

Let us say that an affix covers elements in an affixed string which correspond to its constituents (not necessarily immediate). Then E1* may be interpreted as a structural description of E1 with respect to G1 according to the rules in R, in which J, K, and L are affixes; k is a word; x and y are substrings of affixed strings; and G is a CFPSG (in this case, G1).

(R) a. If K → k is a rule of G, then in the configuration ... K k ..., K is a prefix which covers k.

    b. If J → K L is a rule of G, then in the configuration ... J K x L ..., in which x does not contain L, J is a prefix which covers K L.

    c. If J → K L is a rule of G, then in the configuration ... K x L y J ..., in which x does not contain L and y does not contain K, J is a postfix which covers K L.

Coverage of constituents by the rules in R may be thought to be assigned dynamically from left to right.
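The rules in R can be tried out on E1* with a small stack-based interpreter. This is a sketch under stated assumptions rather than the authors' procedure: it hard-codes the rules of G1, it relies on the fact that in G1* no postfixed constituent occurs inside a prefixed one, and the names `interpret` and `bracket` are hypothetical.

```python
LEXICAL = {("S", "a"), ("A", "a"), ("B", "b")}  # rules K -> k, read by Ra
PREFIX = {"A": ("B", "A")}    # J -> K L with J written first (Rb)
POSTFIX = {"S": ("S", "A")}   # J -> K L with J written last (Rc)


def interpret(affixed):
    """Build a tree (nested tuples) from an affixed string of G1*, left to right."""
    tokens = affixed.split()
    stack = []  # tuples are finished constituents; lists are open prefix frames

    def attach(node):
        # Hand a finished constituent to the innermost open prefix frame,
        # closing every frame that this completes.
        while stack and isinstance(stack[-1], list):
            frame = stack[-1]
            expected = PREFIX[frame[0]]
            if node[0] != expected[len(frame) - 1]:
                raise ValueError(f"{frame[0]} cannot cover {node[0]} here")
            frame.append(node)
            if len(frame) - 1 < len(expected):
                return
            stack.pop()
            node = tuple(frame)
        stack.append(node)

    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if (tok, nxt) in LEXICAL:                      # Ra: prefix covering a word
            attach((tok, nxt))
            i += 2
        elif tok in PREFIX and nxt == PREFIX[tok][0]:  # Rb: prefix covering K L
            stack.append([tok])
            i += 1
        elif tok in POSTFIX and len(stack) >= 2:       # Rc: postfix covering K L
            right, left = stack.pop(), stack.pop()
            if isinstance(left, list) or (left[0], right[0]) != POSTFIX[tok]:
                raise ValueError(f"postfix {tok} has nothing to cover")
            attach((tok, left, right))
            i += 1
        else:
            raise ValueError(f"cannot interpret {tok!r} at position {i}")
    if len(stack) != 1 or isinstance(stack[0], list):
        raise ValueError("affixed string is incomplete")
    return stack[0]


def bracket(tree):
    """Render a tree as a fully bracketed string for display."""
    label, *kids = tree
    inner = " ".join(k if isinstance(k, str) else bracket(k) for k in kids)
    return f"[{label} {inner}]"


print(bracket(interpret("S a A B b A B b A a S A B b A a S")))
# prints: [S [S [S a] [A [B b] [A [B b] [A a]]]] [A [B b] [A a]]]
```

The stack here is a convenience for building the full tree; it is not a claim about the finite-state device the paper has in mind.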

A postfix is used in rule G1*a because the category S is left-recursive in G1, whereas a prefix is used in rule G1*b because the category A is right-recursive in G1. The use of prefixes in rules G1*c-e, on the other hand, is unmotivated if the only relevant principles are those having to do with direction of recursion. For affix grammars of natural languages, however, one can motivate the decision to use a particular type of affix by principles other than those having to do with direction of recursion.

The use of a prefix can be interpreted as indicating a decision (or guess) on the part of the language user as to the identity of a particular constituent on the basis of the identity of the first constituent in it. Since lexical items are assigned to lexical categories essentially as soon as they are recognized (Forster 1976), we may suppose first that prefixes are used for rules such as those in G1*c-e that assign lexical items to lexical categories. Second, if, as seems reasonable, a decision about the identity of constituents is always made as soon as possible, then we may suppose that prefixes are used for all rules in which the leftmost daughter of a particular constituent provides sufficient evidence for the identification of that constituent; e.g., if the leftmost daughter is either the specifier or the head of that constituent in the sense of Jackendoff (1977). Third, we may suppose that even if the leftmost daughter of a particular constituent does not provide sufficient evidence for the identification of that constituent, a prefix may still be used if that constituent is the left sister of a constituent that provides sufficient evidence for its identification. Fourth, we may suppose that postfixes are used in all other cases.

To illustrate the use of these four principles, consider the noncenter-embedding partial grammar G2 that generates a fragment of English that we call L2.

(G2) a. S → NP VP
     b. NP → D N̄
     c. NP → Ḡ N̄
     d. Ḡ → NP G
     e. N̄ → N
     f. VP → V ({NP, C̄})
     g. C̄ → C S
     h. C → that
     i. D → the
     j. G → 's
     k. N → {boss, child, ...}
     l. V → {knew, saw, ...}

Among the expressions of L2 are those with both right-recursion and left-recursion, such as E2.

(E2) the boss knew that the teacher's sister's neighbor's friend believed that the student saw the child

We now give an affix grammar G2* that directly generates affixed strings that represent the structural descriptions of the expressions of L2 with respect to G2, and that has been constructed in accordance with the four principles described above.

(G2*) a. i.  S* → S NP* VP* / C __
         ii. S* → NP* VP* S / elsewhere
      b. NP* → NP D* N̄*
      c. NP* → Ḡ* N̄* NP
      d. Ḡ* → NP* G* Ḡ
      e. N̄* → N̄ N*
      f. VP* → VP V* ({NP*, C̄*})
      g. C̄* → C̄ C* S*
      h. C* → C that
      i. D* → D the
      j. G* → G 's
      k. N* → N {child, house, ...}
      l. V* → V {knew, saw, ...}

Rules G2*h-l conform to the first principle, according to which lexical categories generally appear as prefixes. Rules G2*b,e-g conform to the second principle, according to which a category appears as a prefix if its leftmost daughter in the corresponding rule of G2 is its head or specifier. Rule G2*ai conforms to the third principle, according to which a category appears as a prefix if its presence can be predicted from its right sister in G2. Finally, rules G2*aii,c,d conform to the fourth principle, according to which a category appears as a postfix if it cannot appear as a prefix according to the preceding three principles.

The affixed string that G2* generates as the representation of the structural description of E2 with respect to G2 is given in E2*.

(E2*) NP D the N̄ N boss VP V knew C̄ C that S NP D the N̄ N teacher G 's Ḡ N̄ N sister NP G 's Ḡ N̄ N neighbor NP G 's Ḡ N̄ N friend NP VP V believed C̄ C that S NP D the N̄ N student VP V saw NP D the N̄ N child S
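The way a mixed prefix-postfix string such as E2* arises from a constituent structure can be sketched as a tree linearizer. Everything specific in this sketch is an assumption made for illustration: the barred categories are spelled `Nbar`, `Cbar`, and `Gbar`; the possessive phrase is taken to be a barred `Gbar` built from an NP and the lexical category G; and the three postfix conditions are a hand reading of rules G2*a-ii, c, and d. The example sentence is a shortened variant of E2.

```python
def linearize(tree, after_c=False):
    """Turn a constituent structure (nested tuples) into a list of affixes and words."""
    label, *kids = tree
    if isinstance(kids[0], str):
        return [label, kids[0]]  # lexical categories are always prefixes
    body = []
    for i, kid in enumerate(kids):
        follows_c = i > 0 and kids[i - 1][0] == "C"
        body += linearize(kid, after_c=follows_c)
    postfix = (
        (label == "S" and not after_c)               # S is a prefix only after C
        or (label == "NP" and kids[0][0] == "Gbar")  # possessive NP
        or label == "Gbar"                           # possessive phrase
    )
    return body + [label] if postfix else [label] + body


def np(noun):
    """Helper for a simple determiner-noun NP such as 'the boss'."""
    return ("NP", ("D", "the"), ("Nbar", ("N", noun)))


# the boss knew that the teacher's sister saw the child
sister = ("NP", ("Gbar", np("teacher"), ("G", "'s")), ("Nbar", ("N", "sister")))
lower_s = ("S", sister, ("VP", ("V", "saw"), np("child")))
upper_s = ("S", np("boss"), ("VP", ("V", "knew"), ("Cbar", ("C", "that"), lower_s)))

print(" ".join(linearize(upper_s)))
```

The output has the same shape as E2*: the subordinate S appears as a prefix right after C, while the main-clause S and the possessive NP and Ḡ labels trail the material they cover.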

E2* can be interpreted as the structural description of E2 with respect to G2 by the rules in R, with the addition of a rule to handle unary nonlexical branching (as in G2e), and a modification of Rc to prevent a postfix from simply covering a sequence of affixes already covered by a prefix. (This restriction is needed to prevent the postfix S in E2* from simply covering any of the subordinate clauses in that expression.) It is worth noting how the application of those rules dynamically enlarges the NP that is covered by the S prefix that follows the words knew that. First the teacher is covered; then the teacher's sister; then the teacher's sister's neighbor; and finally the teacher's sister's neighbor's friend.

The derivation of E2* manifests first-degree center embedding of the category S*, as a result of the treatment of S as both a prefix and a suffix in G2*. However, no derivation of an affixed string generated by G2* manifests any greater degree of center embedding; hence, the affixed strings associated with the expressions of L2 can still be assigned to them by a finite-state parser. The added complexity involved in interpreting E2* results from the fact that all but the first of the NP-VP sequences in E2* are covered by prefix Ss, so that the constituents covered by the postfix S in E2* according to rule Rc are considerably far away from it.

It will be noted that we have provided two logically independent sets of principles by which affixed grammars may be constructed from a given CFPSG. The first set is explicitly designed to preserve the property of noncenter-embedding. The second is designed to maximize the use of prefixes on the basis of being able to predict the identity of a constituent by the time its leftmost descendent has been identified. There is no reason to believe a priori that affixed grammars constructed according to the second set of principles should preserve noncenter-embedding, and indeed as we have just seen, they don't. However, we conjecture that natural languages are designed so that representations of the structural descriptions of acceptable expressions of those languages can be assigned to them by finite-state parsers that operate by identifying constituents as quickly as possible. We call this the Efficient Finite-State Parser Hypothesis.

The four principles for determining whether to use a prefix or a postfix to mark the presence of a particular constituent apply to grammars that are center embedding as well as to those that are not. Suppose we extend the grammar G2 by replacing rules G2e and f by rules G2e' and f' respectively, and adding rules G2m-s as follows:

(G2) e'. N̄ → N (PP1)
     f'. VP → V (NP) ({PP2, C̄})
     m. [...]
     n. PP1 → P1 NP
     o. PP2 → P2 NP
     p. VP → VP {A, PP2}
     q. A → yesterday
     r. P1 → of
     s. P2 → {in, on, ...}

Among the expressions generated by the extended grammar G2 are those in E3.

(E3) a. the boss knew that the teacher saw the child yesterday
     b. the friend of the teacher's sister

Although each of the expressions in E3 is ambiguous with respect to G2, each has a strongly preferred interpretation. Moreover, under each interpretation, each of these sentences manifests first-degree center embedding. In E3a, the included VP saw the child is wholly contained in the including VP knew that the teacher saw the child yesterday; and in E3b, the included NP the teacher is wholly contained in the including NP the friend of the teacher's sister.

Curiously enough, the extension of the affix grammar that our principles derive from the extension of the grammar G2 just given associates only one affixed string with each of the expressions in E3. That grammar is obtained by replacing rules G2*e and f with G2*e' and f' respectively, and adding the rules G2*m-s as follows.

(G2*) e'. N̄* → N̄ N* (PP1*)
      f'. VP* → VP V* (NP*) ({PP2*, C̄*})
      m. [...]
      n. PP1* → PP1 P1* NP*
      o. PP2* → PP2 P2* NP*
      p. VP* → VP* {A*, PP2*} VP
      q. A* → A yesterday
      r. P1* → P1 of
      s. P2* → P2 {in, on, ...}

The affixed strings that the extended affix grammar G2* associates with the expressions in E3 are given in E3*.

(E3*) a. NP D the N̄ N boss VP V knew C̄ C that S NP D the N̄ N teacher VP V saw NP D the N̄ N child A yesterday VP S
      b. NP D the N̄ N friend PP1 P1 of NP D the N̄ N teacher G 's Ḡ N̄ N sister NP

We contend that the fact that the expressions in E3 have a single strongly preferred interpretation results from the fact that those expressions have a single affixed string associated with them. Consider first E3a and its associated affixed string E3*a. According to rule Rc, the affix VP following yesterday is a postfix which covers the affixes VP and A. Now, there is only one occurrence of A in E3*a, namely the one that immediately precedes yesterday; hence that must be the occurrence which is covered by the postfix VP. On the other hand, there are two occurrences of prefix VP in E3*a that can legitimately be covered by the postfix, the one before saw and the one before knew. Suppose in such circumstances, rule Rc picks out the nearer prefix. Then automatically the complex VP, saw the child yesterday, is covered by the subordinate S prefix, in accordance with the natural interpretation of the expression as a whole.

Consider next E3b and its associated affixed string E3*b. According to rule Rc, the affix Ḡ following 's is a postfix that covers the affixes NP and G. Two occurrences of the prefix NP are available to be covered; again, we may suppose that rule Rc picks out the nearer one. If so, then automatically the complex NP, the teacher's sister, is covered by PP1, again in accordance with the natural interpretation of the expression as a whole.
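The "nearer prefix" choice can be illustrated directly on E3*a. The sketch below is an assumption-laden illustration, not part of the paper: it treats every earlier occurrence of the same label as a candidate prefix (true of E3*a, where both earlier VPs are prefixes), it spells the barred categories `Nbar` and `Cbar`, and the function names are hypothetical.

```python
def nearest_prefix(tokens, postfix_index):
    """Return all earlier occurrences of the postfix's label, and the nearest one."""
    label = tokens[postfix_index]
    candidates = [i for i in range(postfix_index) if tokens[i] == label]
    if not candidates:
        raise ValueError(f"no earlier {label} for the postfix to cover")
    return candidates, candidates[-1]


def words_between(tokens, start, end):
    """The words (lower-case tokens) lying between two positions in an affixed string."""
    return " ".join(t for t in tokens[start:end] if t.islower())


E3A_STAR = (
    "NP D the Nbar N boss VP V knew Cbar C that S NP D the Nbar N teacher "
    "VP V saw NP D the Nbar N child A yesterday VP S"
).split()

postfix_vp = len(E3A_STAR) - 2  # the VP that follows 'yesterday'
candidates, chosen = nearest_prefix(E3A_STAR, postfix_vp)

for start in candidates:
    mark = "chosen" if start == chosen else "rejected"
    print(f"{mark}: {words_between(E3A_STAR, start, postfix_vp)}")
# prints:
# rejected: knew that the teacher saw the child yesterday
# chosen: saw the child yesterday
```

Picking the last candidate rather than the first is all that distinguishes the preferred reading from the dispreferred one, which is why a single affixed string can still yield a single interpretation.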

This completes our demonstration of the ability of affixed strings to represent the structural descriptions of the acceptable sentences of a natural language in a manner which enables them to be parsed by a finite-state device, and which also predicts the way in which (at least) certain expressions with center embedding are actually interpreted. Much more could be said about the system of representation we propose, but time and space limitations preclude further discussion here. We leave as exercises to the reader the demonstration that the expression E4a has a single affixed string associated with it by G2*, and that the left-branching (stacked) interpretation of E4b is predicted to be preferred over the right-branching interpretation.

(E4) a. the student saw the teacher in the house
     b. the house in the woods near the stream

ACKNOWLEDGMENT

We thank Maria Edelstein for her invaluable help in developing the work presented here.

REFERENCES

Forster, Kenneth I. (1976). Accessing the mental lexicon. In R. J. Wales and E. T. Walker, eds., New Approaches to Language Mechanisms. Amsterdam: North-Holland.

Jackendoff, Ray S. (1977). X-Bar Syntax. Cambridge, Mass.: MIT Press.

Langendoen, D. Terence (1975). Finite-state parsing of phrase-structure languages and the status of readjustment rules in grammar. Linguistic Inquiry 6.533-54.
