D. Terence Langendoen
Yedidyah Langsam
Departments of English and Computer & Information Science Brooklyn College of the City University of New York
Brooklyn, New York 11210 U.S.A.
ABSTRACT
A mixed prefix-postfix notation for representations of the constituent structures of the expressions of natural languages is proposed, which are of limited degree of center embedding if the original expressions are noncenter-embedding. The method of constructing these representations is applicable to expressions with center embedding, and results in representations which seem to reflect the ways in which people actually parse those expressions. Both the representations and their interpretations can be computed from the expressions from left to right by finite-state devices.
The acceptable expressions of a natural language L all manifest no more than a small, fixed, finite degree n of center embedding. From this observation, it follows that the ability of human beings to parse the expressions of L can be modeled by a finite transducer that associates with the acceptable expressions of L representations of the structural descriptions of those expressions. This paper considers some initial steps in the construction of such a model. The first step is to determine a method of representing the class of constituent structures of the expressions of L without center embedding in such a way that the members of that class themselves have no more than a small fixed finite degree of center embedding. Given a grammar that directly generates that class of constituent structures, it is not difficult to construct a deterministic finite-state transducer (parser) that assigns the appropriate members of that class to the noncenter-embedded expressions of L from left to right. The second step is to extend the method so that it is capable of representing the class of constituent structures of expressions of L with no more than degree n of center embedding in a manner that reflects the way human beings actually parse those sentences. Given certain reasonable assumptions about the character of the rules of grammar of natural languages, we show how this step can also be taken.
*This work was partly supported by a grant from the PSC-CUNY Faculty Research Award Program.
Let G be a context-free phrase-structure grammar (CFPSG). First, suppose that the category A in G is right-recursive; i.e., that there are subderivations with respect to G such that A ⇒ X A, where X is a nonnull string of symbols (terminal, nonterminal, or mixed). We seek a new CFPSG G*, derived from G, that contains the category A* (corresponding to A), such that there are subderivations with respect to G* of the form A* ⇒ X* A*, where X* represents the constituent structure of X with respect to G. Next, suppose that the category B in G is left-recursive; i.e., that there are subderivations with respect to G such that B ⇒ B Y, where Y is nonnull. We seek a new CFPSG G*, derived from G, that contains the category B* (corresponding to B), such that there are subderivations with respect to G* of the form B* ⇒ B* Y*, where Y* represents the constituent structure of Y with respect to G. In other words, given a grammar G, we seek a grammar G* that directly generates strings that represent the constituent structures of the noncenter-embedded expressions generated by G, that is right-recursive wherever G is right-recursive and is left-recursive wherever G is left-recursive.
In order to find such a G*, we must first determine what kinds of strings are available that can represent constituent structures and at the same time can be directly generated by noncenter-embedding grammars. Full bracketing diagrams are not suitable, since grammars that generate them are center embedding whenever the original grammars are left- or right-recursive (Langendoen 1975). Suppose, however, that we leave off right brackets in right-recursive structures and left brackets in left-recursive structures. In right-recursive structures, the positions of the left brackets that remain indicate where each constituent begins; the position where each constituent ends can be determined by a simple counting procedure provided that the number of daughters of that constituent is known (e.g., when the original grammar is in Chomsky normal form). Similarly, in left-recursive structures, the positions of the right brackets that remain indicate where each constituent ends, and the position where each constituent begins can also be determined simply by counting. Moreover, since brackets no longer occur in matched pairs, the brackets themselves can be omitted, leaving only the category labels. In left-recursive structures, these category symbols occur as postfixes; in right-recursive structures,
they occur as prefixes. We call a category symbol which occurs as a prefix or a postfix in a string that represents the constituent structure of an expression an affix; the strings themselves affixed strings; and the grammars that generate those strings affix grammars.
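The counting procedure for prefix-labelled strings can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' device: the string is purely prefix-labelled, the grammar is in Chomsky normal form (so a category immediately followed by a word has one daughter, and any other category has two), and words are written in lower case. The function names and the sample string, which uses the hypothetical rules A → B A, A → a, B → b, are chosen for the example.

```python
def is_word(token):
    # Assumption for this sketch: words are lower-case, category labels are not.
    return token.islower()

def constituent_end(tokens, start):
    """Return the index just past the constituent whose prefix label is at `start`."""
    pending = 1                      # constituents that still have to be completed
    j = start
    while pending > 0:
        tok = tokens[j]
        if is_word(tok):
            pending -= 1             # a word completes one pending item
        else:
            # Chomsky normal form: one word daughter or two category daughters.
            daughters = 1 if is_word(tokens[j + 1]) else 2
            pending += daughters - 1
        j += 1
    return j

tokens = "A B b A B b A a".split()
print(constituent_end(tokens, 0))    # 8: the outermost A spans the whole string
print(constituent_end(tokens, 1))    # 3: the first B covers only "b"
print(constituent_end(tokens, 3))    # 8: the embedded A runs to the end
```

No right brackets are needed: a single counter, updated once per token, is enough to locate the end of every constituent.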
To see how affix grammars may be constructed, consider the noncenter-embedding CFPSG G1, which generates the artificial language L1 = a(b*a)*b*a.
(G1) a. S → S A      b. A → B A
     c. A → a        d. B → b
     e. S → a
A noncenter-embedding affix grammar that generates the affixed strings that represent the constituent structures of the expressions of L1 with respect to G1 is given in G1*.
(G1*) a. S* → S* A* S    b. A* → A B* A*
      c. A* → A a        d. B* → B b
      e. S* → S a
Among the expressions generated by G1 is E1; the affixed string generated by G1* that represents its structural description is E1*.
(E1) a b b a b a
(E1*) S a A B b A B b A a S A B b A a S
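Because G1* is not center embedding, the affixed string can be emitted word by word from left to right. The following Python sketch is a hand-built finite-state transducer for this one grammar; the state names and the transition table are assumptions made for the illustration rather than a construction given in the paper. It maps the words of an expression generated by G1 to the corresponding affixed string.

```python
# (state, input word) -> (affixed tokens emitted, next state)
TRANSITIONS = {
    ("start",   "a"): ("S a",   "after_a"),   # rule e: S* -> S a
    ("after_a", "b"): ("A B b", "after_b"),   # rules b and d: open A, emit B b
    ("after_b", "b"): ("A B b", "after_b"),
    ("after_a", "a"): ("A a S", "after_a"),   # rule c, then the postfix S of rule a
    ("after_b", "a"): ("A a S", "after_a"),
}
ACCEPTING = {"after_a"}

def transduce(words):
    """Map the words of an expression generated by G1 to its affixed string."""
    state, output = "start", []
    for word in words:
        if (state, word) not in TRANSITIONS:
            raise ValueError(f"unexpected word {word!r} in state {state!r}")
        emitted, state = TRANSITIONS[(state, word)]
        output.append(emitted)
    if state not in ACCEPTING:
        raise ValueError("expression is incomplete")
    return " ".join(output)

print(transduce("a b b a b a".split()))
# S a A B b A B b A a S A B b A a S
```

The device needs only three states and no stack, which is the property the mixed prefix-postfix notation is meant to secure for noncenter-embedding grammars.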
Let us say that an affix covers elements in an affixed string which correspond to its constituents (not necessarily immediate). Then E1* may be interpreted as a structural description of E1 with respect to G1 according to the rules in R, in which J, K, and L are affixes; k is a word; x and y are substrings of affixed strings; and G is a CFPSG (in this case, G1).
(R) a. If K → k is a rule of G, then in the configuration ... K k ..., K is a prefix which covers k.
    b. If J → K L is a rule of G, then in the configuration ... J K x L ..., in which x does not contain L, J is a prefix which covers K L.
    c. If J → K L is a rule of G, then in the configuration ... K x L y J ..., in which x does not contain L and y does not contain K, J is a postfix which covers K L.
Coverage of constituents by the rules in R may be thought to be assigned dynamically from left to right.
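The left-to-right assignment of coverage can be sketched as follows for G1. This is a simplified illustration rather than the authors' procedure: it keeps an explicit stack for readability (so it is not itself a finite-state device), it hard-codes which rules of G1 are marked by prefixes and which by postfixes, and it replaces the conditions on x and y in Rb and Rc with a check that the two most recently completed constituents carry the right labels. All names in the code are assumptions made for the example.

```python
LEXICAL = {("S", "a"), ("A", "a"), ("B", "b")}   # rules K -> k of G1 (Ra)
PREFIX = {"A": ("B", "A")}                       # A -> B A, marked by a prefix (Rb)
POSTFIX = {"S": ("S", "A")}                      # S -> S A, marked by a postfix (Rc)

def complete(node):
    # A 1-tuple is a prefix that is still waiting for its daughters.
    return len(node) > 1

def interpret(tokens):
    """Rebuild the constituent structure encoded by an affixed string of G1*."""
    stack, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if (tok, nxt) in LEXICAL:                # Ra: prefix K covers the word k
            stack.append((tok, nxt))
            i += 2
        elif tok in PREFIX:                      # Rb: prefix J, daughters follow
            stack.append((tok,))
            i += 1
        elif tok in POSTFIX:                     # Rc: postfix J covers K and L
            right, left = stack.pop(), stack.pop()
            if not (complete(left) and complete(right)) \
                    or (left[0], right[0]) != POSTFIX[tok]:
                raise ValueError(f"postfix {tok!r} has nothing to cover")
            stack.append((tok, left, right))
            i += 1
        else:
            raise ValueError(f"unexpected token {tok!r}")
        # Close every open prefix whose two daughters are now complete.
        while (len(stack) >= 3 and not complete(stack[-3])
               and complete(stack[-2]) and complete(stack[-1])
               and PREFIX[stack[-3][0]] == (stack[-2][0], stack[-1][0])):
            right, left, opener = stack.pop(), stack.pop(), stack.pop()
            stack.append((opener[0], left, right))
    if len(stack) != 1:
        raise ValueError("affixed string is incomplete")
    return stack[0]

tree = interpret("S a A B b A B b A a S A B b A a S".split())
print(tree)
```

Applied to the affixed string for a b b a b a, the sketch returns a left-branching S whose two A daughters are right-branching, as G1 requires.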
A postfix is used in rule G1*a because the category S is left-recursive in G1, whereas a prefix is used in rule G1*b because the category A is right-recursive in G1. The use of prefixes in rules G1*c-e, on the other hand, is unmotivated if the only relevant principles are those having to do with direction of recursion. For affix grammars of natural languages, however, one can motivate the decision to use a particular type of affix by principles other than those having to do with direction of recursion.
The use of a prefix can be interpreted as indicating a decision (or guess) on the part of the language user as to the identity of a particular constituent on the basis of the identity of the first constituent in it. Since lexical items are assigned to lexical categories essentially as soon as they are recognized (Forster 1976), we may suppose first that prefixes are used for rules such as those in G1*c-e that assign lexical items to lexical categories. Second, if, as seems reasonable, a decision about the identity of constituents is always made as soon as possible, then we may suppose that prefixes are used for all rules in which the leftmost daughter of a particular constituent provides sufficient evidence for the identification of that constituent; e.g., if the leftmost daughter is either the specifier or the head of that constituent in the sense of Jackendoff (1977). Third, we may suppose that even if the leftmost daughter of a particular constituent does not provide sufficient evidence for the identification of that constituent, a prefix may still be used if that constituent is the left sister of a constituent that provides sufficient evidence for its identification. Fourth, we may suppose that postfixes are used in all other cases.
To illustrate the use of these four principles, consider the noncenter-embedding partial grammar G2 that generates a fragment of English that we call L2.
(G2) a. S → NP VP              b. NP → D N̄
     c. NP → Ḡ N̄              d. Ḡ → NP G
     e. N̄ → N                 f. VP → V ({NP, C̄})
     g. C̄ → C S               h. C → that
     i. D → the                j. G → 's
     k. N → {boss, child, ...}
     l. V → {knew, saw, ...}
Among the expressions of L2 are those with both right-recursion and left-recursion, such as E2.
(E2) the boss knew that the teacher's sister's neighbor's friend believed that the student saw the child
We now give an affix grammar G2* that directly generates affixed strings that represent the structural descriptions of the expressions of L2 with respect to G2, and that has been constructed in accordance with the four principles described above.
(G2*) a. i. S* → S NP* VP* / C __
         ii. S* → NP* VP* S / elsewhere
      b. NP* → NP D* N̄*       c. NP* → Ḡ* N̄* NP
      d. Ḡ* → NP* G* Ḡ        e. N̄* → N̄ N*
      f. VP* → VP V* ({NP*, C̄*})
      g. C̄* → C̄ C* S*         h. C* → C that
      i. D* → D the            j. G* → G 's
      k. N* → N {child, house, ...}
      l. V* → V {knew, saw, ...}
Rules G2*h-l conform to the first principle, according to which lexical categories generally appear as prefixes. Rules G2*b,e-g conform to the second principle, according to which a category appears as a prefix if its leftmost daughter in the corresponding rule of G2 is its head or specifier. Rule G2*ai conforms to the third principle, according to which a category appears as a prefix if its presence can be predicted from its right sister in G2. Finally, rules G2*aii,c,d conform to the fourth principle, according to which a category appears as a postfix if it cannot appear as a prefix according to the preceding three principles.
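The way the four principles are applied in order can be summarized as a small decision procedure. The sketch below is illustrative only: the three boolean properties attached to each rule are read off the classification just given, not computed from the grammar, and the function and table names are assumptions made for the example.

```python
def affix_type(is_lexical, leftmost_identifies, predicted_by_sister):
    """Return (affix type, number of the principle that decides it)."""
    if is_lexical:               # first principle: lexical categories
        return ("prefix", 1)
    if leftmost_identifies:      # second principle: leftmost daughter is head or specifier
        return ("prefix", 2)
    if predicted_by_sister:      # third principle: predictable from a sister constituent
        return ("prefix", 3)
    return ("postfix", 4)        # fourth principle: all other cases

# Properties of the rules of G2*, as stated in the text (hand-entered assumptions).
RULE_PROPERTIES = {
    "a.i":  (False, False, True),
    "a.ii": (False, False, False),
    "b":    (False, True, False),
    "c":    (False, False, False),
    "d":    (False, False, False),
    "e":    (False, True, False),
    "f":    (False, True, False),
    "g":    (False, True, False),
    **{name: (True, False, False) for name in "hijkl"},
}

for name, properties in RULE_PROPERTIES.items():
    kind, principle = affix_type(*properties)
    print(f"G2*{name}: {kind} (principle {principle})")
```

Ordering the tests this way makes the fourth principle a pure default: a postfix is chosen only when none of the three grounds for an early decision is available.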
The affixed string that G2* generates as the representation of the structural description of E2 with respect to G2 is given in E2*.
(E2*) NP D the N̄ N boss VP V knew C̄ C that S NP D the N̄ N teacher G 's Ḡ N̄ N sister NP G 's Ḡ N̄ N neighbor NP G 's Ḡ N̄ N friend NP VP V believed C̄ C that S NP D the N̄ N student VP V saw NP D the N̄ N child S
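The construction of this affixed string from a constituent structure can be sketched as a recursive traversal that places each category label before or after its daughters. The sketch is an illustration under several assumptions: trees are nested tuples, the bar-level categories are written with a combining macron (N̄, C̄, Ḡ), the genitive phrase and the possessed NP are taken to be the postfix-marked categories, and S is a prefix only when it immediately follows C. The helper names are invented for the example.

```python
BAR = "\u0304"                       # combining macron for bar-level labels
N_BAR, C_BAR, G_BAR = "N" + BAR, "C" + BAR, "G" + BAR

def np_the(noun):                    # NP -> D N-bar, N-bar -> N
    return ("NP", ("D", "the"), (N_BAR, ("N", noun)))

def genitive(possessor, noun):       # NP -> G-bar N-bar, G-bar -> NP G
    return ("NP", (G_BAR, possessor, ("G", "'s")), (N_BAR, ("N", noun)))

def clause(subject, verb, complement):   # S -> NP VP, VP -> V plus one complement
    return ("S", subject, ("VP", ("V", verb), complement))

def that(sentence):                  # C-bar -> C S
    return (C_BAR, ("C", "that"), sentence)

def affix(tree, after_c=False):
    """Return the affixed string of `tree` as a list of tokens."""
    label, *kids = tree
    if isinstance(kids[0], str):     # lexical rule: the category is a prefix
        return [label, kids[0]]
    body, previous = [], None
    for kid in kids:
        body += affix(kid, after_c=(previous == "C"))
        previous = kid[0]
    postfix = (
        label == G_BAR                                  # genitive phrase
        or (label == "NP" and kids[0][0] == G_BAR)      # possessed NP
        or (label == "S" and not after_c)               # S not preceded by C
    )
    return body + [label] if postfix else [label] + body

possessor = np_the("teacher")
for noun in ("sister", "neighbor", "friend"):
    possessor = genitive(possessor, noun)
E2 = clause(np_the("boss"), "knew",
            that(clause(possessor, "believed",
                        that(clause(np_the("student"), "saw", np_the("child"))))))
E2_STAR = " ".join(affix(E2))
print(E2_STAR)
```

The left-recursive genitive NP contributes a run of postfixes, the right-recursive complement clauses contribute prefixes, and only the outermost S is held back until the end of the string.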
E2* can be interpreted as the structural description of E2 with respect to G2 by the rules in R, with the addition of a rule to handle unary nonlexical branching (as in G2e), and a modification of Rc to prevent a postfix from simply covering a sequence of affixes already covered by a prefix. (This restriction is needed to prevent the postfix S in E2* from simply covering any of the subordinate clauses in that expression.) It is worth noting how the application of those rules dynamically enlarges the NP that is covered by the S prefix that follows the words "knew that". First "the teacher" is covered; then "the teacher's sister"; then "the teacher's sister's neighbor"; and finally "the teacher's sister's neighbor's friend".
The derivation of E2* manifests first-degree center embedding of the category S*, as a result of the treatment of S as both a prefix and a suffix in G2*. However, no derivation of an affixed string generated by G2* manifests any greater degree of center embedding; hence, the affixed strings associated with the expressions of L2 can still be assigned to them by a finite-state parser. The added complexity involved in interpreting E2* results from the fact that all but the first of the NP-VP sequences in E2* are covered by prefix Ss, so that the constituents covered by the postfix S in E2* according to rule Rc are considerably far away from it.
It will be noted that we have provided two logically independent sets of principles by which affixed grammars may be constructed from a given CFPSG. The first set is explicitly designed to preserve the property of noncenter-embedding. The second is designed to maximize the use of prefixes on the basis of being able to predict the identity of a constituent by the time its leftmost descendent has been identified. There is no reason to believe a priori that affixed grammars constructed according to the second set of principles should preserve noncenter-embedding, and indeed as we have just seen, they don't. However, we conjecture that natural languages are designed so that representations of the structural descriptions of acceptable expressions of those languages can be assigned to them by finite-state parsers that operate by identifying constituents as quickly as possible. We call this the Efficient Finite-State Parser Hypothesis.
The four principles for determining whether to use a prefix or a postfix to mark the presence of a particular constituent apply to grammars that are center embedding as well as to those that are not. Suppose we extend the grammar G2 by replacing rules G2e and f by rules G2e' and f' respectively, and adding rules G2m-s as follows:
(G2) e'. N̄ → N (PP1)
     f'. VP → V (NP) ({PP2, C̄})
     m. ...                    n. PP1 → P1 NP
     o. PP2 → P2 NP            p. VP → VP {A, PP2}
     q. A → yesterday          r. P1 → of
     s. P2 → {in, on, ...}
Among the expressions generated by the extended grammar G2 are those in E3.
(E3) a. the boss knew that the teacher saw the child yesterday
     b. the friend of the teacher's sister
Although each of the expressions in E3 is ambiguous with respect to G2, each has a strongly preferred interpretation. Moreover, under each interpretation, each of these sentences manifests first-degree center embedding. In E3a, the included VP "saw the child" is wholly contained in the including VP "knew that the teacher saw the child yesterday"; and in E3b, the included NP "the teacher" is wholly contained in the including NP "the friend of the teacher's sister".
Curiously enough, the extension of the affix grammar that our principles derive from the extension of the grammar G2 just given associates only one affixed string with each of the expressions in E3. That grammar is obtained by replacing rules G2*e and f with G2*e' and f' respectively, and adding the rules G2*m-s as follows.
(G2*) e'. N̄* → N̄ N* (PP1*)
      f'. VP* → VP V* (NP*) ({PP2*, C̄*})
      m. ...                        n. PP1* → PP1 P1* NP*
      o. PP2* → PP2 P2* NP*         p. VP* → VP* {A*, PP2*} VP
      q. A* → A yesterday           r. P1* → P1 of
      s. P2* → P2 {in, on, ...}
The affixed strings that the extended affix grammar G2* associates with the expressions in E3 are given in E3*.
(E3*) a. NP D the N̄ N boss VP V knew C̄ C that S NP D the N̄ N teacher VP V saw NP D the N̄ N child A yesterday VP S
      b. NP D the N̄ N friend PP1 P1 of NP D the N̄ N teacher G 's Ḡ N̄ N sister NP
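That the two structural analyses of "the friend of the teacher's sister" collapse into one affixed string can be checked mechanically. The following sketch is an illustration under assumptions: trees are nested tuples, bar-level labels are written with a combining macron, and the prefix or postfix status of each category is hard-coded (postfix for the genitive phrase and for an NP that begins with one, prefix otherwise). The variable names are invented for the example.

```python
BAR = "\u0304"                       # combining macron for bar-level labels
N_BAR, G_BAR = "N" + BAR, "G" + BAR

def affix(tree):
    """Return the affixed string of `tree` as a list of tokens."""
    label, *kids = tree
    if isinstance(kids[0], str):     # lexical rule: prefix
        return [label, kids[0]]
    body = [token for kid in kids for token in affix(kid)]
    postfix = label == G_BAR or (label == "NP" and kids[0][0] == G_BAR)
    return body + [label] if postfix else [label] + body

def pp_of(np):                       # PP1 -> P1 NP
    return ("PP1", ("P1", "of"), np)

the_teacher = ("NP", ("D", "the"), (N_BAR, ("N", "teacher")))

# Reading 1: the friend of [the teacher's sister]
teachers_sister = ("NP", (G_BAR, the_teacher, ("G", "'s")), (N_BAR, ("N", "sister")))
reading1 = ("NP", ("D", "the"), (N_BAR, ("N", "friend"), pp_of(teachers_sister)))

# Reading 2: [the friend of the teacher]'s sister
friend_of_teacher = ("NP", ("D", "the"), (N_BAR, ("N", "friend"), pp_of(the_teacher)))
reading2 = ("NP", (G_BAR, friend_of_teacher, ("G", "'s")), (N_BAR, ("N", "sister")))

print(" ".join(affix(reading1)))
print(affix(reading1) == affix(reading2))
```

The two trees differ only in where the NP that closes before 's begins, and that is exactly the information a postfix leaves unspecified.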
We contend that the fact that the expressions in E3 have a single strongly preferred interpretation results from the fact that those expressions have a single affixed string associated with them. Consider first E3a and its associated affixed string E3*a. According to rule Rc, the affix VP following "yesterday" is a postfix which covers the affixes VP and A. Now, there is only one occurrence of A in E3*a, namely the one that immediately precedes "yesterday"; hence that must be the occurrence which is covered by the postfix VP. On the other hand, there are two occurrences of prefix VP in E3*a that can legitimately be covered by the postfix, the one before "saw" and the one before "knew". Suppose in such circumstances, rule Rc picks out the nearer prefix. Then automatically the complex VP, "saw the child yesterday", is covered by the subordinate S prefix, in accordance with the natural interpretation of the expression as a whole.
Consider next E3b and its associated affixed string E3*b. According to rule Rc, the Ḡ (the affix that follows 's) is a postfix that covers the affixes NP and G. Two occurrences of the prefix NP are available to be covered; again, we may suppose that rule Rc picks out the nearer one. If so, then automatically the complex NP, "the teacher's sister", is covered by PP1, again in accordance with the natural interpretation of the expression as a whole.
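The "nearer prefix" choice for E3*a can be illustrated with a short Python sketch. It is not a full implementation of rule Rc: it only lists the earlier occurrences of the postfix's own label, selects the one closest to the postfix, and reports the words that end up inside the enlarged constituent. Writing bar-level labels with a macron, treating lower-case tokens as words, and the function names are assumptions made for the example.

```python
E3A_STAR = ("NP D the N̄ N boss VP V knew C̄ C that S NP D the N̄ N teacher "
            "VP V saw NP D the N̄ N child A yesterday VP S").split()

def is_word(token):
    # Assumption for this sketch: words are lower-case, category labels are not.
    return token.islower()

def nearest_prefix_cover(tokens, postfix_index):
    """Pick the nearest earlier occurrence of the postfix's label.

    Returns (candidate positions, chosen position, words covered)."""
    label = tokens[postfix_index]
    candidates = [i for i in range(postfix_index) if tokens[i] == label]
    nearest = max(candidates)                    # the closest one to the left
    covered = [t for t in tokens[nearest:postfix_index] if is_word(t)]
    return candidates, nearest, covered

postfix_vp = E3A_STAR.index("yesterday") + 1     # the VP that follows "yesterday"
candidates, chosen, words = nearest_prefix_cover(E3A_STAR, postfix_vp)
print(candidates)        # positions of the prefix VPs before "knew" and before "saw"
print(" ".join(words))   # saw the child yesterday
```

Choosing the other candidate, the VP before "knew", would instead attach "yesterday" to the main clause, which is the dispreferred reading.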
This completes our demonstration of the ability of affixed strings to represent the structural descriptions of the acceptable sentences of a natural language in a manner which enables them to be parsed by a finite-state device, and which also predicts the way in which (at least) certain expressions with center embedding are actually interpreted. Much more could be said about the system of representation we propose, but time and space limitations preclude further discussion here. We leave as exercises to the reader the demonstration that the expression E4a has a single affixed string associated with it by G2*, and that the left-branching (stacked) interpretation of E4b is predicted to be preferred over the right-branching interpretation.
(E4) a. the student saw the teacher in the house
     b. the house in the woods near the stream
ACKNOWLEDGMENT
We thank Maria Edelstein for her invaluable help in developing the work presented here.
REFERENCES
Forster, Kenneth I. (1976). Accessing the mental lexicon. In R.J. Wales and E.T. Walker, eds., New Approaches to Language Mechanisms. Amsterdam: North-Holland.
Jackendoff, Ray S. (1977). X-Bar Syntax. Cambridge, Mass.: MIT Press.
Langendoen, D. Terence (1975). Finite-state parsing of phrase-structure languages and the status of readjustment rules in grammar. Linguistic Inquiry 6.533-54.