
Deterministic Parsing of Syntactic Non-fluencies

Donald Hindle

Bell Laboratories, Murray Hill, New Jersey 07974

It is often remarked that natural language, used naturally, is unnaturally ungrammatical.* Spontaneous speech contains all manner of false starts, hesitations, and self-corrections that disrupt the well-formedness of strings. It is a mystery then, that despite this apparent wide deviation from grammatical norms, people have little difficulty understanding the non-fluent speech that is the essential medium of everyday life. And it is a still greater mystery that children can succeed in acquiring the grammar of a language on the basis of evidence provided by a mixed set of apparently grammatical and ungrammatical strings.

1 Self-correction: a Rule-governed System

In this paper I present a system of rules for resolving the non-fluencies of speech, implemented as part of a computational model of syntactic processing. The essential idea is that non-fluencies occur when a speaker corrects something that he or she has already said out loud. Since words once said cannot be unsaid, a speaker can only accomplish a self-correction by saying something additional, namely the intended words. The intended words are supposed to substitute for the wrongly produced words. For example, in sentence (1), the speaker initially said I but meant we.

(1) I was we were hungry

The problem for the hearer, as for any natural language understanding system, is to determine what words are to be expunged from the actual words said to find the intended sentence.

Labov (1966) provided the key to solving this problem when he noted that a phonetic signal (specifically, a markedly abrupt cut-off of the speech signal) always marks the site where self-correction takes place. Of course, finding the site of a self-correction is only half the problem; it remains to specify what should be removed. A first guess suggests that this must be a non-deterministic problem, requiring complex reasoning about what the speaker meant to say. Labov claimed that a simple set of rules operating on the surface string would specify exactly what should be changed, transforming nearly all non-fluent strings into fully grammatical sentences. The specific set of transformational rules Labov proposed were not formally adequate, in part because they were surface transformations which ignored syntactic constituenthood. But his work forms the basis of this current analysis.

* This research was done for the most part at the University of Pennsylvania, supported by the National Institute of Education under grants G78-0169 and G80-0163.

Labov's claim was not of course that ungrammatical sentences are never produced in speech, for that clearly would be false. Rather, it seems that truly ungrammatical productions represent only a tiny fraction of the spoken output, and in the preponderance of cases, an apparent ungrammaticality can be resolved by simple editing rules. In order to make sense of non-fluent speech, it is essential that the various types of grammatical deviation be distinguished. This point has sometimes been missed, and fundamentally different kinds of deviation from standard grammaticality have been treated together because they all present the same sort of problem for a natural language understanding system. For example, Hayes and Mouradian (1981) mix together speaker-initiated self-corrections with fragmentary sentences of all sorts:

people often leave out or repeat words or phrases, break off what they are saying and rephrase or replace it, speak in fragments, or otherwise use incorrect grammar (1981:231).

Ultimately, it will be essential to distinguish between non-fluent productions on the one hand, and constructions that are fully grammatical though not yet understood, on the other. Although we may not know in detail the correct characterization of such processes as ellipsis and conjunction, they are without doubt fully productive grammatical processes. Without an understanding of the differences in the kinds of non-fluencies that occur, we are left with a kind of grab bag of grammatical deviation that can never be analyzed except by some sort of general purpose mechanisms.

In this paper, I want to characterize the subset of spoken non-fluencies that can be treated as self-corrections, and to describe how they are handled in the context of a deterministic parser. I assume that a system for dealing with self-corrections similar to the one I describe must be a part of the competence of any natural language user. I will begin by discussing the range of non-fluencies that occur in speech. Then, after reviewing the notion of deterministic parsing, I will describe the model of parsing self-corrections in detail, and report results from a sample of 1500 sentences. Finally, I discuss some implications of this theory of self-correction, particularly for the problem of language acquisition.

2 Errors in Spontaneous Speech

Linguists have been of less help in describing the nature of spoken non-fluencies than might have been hoped; relatively little attention has been devoted to the actual performance of speakers, and studies that claim to be based on performance data seem to ignore the problem of non-fluencies. (Notable exceptions include Fromkin (1980), and Thompson (1980).) For the discussion of self-correction, I want to distinguish three types of non-fluencies that typically occur in speech.

1. Unusual Constructions. It is perhaps worth emphasizing that the mere fact that a parser does not handle a construction, or that linguists have not discussed it, does not mean that it is ungrammatical. In speech, there is a range of more or less unusual constructions which occur productively (some occur in writing as well), and which cannot be considered syntactically ill-formed. For example,

(2a) I imagine there's a lot of them must have had some good reasons not to go there

(2b) That's the only thing he does is fight

Sentence (2a) is an example of non-standard subject relative clauses that are common in speech. Sentence (2b), which seems to have two tensed "be" verbs in one clause, is a productive sentence type that occurs regularly, though rarely, in all sorts of spoken discourse (see Kroch and Hindle 1981). I assume that a correct and complete grammar for a parser will have to deal with all grammatical processes, marginal as well as central. I have nothing further to say about unusual constructions here.

2. True Ungrammaticalities. A small percentage of spoken utterances are truly ungrammatical. That is, they do not result from any regular grammatical process (however rare), nor are they instances of successful self-correction. Unexceptionable examples are hard to find, but the following give the flavor.

(3a) I've seen it happen is two girls fight

(3b) Today if you beat a guy wants to blow your head off for something

(3c) And aa a lot of the kids that are from our neighborhood -- there's one section that the kids aren't too think they would usually the the ones that were the the drop outs and the stoneheads

Labov (1966) reported that less than 2% of the sentences in a sample of a variety of types of conversational English were ungrammatical in this sense, a result that is confirmed by current work (Kroch and Hindle 1981).

3. Self-corrected strings. This type of non-fluency is the focus of this paper. Self-corrected strings all have the characteristic that some extraneous material was apparently inserted, and that expunging some substring results in a well-formed syntactic structure, which is apparently consistent with the meaning that is intended.

In the degenerate case, self-correction inserts non-lexical material, which the syntactic processor ignores, as in (4).

(4a) He was uh still asleep

(4b) I didn't ko go right into college

The minimal non-lexical material that self-correction might insert is the editing signal itself. Other cases (examples 6-10 below) are only interpretable given the assumption that certain words, which are potentially part of the syntactic structure, are to be removed from the syntactic analysis.

The status of the material that is corrected by self-correction and is expunged by the editing rules is somewhat odd. I use the term expunction to mean that it is removed from any further syntactic analysis. This does not mean however that a self-corrected string is unavailable for semantic processing. Although the self-corrected string is edited from the syntactic analysis, it is nevertheless available for semantic interpretation. Jefferson (1974) discusses the example

(5) [thuh] [thiy] officer

where the initial, self-corrected string (with the pre-consonantal form of the rather than the pre-vocalic form) makes it clear that the speaker originally intended to refer to the police by some word other than officer.

I should also note that the problems addressed by the self-correction component that I am concerned with are only part of the kind of deviance that occurs in natural language use. Many types of naturally occurring errors are not part of this system, for example, phonological and semantic errors. It is reasonable to hope that much of this dreck will be handled by similar subsystems. Of course, there will always remain errors that are outside of any system. But we expect that the apparent chaos is much more regular than it at first appears and that it can be modeled by the interaction of components that are themselves simple.

In the following discussion, I use the terms self-correction and editing more or less interchangeably, though the two terms emphasize the generation and interpretation aspects of the same process.

3 The Parser

The editing system that I will describe is implemented on top of a deterministic parser, called Fidditch, based on the processing principles proposed by Marcus (1980). It takes as input a sentence of standard words and returns a labeled bracketing that represents the syntactic structure as an annotated tree structure. Fidditch was designed to process transcripts of spontaneous speech, and to produce an analysis, partial if necessary, for a large corpus of interview transcripts. Because it is a deterministic parser, it produces only one analysis for each sentence. When Fidditch is unable to build larger constituents out of subphrases, it moves on to the next constituent of the sentence.

In brief, the parsing process proceeds as follows. The words in a transcribed sentence (where sentence means one tensed clause together with all subordinate clauses) are assigned a lexical category (or set of lexical categories) on the basis of a 2000 word lexicon and a morphological analyzer. The lexicon contains, for each word, a list of possible lexical categories, subcategorization information, and in a few cases, information on compound words. For example, the entry for round states that it is a noun, verb, adjective or preposition, that as a verb it is subcategorized for the movable particles out and up and for NP, and that it may be part of the compound adjective/preposition round about.
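To make the shape of such a lexical entry concrete, the following is a minimal Python sketch of the entry for round as described above. The names LexEntry, LEXICON, and lexical_categories, and the layout of the fields, are assumptions made for illustration; the paper does not specify how Fidditch actually stores its lexicon.

```python
from dataclasses import dataclass, field


@dataclass
class LexEntry:
    """Hypothetical lexicon record: categories, subcategorization, compounds."""
    categories: tuple
    subcat: dict = field(default_factory=dict)
    compounds: dict = field(default_factory=dict)


# Entry for "round", following the description in the text.
LEXICON = {
    "round": LexEntry(
        categories=("noun", "verb", "adjective", "preposition"),
        # As a verb: movable particles "out" and "up", and an NP complement.
        subcat={"verb": {"particles": ("out", "up"), "complements": ("NP",)}},
        # May be part of the compound adjective/preposition "round about".
        compounds={"round about": ("adjective", "preposition")},
    )
}


def lexical_categories(word):
    """Return the set of possible categories for a word (empty if unknown)."""
    entry = LEXICON.get(word.lower())
    return entry.categories if entry else ()
```

A word with several categories, like round, is exactly the case where later disambiguation from local context is needed.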

Once the lexical analysis is complete, the phrase structure tree is constructed on the basis of pattern-action rules using two internal data structures: 1) a push-down stack of incomplete nodes, and 2) a buffer of complete constituents, into which the grammar rules can look through a window of three constituents. The parser matches rule patterns to the configuration of the window and stack. Its basic actions include

-- starting to build a new node by pushing a category onto the stack

-- attaching the first element of the window to the stack

-- dropping subtrees from the stack into the first position in the window when they are complete

The parser proceeds deterministically in the sense that no aspect of the tree structure, once built, may be altered by any rule. (See Marcus 1980 for a comprehensive discussion of this theory of parsing.)
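The two data structures and the three basic actions just listed can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the class and method names (Node, ParserState, create, attach, drop) are hypothetical, and the pattern-action rules that would decide when each action fires are left out.

```python
from dataclasses import dataclass, field

WINDOW_SIZE = 3  # grammar rules see at most three complete constituents


@dataclass
class Node:
    category: str
    children: list = field(default_factory=list)


@dataclass
class ParserState:
    stack: list = field(default_factory=list)   # incomplete nodes; last item is the current node
    buffer: list = field(default_factory=list)  # complete constituents, leftmost first

    def window(self):
        """The part of the buffer visible to the rules."""
        return self.buffer[:WINDOW_SIZE]

    def create(self, category):
        """Start building a new node by pushing a category onto the stack."""
        self.stack.append(Node(category))

    def attach(self):
        """Attach the first element of the window to the current node."""
        self.stack[-1].children.append(self.buffer.pop(0))

    def drop(self):
        """Drop the completed current node into the first window position."""
        self.buffer.insert(0, self.stack.pop())


# Building an NP out of a determiner and a noun.
state = ParserState(buffer=[Node("DET"), Node("N"), Node("V"), Node("PRON")])
state.create("NP")
state.attach()
state.attach()
state.drop()
```

None of the actions modifies structure that has already been built, which is the sense of determinism described above.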

4 The self-correction rules

The self-correction rules specify how much, if anything, to expunge when an editing signal is detected. The rules depend crucially on being able to recognize an editing signal, for that marks the right edge of an expunction site. For the present discussion, I will assume little about the phonetic nature of the signal except that it is phonetically recognizable, and that, whatever their phonetic nature, all editing signals are, for the self-correction system, equivalent. Specifying the nature of the editing signal is, obviously, an area where further research is needed.

The only action that the editing rules can perform is expunction, by which I mean removing an element from the view of the parser. The rules never replace one element with another or insert an element in the parser data structures. However, both replacements and insertions can be accomplished within the self-correction system by expunction of partially identical strings. For example, in

(6) I am I was really annoyed

the self-correction rules will expunge the I am which precedes the editing signal, thereby in effect replacing am with was and inserting really.

Self-corrected strings can be viewed formally as having extra material inserted, but not involving either deletion or replacement of material. The linguistic system does seem to make use of both deletions and replacements in other subsystems of grammar however, namely in ellipsis and rank shift. As with the editing system, these are not errors but formal systems that interact with the central features of the syntax. True errors do of course occur involving all three logical possibilities (insertion, deletion, and replacement) but these are relatively rare.

The self-correction rules have access to the internal data structures of the parser, and like the parser itself, they operate deterministically. The parser views the editing signal as occurring at the end of a constituent, because it marks the right edge of an expunged element. There are two types of editing rules in the system: expunction of copies, for which there are three rules, and lexically triggered restarts, for which there is one rule.

4.1 Copy Editing

The copying rules say that if you have two elements which are the same and they are separated by an editing signal, the first should be expunged from the structure. Obviously the trick here is to determine what counts as copies. There are three specific places where copy editing applies.

SURFACE COPY EDITOR. This is essentially a non-syntactic rule that matches the surface string on either side of the editing signal, and expunges the first copy. It applies to the surface string (i.e., for transcripts, the orthographic string) before any syntactic processing. For example, in (7), the underlined strings are expunged before parsing begins.

(7a) Well if they'd -- if they'd had a knife I wou -- I wouldn't be here today

(7b) If they -- if they could do it

Typically, the Surface Copy Editor expunges a string of words that would later be analyzed as a constituent (or partial constituent), and would be expunged by the Category or the Stack Editors (as in 7a). However the string that is expunged by the Surface Copy Editor need not be dominated by a single node; it can be a sequence of unrelated constituents. For example, in (7b) the parser will not analyze the first if they as an SBAR node since there is no AUX node to trigger the start of a sentence, and therefore, the words will not be expunged by either the Category or the Stack editor. Such cases where the Surface Copy Editor must apply are rare, and it may therefore be that there exists an optimal parser grammar that would make the Surface Copy Editor redundant; all strings would be edited by the syntactically based Category and Stack Copy rules. However, it seems that the Surface Copy Editor must exist at some stage in the process of syntactic acquisition. The overlap between it and the other rules may be essential in learning.
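As a rough illustration of the Surface Copy Editor, the Python sketch below works on a list of orthographic tokens in which the editing signal is written as a double dash. It is not Fidditch's implementation. Two choices are assumptions of this sketch rather than claims of the paper: matching ignores case, and a word cut off before the signal (such as wou) is allowed to match as a prefix of the word that replaces it.

```python
EDIT_SIGNAL = "--"


def _same(before, after):
    """Compare two equally long word strings, ignoring case.

    Assumption of this sketch: the last word before the signal may be a
    cut-off fragment, so it only has to be a prefix of its counterpart.
    """
    b = [w.lower() for w in before]
    a = [w.lower() for w in after]
    return b[:-1] == a[:-1] and a[-1].startswith(b[-1])


def surface_copy_edit(tokens):
    """Expunge the first copy of a word string repeated across an edit signal."""
    out = []
    for i, tok in enumerate(tokens):
        if tok != EDIT_SIGNAL:
            out.append(tok)
            continue
        # Words following this signal, up to the next signal.
        after = []
        for t in tokens[i + 1:]:
            if t == EDIT_SIGNAL:
                break
            after.append(t)
        # Words preceding this signal, back to the last unresolved signal.
        start = len(out)
        while start > 0 and out[start - 1] != EDIT_SIGNAL:
            start -= 1
        before = out[start:]
        # Prefer the longest repeated string.
        for k in range(min(len(before), len(after)), 0, -1):
            if _same(before[-k:], after[:k]):
                del out[-k:]  # expunge the first copy; the signal is consumed
                break
        else:
            out.append(tok)  # no copy found: leave the signal for later rules
    return out


edited = surface_copy_edit(
    "Well if they'd -- if they'd had a knife I wou -- I wouldn't be here today".split()
)
```

When no surface copy is found, the signal is kept in the output, since the syntactically based editors still need to see it.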

CATEGORY COPY EDITOR. This copy editor matches syntactic constituents in the first two positions in the parser's buffer of complete constituents. When the first window position ends with an editing signal and the first and second constituents in the window are of the same type, the first is expunged. For example, in sentence (8) the first of two determiners separated by an editing signal is expunged and the first of two verbs is similarly expunged.

(8) I was just that the kind of guy that didn't have -- like to have people worrying
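The condition for the Category Copy Editor can be written down directly. The sketch below is a hedged Python illustration: Constituent, its fields, and the category labels DET, MODAL, and V are hypothetical names chosen for the example, and what counts as "the same type" is reduced here to equality of a single category label.

```python
from dataclasses import dataclass


@dataclass
class Constituent:
    category: str
    words: tuple
    ends_with_edit_signal: bool = False


def category_copy_edit(window):
    """Expunge the first window element if it ends with an edit signal and
    the second element is of the same category; otherwise change nothing."""
    if (
        len(window) >= 2
        and window[0].ends_with_edit_signal
        and window[0].category == window[1].category
    ):
        return window[1:]
    return window


# The two determiners of sentence (8): the first one is expunged.
window_8 = [
    Constituent("DET", ("that",), ends_with_edit_signal=True),
    Constituent("DET", ("the",)),
    Constituent("N", ("kind",)),
]
result_8 = category_copy_edit(window_8)
```

Because the rule only compares category labels, its behavior depends entirely on which distinctions the grammar draws between categories.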

STACK COPY EDITOR. If the first constituent in the window is preceded by an editing signal, the Stack Copy Editor looks into the stack for a constituent of the same type, and expunges any copy it finds there along with all descendants. (In the current implementation, the Stack Copy Editor is allowed to look at successive nodes in the stack, back to the first COMP node or attention shifting boundary. If it finds a copy, it expunges that copy along with any nodes that are at a shallower level in the stack. If Fidditch were allowed to attach incomplete constituents, the Stack Copy Editor could be implemented to delete the copy only, without searching through the stack. The specifics of the implementation seem not to matter for this discussion of the editing rules.) In sentence (9), the initial embedded sentence is expunged by the Stack Copy Editor.

(9) I think that you get -- it's more strict in Catholic schools
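The stack search described in the parenthesis can be sketched as follows, again as an illustration rather than the actual Fidditch code. The stack is simplified to a list of category labels with the current node last; the function name and the decision to test a node for being a copy before testing it for being a boundary are assumptions of this sketch.

```python
BOUNDARIES = {"COMP", "ATTENSHIFT"}  # the search does not go past these


def stack_copy_edit(stack, window, preceded_by_edit_signal):
    """Return the stack after Stack Copy Editing.

    stack: category labels, bottom first, current node last.
    window: category labels of the complete constituents in the window.
    """
    if not window or not preceded_by_edit_signal:
        return stack
    target = window[0]
    # Look at successive nodes, starting from the current node.
    for depth in range(len(stack) - 1, -1, -1):
        if stack[depth] == target:
            # Expunge the copy together with every shallower node.
            return stack[:depth]
        if stack[depth] in BOUNDARIES:
            break
    return stack


# A configuration like the one before the first Stack Copy Edit in section 4.2.
after_edit = stack_copy_edit(
    ["NP", "NP", "ATTENSHIFT", "NP", "AUX"], ["AUX", "V", "PRON"], True
)
```

Returning a new, shorter stack models expunction: the removed nodes are simply no longer visible to later rules.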


4.2 An Example

It will be useful to look a little more closely at the operation of the parser to see the editing rules at work. Sentence (10)

(10) I the the guys that I'm was telling you about were

includes three editing signals which trigger the copy editors. (Note also that the complement of were is ellipted.) I will show a trace of the parser at each of these correction stages.

The first editor that comes into play is the Surface Copy Editor, which searches for identical strings on either side of an editing signal, and expunges the first copy. This is done once for each sentence, before any lexical category assignments are made. Thus in effect, the Surface Copy Editor corresponds to a phonetic/phonological matching operation, although it is in fact an orthographic procedure because we are dealing with transcriptions. Obviously, a full understanding of the self-correction system calls for detailed phonetic/phonological investigations.

After the Surface Copy Editor has applied, the string that the lexical analyzer sees is (11)

(11) I the guys that I'm was telling you about were

rather than (10). Lexical assignments are made, and the parser proceeds to build the tree structures. After some processing, the configuration of the data structures is that shown in Figure 1.

[Figure 1. The parser state before the Stack Copy Editor applies.
NODE STACK: 5 NP< I -- >; 4 NP< the guys >; 3 ATTENSHIFT; 2 NP< I >; current AUX< am -- >
COMPLETE NODES IN WINDOW: AUX< was >, V< telling >, PRON< you >]

Before determining which rule to apply next, the two editing rules come into play, the Category Editor and the Stack Editor. At this pulse, the Stack Editor will apply because the first constituent in the window is the same (an AUX node) as the current active node, and the current node ends with an edit signal. As a result, the current node is popped into another dimension, leaving the parser data structures in the state shown in Figure 2.

[Figure 2. The parser state after Stack Copy Editing the AUX node.
NODE STACK (levels 4 through current), including NP< the guys >
COMPLETE NODES IN WINDOW: AUX< was >, V< telling >, PRON< you >]

Parsing of the sentence proceeds, and eventually reaches the state shown in Figure 3 where the Stack Editor conditions are again met. The current active node and the first element in the window are both NPs, and the active node ends with an edit signal. This causes the current node to be expunged, leaving only a single NP node, the one in the window. The final analysis of the sentence, after some more processing, is the tree shown in Figure 4.

[Figure 3. The parser state before the second application of the Stack Copy Editor.
NODE STACK: current NP< I -- >
COMPLETE NODES IN WINDOW: NP< the guys >, SBAR< that ... >, AUX< were >]

[Figure 4. The final analysis of sentence (10).
Parse tree; node labels in reading order: NP; DETER DART the; NOM; N guy; SBAR; COMP CMP that; NP t; S; NP PRON I; AUX TNS PAST, be, +ing; VP; V tell; NP PRON you; PREP about; NP t; AUX TNS PAST pl; VP V be]

I should reemphasize that the status of the edited elements is special. The copy editing rules remove a constituent, no matter how large, from the view of the parser. The parser continues as if those words had not been said. Although the expunged constituents may be available for semantic interpretation, they do not form part of the main predication.


4.3 Restarts

A somewhat different sort of self-correction, less sensitive to syntactic structure and flagged not only by the editing signal but also by a lexical item, is the restart. A restart triggers the expunction of all words from the edit signal back to the beginning of the sentence. It is signaled by a standard edit signal followed by a specific lexical item drawn from a set including well, ok, see, you know, like I said, etc. For example,

(12a) That's the way if well everybody was so stoned, anyway

(12b) But when I was young I went in oh I was nineteen years old

It seems likely that, in addition to the lexical signals, specific intonational signals may also be involved in restarts.
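The restart rule lends itself to a short sketch over the token string. In the Python illustration below, the trigger list contains only the items named in the text (the text says the set includes others), the function name is hypothetical, and the position of the double-dash edit signal in the test sentence, modeled on (12a), is an assumption, since the signal is not visible in the example as printed.

```python
EDIT_SIGNAL = "--"

# Only the triggers listed in the text; the full set is not specified there.
RESTART_TRIGGERS = [
    ("well",),
    ("ok",),
    ("see",),
    ("you", "know"),
    ("like", "i", "said"),
]


def restart_edit(tokens):
    """Expunge everything from the start of the sentence through the last
    edit signal that is immediately followed by a restart trigger."""
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] != EDIT_SIGNAL:
            continue
        following = tuple(w.lower() for w in tokens[i + 1:])
        if any(following[:len(t)] == t for t in RESTART_TRIGGERS):
            return tokens[i + 1:]
    return tokens


# Modeled on (12a); the placement of the edit signal is assumed.
restarted = restart_edit(
    "That's the way if -- well everybody was so stoned, anyway".split()
)
```

An edit signal with no trigger after it is left alone, so that the copy editors can deal with it.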

5 A sample

The editing system I have described has been applied to a corpus of over twenty hours of transcribed speech, in the process of using the parser to search for various syntactic constructions. The transcripts are of sociolinguistic interviews of the sort developed by Labov and designed to elicit unreflecting speech that approximates natural conversation.* They are conversational interviews covering a range of topics, and they typically include considerable non-fluency. (Over half the sentences in one 90 minute interview contained at least one non-fluency.)

The transcriptions are in standard orthography, with sentence boundaries indicated. The alternation of speakers' turns is indicated, but overlap is not. Editing signals, when noted by the transcriber, are indicated in the transcripts with a double dash. It is clear that this approach to transcription only imperfectly reflects the phonetics of editing signals; we can't be sure to what extent the editing signals in our transcripts represent facts about production and to what extent they represent facts about perception. Nevertheless, except for a general tendency toward underrepresentation, there seems to be no systematic bias in our transcriptions of the editing signals, and therefore our findings are not likely to be undone by a better understanding of the phonetics of self-correction.

One major problem in analyzing the syntax of English is the multiple category membership of words. In general, most decisions about category membership can be made on the basis of local context. However, by its nature, self-correction disrupts the local context, and therefore the disambiguation of lexical categories becomes a more difficult problem. It is not clear whether the rules for category disambiguation extend across an editing signal or not. The results I present depend on a successful disambiguation of the syntactic categories, though the algorithm to accomplish this is not completely specified. Thus, to test the self-correction routines I have, where necessary, imposed the proper category assignment.

Table 1 shows the result of this editing system in the parsing of the interview transcripts from one speaker. All in all this shows the editing system to be quite successful in resolving non-fluencies.

* The interviews for this study were conducted by Tony Kroch and by Anne Bower.

TABLE 1. SELF-CORRECTION RULE APPLICATION

total sentences                        1512
total sentences with no edit signal    1108 (73%)
expunction of
remaining unclear

6 Discussion

Although the editing rules for Fidditch are written as deterministic pattern-action rules of the same sort as the rules in the parsing grammar, their operation is in a sense isolable. The patterns of the self-correction rules are checked first, before any of the grammar rule patterns are checked, at each step in the parse. Despite this independence in terms of rule ordering, the operation of the self-correction component is closely tied to the grammar of the parser; for it is the parsing grammar that specifies what sort of constituents count as the same for copying. For example, if the grammar did not treat there as a noun phrase when it is subject of a sentence, the self-correction rules could not properly resolve a sentence like

(13) People there's a lot of people from Kennsington

because the editing rules would never recognize that people and there are the same sort of element. (Note that (13) cannot be treated as a Restart because the lexical trigger is not present.) Thus, the observed pattern of self-correction introduces empirical constraints on the set of features that are available for syntactic rules.
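The rule ordering described at the start of this section, with the self-correction patterns checked before any grammar rule pattern at each step, amounts to a simple priority scheme. The Python sketch below shows only that control flow. Representing rules as (pattern, action) pairs of functions over a dictionary state is an assumption made for the illustration, and the two example rules are placeholders, not rules from Fidditch.

```python
def step(state, edit_rules, grammar_rules):
    """Apply the first rule whose pattern matches the state.

    Editing rules are tried before any grammar rule; exactly one rule
    fires per step, and no alternatives are kept for backtracking.
    """
    for pattern, action in list(edit_rules) + list(grammar_rules):
        if pattern(state):
            return action(state)
    return state


# Placeholder rules: an editing rule that fires when an edit signal is
# pending, and a grammar rule that always matches.
EDIT_RULES = [
    (
        lambda s: s["edit_signal"],
        lambda s: {"edit_signal": False, "log": s["log"] + ["edit"]},
    )
]
GRAMMAR_RULES = [
    (
        lambda s: True,
        lambda s: {"edit_signal": s["edit_signal"], "log": s["log"] + ["grammar"]},
    )
]

s0 = {"edit_signal": True, "log": []}
s1 = step(s0, EDIT_RULES, GRAMMAR_RULES)  # the editing rule wins
s2 = step(s1, EDIT_RULES, GRAMMAR_RULES)  # now only the grammar rule matches
```

The ordering is independent of the grammar, but what the editing patterns can match still depends on the categories the grammar assigns, as the discussion of (13) shows.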

The self-correction rules impose constraints not only on what linguistic elements must count as the same, but also on what must count as different. For example, in sentence (14), could and be must be recognized as different sorts of elements in the grammar for the AUX node to be correctly resolved. If the grammar assigned the two words exactly the same part of speech, then the Category Copy Editor would necessarily apply, incorrectly expunging could.

(14) Kid could be a brain in school

It appears therefore that the pattern of self-corrections that occur represents a potentially rich source of evidence about the nature of syntactic categories.

Learnability. If the patterns of self-correction count as evidence about the nature of syntactic categories for the linguist, then this data must be equally available to the language learner. This would suggest that, far from being an impediment to language learning, non-fluencies may in fact facilitate language acquisition by highlighting equivalent classes.


This raises the general question of how children can acquire a language in the face of unrestrained non-fluency. How can a language learner sort out the grammatical from the ungrammatical strings? (The non-fluencies of speech are of course but one aspect of the degeneracy of input that makes language acquisition a puzzle.) The self-correction system I have described suggests that many non-fluent strings can be resolved with little detailed linguistic knowledge.

As Table 1 shows, about a quarter of the editing signals result in expunction of only non-linguistic material. This requires only an ability to distinguish linguistic from non-linguistic stuff, and it introduces the idea that edit signals signal an expunction site. Almost a third are resolved by the Surface Copying rule, which can be viewed simply as an instance of the general non-linguistic rule that multiple instances of the same thing count as a single instance. The category copying rules are generalizations of simple copying, applied to a knowledge of linguistic categories. Making the transition from surface copies to category copies is aided by the fact that there is considerable overlap in coverage, defining a path of expanding generalization. Thus at the earliest stages of learning, only the simplest, non-linguistic self-correction rules would come into play, and gradually the more syntactically integrated would be acquired.
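To make concrete how little linguistic knowledge surface copying requires, the following is a simplified sketch of a surface-copy matcher. It is an illustration under stated assumptions rather than the rule as implemented in the parser: the `--` token standing for the editing signal, the function name, the choice of the longest matching copy, and the example strings are all hypothetical. The only operation used is string equality on words.

```python
EDIT = "--"  # hypothetical token standing for the editing signal


def surface_copy_edit(tokens):
    """If the words just before an editing signal are repeated right after it,
    expunge the left-hand copy and the signal. Otherwise keep the signal so
    that later (category-based) rules could still see it."""
    out = []
    for i, tok in enumerate(tokens):
        if tok != EDIT:
            out.append(tok)
            continue
        after = tokens[i + 1:]
        best = 0
        # Look for the longest n such that the last n kept words equal
        # the first n words following the signal.
        for n in range(1, min(len(out), len(after)) + 1):
            if EDIT in after[:n]:
                break
            if out[-n:] == after[:n]:
                best = n
        if best:
            del out[-best:]   # expunge the left-hand copy, drop the signal
        else:
            out.append(tok)   # no surface copy: leave the signal in place
    return out


copied = surface_copy_edit("they went to -- went to the store".split())
no_copy = surface_copy_edit("i was -- we were hungry".split())
```

In the first (invented) input the repeated words went to are matched and the left-hand copy is removed. In the second, modeled on sentence (1), there is no identical surface string, so the sketch leaves the signal for a category-level rule.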

Contrast this self-correction system to an approach that handles non-fluencies by some general problem solving routines, for example Granger (1982), who proposes reasoning from what a speaker might be expected to say. Besides the obvious inefficiencies of general problem solving approaches, it is worth giving special emphasis to the problem with learnability. A general problem solving approach depends crucially on evaluating the likelihood of possible deviations from the norms. But a language learner has by definition only partial and possibly incorrect knowledge of the syntax, and is therefore unable to consistently identify deviations from the grammatical system. With the editing system I describe, the learner need not have the ability to recognize deviations from grammatical norms, but merely the non-linguistic ability to recognize copies of the same thing.

[...] correction component from the standpoint of parsing. However, it is clear that the origins are in the process of generation. The mechanism for editing self-corrections that I have proposed has as its essential operation expunging one of two identical elements. It is unable to expunge a sequence of two elements. (The Surface Copy Editor might be viewed as a counterexample to this claim, but see below.) Consider expunction now from the standpoint of the generator. Suppose self-correction bears a one-to-one relationship to a possible action of the generator (initiated by some monitoring component) which could be called ABANDON CONSTRUCT X. And suppose that this action can be initiated at any time up until CONSTRUCT X is completed, when a signal is returned that the construction is complete. Further suppose that ABANDON CONSTRUCT X causes an editing signal. When the speaker decides in the middle of some linguistic element to abandon it and start again, an editing signal is produced.
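The supposed generator action can be sketched as a toy model. Everything here is hypothetical: the function names, the `--` token for the editing signal, the `abandon_after` parameter standing in for the monitoring component's decision, and the choice of a whole clause as the construct are illustrative assumptions, not part of the system described in the paper. The sketch captures only the two suppositions in the text: abandonment is possible until the construct is complete, and abandonment causes an editing signal.

```python
EDIT = "--"  # hypothetical token standing for the editing signal


def construct(words, abandon_after=None):
    """Emit the words of one construct. If the (hypothetical) monitor asks to
    abandon after `abandon_after` words and the construct is not yet complete,
    stop and emit an editing signal. Returns (tokens, completed)."""
    out = []
    for n, word in enumerate(words):
        if abandon_after is not None and n == abandon_after:
            out.append(EDIT)   # ABANDON CONSTRUCT X causes an editing signal
            return out, False
        out.append(word)
    # All words emitted: the construct is complete and can no longer be abandoned.
    return out, True


def speak(attempts):
    """attempts: list of (words, abandon_after) pairs, produced in order."""
    stream = []
    for words, abandon_after in attempts:
        tokens, _completed = construct(words, abandon_after)
        stream.extend(tokens)
    return stream


# Modeled on sentence (1): the first clause is abandoned after two words.
utterance = speak([
    (["I", "was", "hungry"], 2),
    (["we", "were", "hungry"], None),
])
```

The resulting stream places the editing signal at the right edge of the abandoned material, which is where the parsing rules expect to find it.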

If this is an appropriate model, then the elements which are self-corrected should be exactly those elements that exist at some stage in the generation process. Thus, we should be able to find evidence for the units involved in generation by looking at the data of self-correction. And indeed, such evidence should be available to the language learner as well.

Summary

I have described the nature of self-corrected speech (which is a major source of spoken nonfluencies) and how it can be resolved by simple editing rules within the context of a deterministic parser. Two features are essential to the self-correction system: 1) every self-correction site (whether it results in the expunction of words or not) is marked by a phonetically identifiable signal placed at the right edge of the potential expunction site; and 2) the expunged part is the left-hand member of a pair of copies, one on each side of the editing signal. The copies may be of three types: 1) identical surface strings, which are edited by a matching rule that applies before syntactic analysis begins; 2) complete constituents, when two constituents of the same type appear in the parser's buffer; or 3) incomplete constituents, when the parser finds itself trying to complete a constituent of the same type as a constituent it has just completed. Whenever two such copies appear in such a configuration, and the first one ends with an editing signal, the first is expunged from further analysis. This editing system has been implemented as part of a deterministic parser, and tested on a wide range of sentences from transcribed speech. Further study of the self-correction system promises to provide insights into the units of production and the nature of linguistic categories.

Acknowledgements

My thanks to Tony Kroch, Mitch Marcus, and Ken Church for helpful comments on this work.

References

Fromkin, Victoria A., ed. 1980. Errors in Linguistic Performance: Slips of the Tongue, Ear, Pen and Hand. Academic Press: New York.

Granger, Richard H. 1982. Scruffy Text Understanding: Design and Implementation of 'Tolerant' Understanders. Proceedings of the 20th Annual Meeting of the ACL.

Hayes, Philip J. and George V. Mouradian. 1981. Flexible Parsing. American Journal of Computational Linguistics 7.4, 232-242.

Jefferson, Gail. 1974. Error correction as an interactional resource. Language in Society 2:181-199.

Kroch, Anthony and Donald Hindle. 1981. A quantitative study of the syntax of speech and writing. Final report to the National Institute of Education, grant 78-0169.

Labov, William. 1966. On the grammaticality of everyday speech. Paper presented at the Linguistic Society of America annual meeting.

Marcus, Mitchell P. 1980. A Theory of Syntactic Recognition for Natural Language. MIT Press: Cambridge, MA.

Thompson, Bozena H. 1980. A linguistic analysis of natural language communication with computers. [...] computational linguistics.
