Deterministic Parsing of Syntactic Non-fluencies
Donald Hindle
Bell Laboratories, Murray Hill, New Jersey 07974
It is often remarked that natural language, used naturally, is unnaturally ungrammatical.* Spontaneous speech contains all manner of false starts, hesitations, and self-corrections that disrupt the well-formedness of strings. It is a mystery then, that despite this apparent wide deviation from grammatical norms, people have little difficulty understanding the non-fluent speech that is the essential medium of everyday life. And it is a still greater mystery that children can succeed in acquiring the grammar of a language on the basis of evidence provided by a mixed set of apparently grammatical and ungrammatical strings.
1 Self-correction: a Rule-governed System
In this paper I present a system of rules for resolving the non-fluencies of speech, implemented as part of a computational model of syntactic processing. The essential idea is that non-fluencies occur when a speaker corrects something that he or she has already said out loud. Since words once said cannot be unsaid, a speaker can only accomplish a self-correction by saying something additional, namely the intended words. The intended words are supposed to substitute for the wrongly produced words. For example, in sentence (1), the speaker initially said I but meant we.
(1) I was we were hungry
The problem for the hearer, as for any natural language understanding system, is to determine what words are to be expunged from the actual words said to find the intended sentence.
Labov (1966) provided the key to solving this problem when he noted that a phonetic signal (specifically, a markedly abrupt cut-off of the speech signal) always marks the site where self-correction takes place. Of course, finding the site of a self-correction is only half the problem; it remains to specify what should be removed. A first guess suggests that this must be a non-deterministic problem, requiring complex reasoning about what the speaker meant to say. Labov claimed that a simple set of rules operating on the surface string would specify exactly what should be changed, transforming nearly all non-fluent strings into fully grammatical sentences. The specific set of transformational rules Labov proposed were not formally adequate, in part because they were surface transformations which ignored syntactic constituenthood. But his work forms the basis of this current analysis.
* This research was done for the most part at the University of Pennsylvania, supported by the National Institute of Education under grants G78-0169 and G80-0163.
Labov's claim was not of course that ungrammatical sentences are never produced in speech, for that clearly would be false. Rather, it seems that truly ungrammatical productions represent only a tiny fraction of the spoken output, and in the preponderance of cases, an apparent ungrammaticality can be resolved by simple editing rules. In order to make sense of non-fluent speech, it is essential that the various types of grammatical deviation be distinguished. This point has sometimes been missed, and fundamentally different kinds of deviation from standard grammaticality have been treated together because they all present the same sort of problem for a natural language understanding system. For example, Hayes and Mouradian (1981) mix together speaker-initiated self-corrections with fragmentary sentences of all sorts:
    people often leave out or repeat words or phrases, break off what they are saying and rephrase or replace it, speak in fragments, or otherwise use incorrect grammar (1981:231)
Ultimately, it will be essential to distinguish between non-fluent productions on the one hand, and constructions that are fully grammatical though not yet understood, on the other. Although we may not know in detail the correct characterization of such processes as ellipsis and conjunction, they are without doubt fully productive grammatical processes. Without an understanding of the differences in the kinds of non-fluencies that occur, we are left with a kind of grab bag of grammatical deviation that can never be analyzed except by some sort of general purpose mechanisms.
In this paper, I want to characterize the subset of spoken non-fluencies that can be treated as self-corrections, and to describe how they are handled in the context of a deterministic parser. I assume that a system for dealing with self-corrections similar to the one I describe must be a part of the competence of any natural language user. I will begin by discussing the range of non-fluencies that occur in speech. Then, after reviewing the notion of deterministic parsing, I will describe the model of parsing self-corrections in detail, and report results from a sample of 1500 sentences. Finally, I discuss some implications of this theory of self-correction, particularly for the problem of language acquisition.
2 Errors in Spontaneous Speech
Linguists have been of less help in describing the nature of spoken non-fluencies than might have been hoped; relatively little attention has been devoted to the actual performance of speakers, and studies that claim to be based on performance data seem to ignore the problem of non-fluencies. (Notable exceptions include Fromkin (1980), and Thompson (1980).) For the discussion of self-correction, I want to distinguish three types of non-fluencies that typically occur in speech.
1. Unusual Constructions. It is perhaps worth emphasizing that the mere fact that a parser does not handle a construction, or that linguists have not discussed it, does not mean that it is ungrammatical. In speech, there is a range of more or less unusual constructions which occur productively (some occur in writing as well), and which cannot be considered syntactically ill-formed. For example,
(2a) I imagine there's a lot of them must have had some good reasons not to go there
(2b) That's the only thing he does is fight
Sentence (2a) is an example of non-standard subject relative clauses that are common in speech. Sentence (2b), which seems to have two tensed "be" verbs in one clause, is a productive sentence type that occurs regularly, though rarely, in all sorts of spoken discourse (see Kroch and Hindle 1981). I assume that a correct and complete grammar for a parser will have to deal with all grammatical processes, marginal as well as central. I have nothing further to say about unusual constructions here.
2. True Ungrammaticalities. A small percentage of spoken utterances are truly ungrammatical. That is, they do not result from any regular grammatical process (however rare), nor are they instances of successful self-correction. Unexceptionable examples are hard to find, but the following give the flavor.
(3a) I've seen it happen is two girls fight
(3b) Today if you beat a guy wants to blow your head off for something
(3c) And aa a lot of the kids that are from our neighborhood -- there's one section that the kids aren't too think they would usually the the ones that were the the drop outs and the stoneheads
Labov (1966) reported that less than 2% of the sentences in a sample of a variety of types of conversational English were ungrammatical in this sense, a result that is confirmed by current work (Kroch and Hindle 1981).
3. Self-corrected strings. This type of non-fluency is the focus of this paper. Self-corrected strings all have the characteristic that some extraneous material was apparently inserted, and that expunging some substring results in a well-formed syntactic structure, which is apparently consistent with the meaning that is intended.
In the degenerate case, self-correction inserts non-lexical material, which the syntactic processor ignores, as in (4).
(4a) He was uh still asleep
(4b) I didn't ko go right into college
The minimal non-lexical material that self-correction might insert is the editing signal itself. Other cases (examples 6-10 below) are only interpretable given the assumption that certain words, which are potentially part of the syntactic structure, are to be removed from the syntactic analysis.
The status of the material that is corrected by self-correction and is expunged by the editing rules is somewhat odd. I use the term expunction to mean that it is removed from any further syntactic analysis. This does not mean however that a self-corrected string is unavailable for semantic processing. Although the self-corrected string is edited from the syntactic analysis, it is nevertheless available for semantic interpretation. Jefferson (1974) discusses the example
(5) [thuh] [thiy] officer
where the initial, self-corrected string (with the pre-consonantal form of the rather than the pre-vocalic form) makes it clear that the speaker originally intended to refer to the police by some word other than officer.
I should also note that the problems addressed by the self-correction component that I am concerned with are only part of the kind of deviance that occurs in natural language use. Many types of naturally occurring errors are not part of this system, for example, phonological and semantic errors. It is reasonable to hope that much of this dreck will be handled by similar subsystems. Of course, there will always remain errors that are outside of any system. But we expect that the apparent chaos is much more regular than it at first appears and that it can be modeled by the interaction of components that are themselves simple.
In the following discussion, I use the terms self-correction and editing more or less interchangeably, though the two terms emphasize the generation and interpretation aspects of the same process.
3 The Parser
The editing system that I will describe is implemented on top of a deterministic parser, called Fidditch, based on the processing principles proposed by Marcus (1980). It takes as input a sentence of standard words and returns a labeled bracketing that represents the syntactic structure as an annotated tree structure. Fidditch was designed to process transcripts of spontaneous speech, and to produce an analysis, partial if necessary, for a large corpus of interview transcripts. Because it is a deterministic parser, it produces only one analysis for each sentence. When Fidditch is unable to build larger constituents out of subphrases, it moves on to the next constituent of the sentence.
In brief, the parsing process proceeds as follows. The words in a transcribed sentence (where sentence means one tensed clause together with all subordinate clauses) are assigned a lexical category (or set of lexical categories) on the basis of a 2000 word lexicon and a morphological analyzer. The lexicon contains, for each word, a list of possible lexical categories, subcategorization information, and in a few cases, information on compound words. For example, the entry for round states that it is a noun, verb, adjective or preposition, that as a verb it is subcategorized for the movable particles out and up and for NP, and that it may be part of the compound adjective/preposition round about.
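A lexical entry of this kind can be pictured as a small record. The Python sketch below is illustrative only: the class LexEntry, the function lookup, the category labels, and the notation for subcategorization frames are assumptions made for this example, not Fidditch's actual representation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class LexEntry:
    """Hypothetical lexicon record: categories, subcategorization, compounds."""
    categories: Set[str]
    # category -> frames that the word subcategorizes for in that category
    subcat: Dict[str, List[str]] = field(default_factory=dict)
    # following word -> categories of the resulting compound
    compounds: Dict[str, Set[str]] = field(default_factory=dict)


# The entry for "round" as described in the text (labels are assumed).
LEXICON: Dict[str, LexEntry] = {
    "round": LexEntry(
        categories={"N", "V", "ADJ", "PREP"},
        subcat={"V": ["PARTICLE:out", "PARTICLE:up", "NP"]},
        compounds={"about": {"ADJ", "PREP"}},
    ),
}


def lookup(word: str, next_word: Optional[str] = None) -> Set[str]:
    """Return the set of candidate categories for a word.

    If the word and its successor form a listed compound, the categories
    of the compound are returned instead.
    """
    entry = LEXICON.get(word.lower())
    if entry is None:
        return set()
    if next_word is not None and next_word.lower() in entry.compounds:
        return set(entry.compounds[next_word.lower()])
    return set(entry.categories)
```

Here lookup("round") yields all four categories, while lookup("round", "about") yields only the two compound categories; choosing among the candidates is left to later disambiguation.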
Once the lexical analysis is complete, the phrase structure tree is constructed on the basis of pattern-action rules using two internal data structures: 1) a push-down stack of incomplete nodes, and 2) a buffer of complete constituents, into which the grammar rules can look through a window of three constituents. The parser matches rule patterns to the configuration of the window and stack. Its basic actions include
-- starting to build a new node by pushing a category onto the stack
-- attaching the first element of the window to the stack
-- dropping subtrees from the stack into the first position in the window when they are complete
The parser proceeds deterministically in the sense that no aspect of the tree structure, once built, may be altered by any rule. (See Marcus 1980 for a comprehensive discussion of this theory of parsing.)
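The two data structures and the three basic actions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the names Node, ParserState, create, attach, and drop are invented here, the pattern-action rules that decide which action to take are omitted, and nothing below is taken from the Fidditch implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    cat: str
    children: list = field(default_factory=list)


class ParserState:
    """Hypothetical parser state: a stack of incomplete nodes plus a buffer."""
    WINDOW_SIZE = 3  # grammar rules see only the first three buffer items

    def __init__(self, constituents: List[Node]):
        self.stack: List[Node] = []                   # incomplete nodes
        self.buffer: List[Node] = list(constituents)  # complete constituents

    def window(self) -> List[Node]:
        return self.buffer[:self.WINDOW_SIZE]

    def create(self, cat: str) -> None:
        """Start building a new node by pushing a category onto the stack."""
        self.stack.append(Node(cat))

    def attach(self) -> None:
        """Attach the first window element to the current active node."""
        self.stack[-1].children.append(self.buffer.pop(0))

    def drop(self) -> None:
        """Drop the completed active node into the first window position."""
        self.buffer.insert(0, self.stack.pop())


def demo() -> ParserState:
    # Build an NP out of "the guys"; the rule choices are made by hand here.
    state = ParserState([Node("DET", ["the"]), Node("N", ["guys"]),
                         Node("V", ["left"])])
    state.create("NP")
    state.attach()
    state.attach()
    state.drop()
    return state
```

None of the three actions modifies structure that has already been built; each only adds to it, which is the sense of determinism described above.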
4 The self-correction rules
The self-correction rules specify how much, if anything, to expunge when an editing signal is detected. The rules depend crucially on being able to recognize an editing signal, for that marks the right edge of an expunction site. For the present discussion, I will assume little about the phonetic nature of the signal except that it is phonetically recognizable, and that, whatever their phonetic nature, all editing signals are, for the self-correction system, equivalent. Specifying the nature of the editing signal is, obviously, an area where further research is needed.
The only action that the editing rules can perform is expunction, by which I mean removing an element from the view of the parser. The rules never replace one element with another or insert an element in the parser data structures. However, both replacements and insertions can be accomplished within the self-correction system by expunction of partially identical strings. For example, in
(6) I am -- I was really annoyed
the self-correction rules will expunge the I am which precedes the editing signal, thereby in effect replacing am with was and inserting really.
Self-corrected strings can be viewed formally as having extra material inserted, but not involving either deletion or replacement of material. The linguistic system does seem to make use of both deletions and replacements in other subsystems of grammar however, namely in ellipsis and rank shift. As with the editing system, these are not errors but formal systems that interact with the central features of the syntax. True errors do of course occur involving all three logical possibilities (insertion, deletion, and replacement) but these are relatively rare.
The self-correction rules have access to the internal data structures of the parser, and like the parser itself, they operate deterministically. The parser views the editing signal as occurring at the end of a constituent, because it marks the right edge of an expunged element. There are two types of editing rules in the system: expunction of copies, for which there are three rules, and lexically triggered restarts, for which there is one rule.
4.1 Copy Editing
The copying rules say that if you have two elements which are the same and they are separated by an editing signal, the first should be expunged from the structure. Obviously the trick here is to determine what counts as copies. There are three specific places where copy editing applies.
SURFACE COPY EDITOR. This is essentially a non-syntactic rule that matches the surface string on either side of the editing signal, and expunges the first copy. It applies to the surface string (i.e., for transcripts, the orthographic string) before any syntactic processing. For example, in (7), the underlined strings are expunged before parsing begins.
(7a) Well if they'd -- if they'd had a knife I wou -- I wouldn't be here today
(7b) If they -- if they could do it
Typically, the Surface Copy Editor expunges a string of words that would later be analyzed as a constituent (or partial constituent), and would be expunged by the Category or the Stack Editors (as in 7a). However the string that is expunged by the Surface Copy Editor need not be dominated by a single node; it can be a sequence of unrelated constituents. For example, in (7b) the parser will not analyze the first if they as an SBAR node since there is no AUX node to trigger the start of a sentence, and therefore, the words will not be expunged by either the Category or the Stack editor. Such cases where the Surface Copy Editor must apply are rare, and it may therefore be that there exists an optimal parser grammar that would make the Surface Copy Editor redundant; all strings would be edited by the syntactically based Category and Stack Copy rules. However, it seems that the Surface Copy Editor must exist at some stage in the process of syntactic acquisition. The overlap between it and the other rules may be essential in learning.
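The surface matching just described can be approximated by a short Python function. This is a sketch under several assumptions that are not stated in the text: the editing signal is a separate "--" token, words are compared case-insensitively and whole (cut-off fragments such as "wou" are not matched against longer words), and the longest matching copy is preferred. The name surface_copy_edit is invented for the example.

```python
from typing import List

EDIT = "--"  # editing signal, written as a double dash in the transcripts


def surface_copy_edit(tokens: List[str]) -> List[str]:
    """Expunge the first of two identical word strings around an edit signal.

    For each editing signal, look for the longest string of words that
    appears both immediately before and immediately after the signal.
    If one is found, remove the first copy together with the signal.
    Signals with no surface copy are left in place for later rules.
    """
    toks = list(tokens)
    i = 0
    while i < len(toks):
        if toks[i] != EDIT:
            i += 1
            continue
        max_n = min(i, len(toks) - i - 1)
        for n in range(max_n, 0, -1):
            left = [t.lower() for t in toks[i - n:i]]
            right = [t.lower() for t in toks[i + 1:i + 1 + n]]
            if EDIT not in left and left == right:
                del toks[i - n:i + 1]  # first copy plus the signal
                i -= n                 # now at the start of the kept copy
                break
        else:
            i += 1  # no copy found; keep the signal
    return toks
```

Applied to the words of (7b), the function returns "if they could do it". A string such as "I am -- I was really annoyed" is returned unchanged, since no identical word strings flank the signal; cases of that kind are left to the syntactic editors.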
CATEGORY COPY EDITOR. This copy editor matches syntactic constituents in the first two positions in the parser's buffer of complete constituents. When the first window position ends with an editing signal and the first and second constituents in the window are of the same type, the first is expunged. For example, in sentence (8) the first of two determiners separated by an editing signal is expunged and the first of two verbs is similarly expunged.
(8) I was just that -- the kind of guy that didn't have -- like to have people worrying
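The condition for the Category Copy Editor is simple enough to state as code. The Python sketch below is a hedged illustration: the Constituent record, the edit_after flag, and the category labels (DET, MODAL, V) are assumptions made for the example, and the real rule is written as a pattern-action rule over the parser's buffer rather than as a function.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Constituent:
    cat: str                  # syntactic type assigned by the grammar
    words: Tuple[str, ...]
    edit_after: bool = False  # True if the constituent ends with an edit signal


def category_copy_edit(buffer: List[Constituent]) -> List[Constituent]:
    """Expunge the first window constituent if it is a copy of the second.

    Two constituents count as copies when they have the same category
    and the first one ends with an editing signal.
    """
    if len(buffer) >= 2:
        first, second = buffer[0], buffer[1]
        if first.edit_after and first.cat == second.cat:
            return buffer[1:]
    return buffer


# The determiners of sentence (8): "that -- the kind ..."
window = [
    Constituent("DET", ("that",), edit_after=True),
    Constituent("DET", ("the",)),
    Constituent("N", ("kind",)),
]
edited = category_copy_edit(window)
```

After the call, edited begins with the determiner "the". Whether the rule fires depends entirely on which categories the grammar treats as the same, so two elements with distinct labels are never collapsed by this rule.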
STACK COPY EDITOR. If the first constituent in the window is preceded by an editing signal, the Stack Copy Editor looks into the stack for a constituent of the same type, and expunges any copy it finds there along with all descendants. (In the current implementation, the Stack Copy Editor is allowed to look at successive nodes in the stack, back to the first COMP node or attention shifting boundary. If it finds a copy, it expunges that copy along with any nodes that are at a shallower level in the stack. If Fidditch were allowed to attach incomplete constituents, the Stack Copy Editor could be implemented to delete the copy only, without searching through the stack. The specifics of the implementation seem not to matter for this discussion of the editing rules.) In sentence (9), the initial embedded sentence is expunged by the Stack Copy Editor.
(9) I think that you get -- it's more strict in Catholic schools
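The stack search described in the parenthesis can be sketched as follows. This Python fragment is an approximation under assumptions: the stack is a list with the current active node last, the labels COMP and ATTENSHIFT stand for the two kinds of boundary, a boundary node is itself checked for a category match before the search stops, and the names StackNode and stack_copy_edit are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StackNode:
    cat: str
    text: str = ""


# Assumed labels for the nodes that bound the search.
BOUNDARIES = {"COMP", "ATTENSHIFT"}


def stack_copy_edit(stack: List[StackNode], first_in_window: StackNode,
                    preceded_by_edit: bool) -> List[StackNode]:
    """Expunge a stack copy of the first window constituent.

    The search runs from the current active node (end of the list) toward
    deeper nodes, stopping at the first boundary. If a node of the same
    category is found, it is removed together with every node above it
    (the nodes at a shallower level in the stack).
    """
    if not preceded_by_edit:
        return stack
    for depth in range(len(stack) - 1, -1, -1):
        node = stack[depth]
        if node.cat == first_in_window.cat:
            return stack[:depth]
        if node.cat in BOUNDARIES:
            break
    return stack
```

With a stack ending in NP < I > and AUX < am > and a window beginning with AUX < was >, the call removes the AUX node from the stack and leaves the rest untouched. A copy lying beyond a boundary is not reached, so the stack is returned unchanged in that case.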
4.2 An Example
It will be useful to look a little more closely at the operation of the parser to see the editing rules at work. Sentence (10)
(10) I -- the -- the guys that I'm -- was telling you about were
includes three editing signals which trigger the copy editors. (Note also that the complement of were is ellipted.) I will show a trace of the parser at each of these correction stages. The first editor that comes into play is the Surface Copy Editor, which searches for identical strings on either side of an editing signal, and expunges the first copy. This is done once for each sentence, before any lexical category assignments are made. Thus in effect, the Surface Copy Editor corresponds to a phonetic/phonological matching operation, although it is in fact an orthographic procedure because we are dealing with transcriptions. Obviously, a full understanding of the self-correction system calls for detailed phonetic/phonological investigations.
After the Surface Copy Editor has applied, the string that the lexical analyzer sees is (11)
(11) I -- the guys that I'm -- was telling you about were
rather than (10). Lexical assignments are made, and the parser proceeds to build the tree structures. After some processing, the configuration of the data structures is that shown in Figure 1.
NODE STACK
5        NP < I -- >
4        NP < the guys >
3        ATTENSHIFT
2        NP < I >
current  AUX < am -- >
Before determining what next rule to apply, the two editing rules come into play, the Category Editor and the Stack Editor. At this pulse, the Stack Editor will apply because the first constituent in the window is the same (an AUX node) as the current active node, and the current node ends with an edit signal. As a result, the first window element is popped into another dimension, leaving the parser data structures in the state shown in Figure 2.
Parsing of the sentence proceeds, and eventually reaches the state shown in Figure 3 where the Stack Editor conditions are again met. The current active node and the first element in the window are both NPs, and the active node ends with an edit signal. This causes the current node to be expunged, leaving only a single NP node, the one in the window. The final analysis of the sentence, after some more processing, is the tree shown in Figure 4.
I should reemphasize that the status of the edited elements is special. The copy editing rules remove a constituent, no matter how large, from the view of the parser. The parser continues as if those words had not been said. Although the expunged constituents may be available for semantic interpretation, they do not form part of the main predication.
COMPLETE NODES IN WINDOW
[ AUX < was > ] [ V < telling > ] [ PRON < you > ]
Figure 1. The parser state before the Stack Copy Editor applies.

NODE STACK (levels 4, 3, 2, current)
NP < the guys >
COMPLETE NODES IN WINDOW
[ AUX < was > ] [ V < telling > ] [ PRON < you > ]
Figure 2. The parser state after Stack Copy Editing the AUX node.

NODE STACK
current  NP < I -- >
COMPLETE NODES IN WINDOW
[ NP < the guys > ] [ SBAR < that ... > ] [ AUX < were > ]
Figure 3. The parser state before the second application of the Stack Copy Editor.
[Parse tree. Node labels in order of appearance: NP; DETER DART the; NOM; N guy; SBAR; COMP CMP that; NP t; S; NP PRON I; AUX TNS PAST s, be, +ing; VP; V tell; NP PRON you; PREP about; NP t; AUX TNS PAST pl; VP V be]
Figure 4. The final analysis of sentence (10)
4.3 Restarts
A somewhat different sort of self-correction, less sensitive to syntactic structure and flagged not only by the editing signal but also by a lexical item, is the restart. A restart triggers the expunction of all words from the edit signal back to the beginning of the sentence. It is signaled by a standard edit signal followed by a specific lexical item drawn from a set including well, ok, see, you know, like I said, etc. For example,
(12a) That's the way if well everybody was so stoned, anyway
(12b) But when I was young I went in oh I was nineteen years old
It seems likely that, in addition to the lexical signals, specific intonational signals may also be involved in restarts.
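The restart rule lends itself to a compact sketch. The Python below is illustrative and rests on assumptions: the editing signal is a separate "--" token, the trigger set contains only the items listed above (the actual set is open-ended), each listed item is treated as a separate trigger, the trigger words themselves are kept, and when several signals qualify the last one is used. The placement of "--" before well in the sample sentence is also assumed. The name restart_edit is invented here.

```python
from typing import List

EDIT = "--"  # editing signal, written as a double dash in the transcripts

# Assumed trigger set, limited to the items listed in the text.
TRIGGERS = [("well",), ("ok",), ("see",), ("you", "know"), ("like", "i", "said")]


def restart_edit(tokens: List[str]) -> List[str]:
    """Expunge everything from a restart signal back to the sentence start.

    A restart is an editing signal immediately followed by one of the
    trigger expressions. The words after the signal, including the
    trigger, are what remain.
    """
    toks = list(tokens)
    for i in range(len(toks) - 1, -1, -1):
        if toks[i] != EDIT:
            continue
        following = [t.lower().strip(",.") for t in toks[i + 1:]]
        if any(tuple(following[:len(t)]) == t for t in TRIGGERS):
            return toks[i + 1:]
    return toks


sentence = "That's the way if -- well everybody was so stoned, anyway".split()
kept = restart_edit(sentence)
```

An editing signal that is not followed by a trigger leaves the sentence untouched, so the copy editors and the restart rule do not compete for the same input.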
5 A sample
The editing system I have described has been applied to a corpus of over twenty hours of transcribed speech, in the process of using the parser to search for various syntactic constructions. The transcripts are of sociolinguistic interviews of the sort developed by Labov and designed to elicit unreflecting speech that approximates natural conversation.* They are conversational interviews covering a range of topics, and they typically include considerable non-fluency. (Over half the sentences in one 90 minute interview contained at least one non-fluency.)
The transcriptions are in standard orthography, with sentence boundaries indicated. The alternation of speakers' turns is indicated, but overlap is not. Editing signals, when noted by the transcriber, are indicated in the transcripts with a double dash. It is clear that this approach to transcription only imperfectly reflects the phonetics of editing signals; we can't be sure to what extent the editing signals in our transcripts represent facts about production and to what extent they represent facts about perception. Nevertheless, except for a general tendency toward underrepresentation, there seems to be no systematic bias in our transcriptions of the editing signals, and therefore our findings are not likely to be undone by a better understanding of the phonetics of self-correction.
One major problem in analyzing the syntax of English is the multiple category membership of words. In general, most decisions about category membership can be made on the basis of local context. However, by its nature, self-correction disrupts the local context, and therefore the disambiguation of lexical categories becomes a more difficult problem. It is not clear whether the rules for category disambiguation extend across an editing signal or not. The results I present depend on a successful disambiguation of the syntactic categories, though the algorithm to accomplish this is not completely specified. Thus, to test the self-correction routines I have, where necessary, imposed the proper category assignment.
Table 1 shows the result of this editing system in the parsing of the interview transcripts from one speaker. All in all this shows the editing system to be quite successful in resolving non-fluencies.
* The interviews for this study were conducted by Tony Kroch and by Anne Bower.
TABLE 1. SELF-CORRECTION RULE APPLICATION
total sentences                        1512
total sentences with no edit signal    1108 (73%)
expunction of
remaining unclear
6 Discussion
Although the editing rules for Fidditch are written as deterministic pattern-action rules of the same sort as the rules in the parsing grammar, their operation is in a sense isolable. The patterns of the self-correction rules are checked first, before any of the grammar rule patterns are checked, at each step in the parse. Despite this independence in terms of rule ordering, the operation of the self-correction component is closely tied to the grammar of the parser; for it is the parsing grammar that specifies what sort of constituents count as the same for copying. For example, if the grammar did not treat there as a noun phrase when it is subject of a sentence, the self-correction rules could not properly resolve a sentence like
(13) People there's a lot of people from Kennsington
because the editing rules would never recognize that people and there are the same sort of element. (Note that (13) cannot be treated as a Restart because the lexical trigger is not present.) Thus, the observed pattern of self-correction introduces empirical constraints on the set of features that are available for syntactic rules.
The self-correction rules impose constraints not only on what linguistic elements must count as the same, but also on what must count as different. For example, in sentence (14), could and be must be recognized as different sorts of elements in the grammar for the AUX node to be correctly resolved. If the grammar assigned the two words exactly the same part of speech, then the Category Copy Editor would necessarily apply, incorrectly expunging could.
(14) Kid could be a brain in school
It appears therefore that the pattern of self-corrections that occur represents a potentially rich source of evidence about the nature of syntactic categories.
Learnability. If the patterns of self-correction count as evidence about the nature of syntactic categories for the linguist, then this data must be equally available to the language learner. This would suggest that, far from being an impediment to language learning, non-fluencies may in fact facilitate language acquisition by highlighting equivalent classes.
This raises the general question of how children can acquire a language in the face of unrestrained non-fluency. How can a language learner sort out the grammatical from the ungrammatical strings? (The non-fluencies of speech are of course but one aspect of the degeneracy of input that makes language acquisition a puzzle.) The self-correction system I have described suggests that many non-fluent strings can be resolved with little detailed linguistic knowledge.
As Table 1 shows, about a quarter of the editing signals result in expunction of only non-linguistic material. This requires only an ability to distinguish linguistic from non-linguistic material, and it introduces the idea that edit signals mark an expunction site. Almost a third are resolved by the Surface Copying rule, which can be viewed simply as an instance of the general non-linguistic rule that multiple instances of the same thing count as a single instance. The category copying rules are generalizations of simple copying, applied to a knowledge of linguistic categories. Making the transition from surface copies to category copies is aided by the fact that there is considerable overlap in coverage, defining a path of expanding generalization. Thus at the earliest stages of learning, only the simplest, non-linguistic self-correction rules would come into play, and gradually the more syntactically integrated would be acquired.
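The non-linguistic character of surface copying can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the implemented rule: the marker "--" for the editing signal, the function name surface_copy_edit, the choice of the longest matching copy, and the example sentence are all hypothetical.

```python
from typing import List

EDIT = "--"  # hypothetical marker standing in for the editing signal


def surface_copy_edit(tokens: List[str]) -> List[str]:
    """Expunge the left-hand copy when the words just before an
    editing signal are repeated just after it. No syntactic
    knowledge is used, only string identity."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == EDIT:
            after = tokens[i + 1:]
            # Assumption: prefer the longest run of n words that is
            # identical on both sides of the signal.
            n = min(len(out), len(after))
            while n > 0 and out[-n:] != after[:n]:
                n -= 1
            if n > 0:
                del out[-n:]  # expunge the first copy
                i += 1        # and drop the signal itself
                continue
        # Ordinary word, or a signal with no surface copy: keep it so
        # that later (syntactic) editing rules could still see it.
        out.append(tok)
        i += 1
    return out


repeated = "well if they'd -- if they'd had a knife".split()
no_copy = "I was -- we were hungry".split()
```

On the hypothetical input repeated, the first if they'd is expunged; on no_copy nothing matches on the surface, so the string is left unchanged for rules that know about categories.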
Contrast this self-correction system with an approach that handles non-fluencies by some general problem solving routines, for example Granger (1982), who proposes reasoning from what a speaker might be expected to say. Besides the obvious inefficiencies of general problem solving approaches, it is worth giving special emphasis to the problem with learnability. A general problem solving approach depends crucially on evaluating the likelihood of possible deviations from the norms. But a language learner has by definition only partial and possibly incorrect knowledge of the syntax, and is therefore unable to consistently identify deviations from the grammatical system. With the editing system I describe, the learner need not have the ability to recognize deviations from grammatical norms, but merely the non-linguistic ability to recognize copies of the same thing.
correction component from the standpoint of parsing. However, it is clear that the origins are in the process of generation. The mechanism for editing self-corrections that I have proposed has as its essential operation expunging one of two identical elements. It is unable to expunge a sequence of two elements. (The Surface Copy Editor might be viewed as a counterexample to this claim, but see below.) Consider expunction now from the standpoint of the generator. Suppose self-correction bears a one-to-one relationship to a possible action of the generator (initiated by some monitoring component) which could be called ABANDON CONSTRUCT X. And suppose that this action can be initiated at any time up until CONSTRUCT X is completed, when a signal is returned that the construction is complete. Further suppose that ABANDON CONSTRUCT X causes an editing signal. When the speaker decides in the middle of some linguistic element to abandon it and start again, an editing signal is produced.
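The supposed generator action can be sketched as a toy Python model. Everything in it is an assumption made for illustration: the functions construct and say, the abandon_after parameter standing in for the monitoring component, the "--" marker for the editing signal, and the example words.

```python
from typing import List, Optional, Tuple

EDIT = "--"  # hypothetical marker standing in for the editing signal


def construct(words: List[str],
              abandon_after: Optional[int] = None) -> Tuple[List[str], bool]:
    """Toy CONSTRUCT X: emit the words of one element in order.
    If the (hypothetical) monitor abandons the element after
    `abandon_after` words, emit an editing signal and stop.
    Returns (emitted words, completion flag)."""
    emitted: List[str] = []
    for k, word in enumerate(words):
        if abandon_after is not None and k == abandon_after:
            emitted.append(EDIT)   # ABANDON CONSTRUCT X causes a signal
            return emitted, False  # no completion signal is returned
        emitted.append(word)
    return emitted, True           # construction is complete


def say(first_try: List[str], abandon_after: Optional[int],
        retry: List[str]) -> List[str]:
    """Attempt an element; if it is abandoned, start it again."""
    emitted, completed = construct(first_try, abandon_after)
    if not completed:
        emitted += construct(retry)[0]
    return emitted


corrected = say(["the", "big", "dog"], 2, ["the", "small", "dog"])
fluent = say(["the", "big", "dog"], None, ["the", "small", "dog"])
```

In this toy model an editing signal appears only where an element was abandoned before completion, so each signal in the output corresponds to exactly one abandoned element.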
If this is an appropriate model, then the elements which are self-corrected should be exactly those elements that exist at some stage in the generation process. Thus, we should be able to find evidence for the units involved in generation by looking at the data of self-correction. And indeed, such evidence should be available to the language learner as well.
Summary
I have described the nature of self-corrected speech (which is a major source of spoken non-fluencies) and how it can be resolved by simple editing rules within the context of a deterministic parser. Two features are essential to the self-correction system: 1) every self-correction site (whether it results in the expunction of words or not) is marked by a phonetically identifiable signal placed at the right edge of the potential expunction site; and 2) the expunged part is the left-hand member of a pair of copies, one on each side of the editing signal. The copies may be of three types: 1) identical surface strings, which are edited by a matching rule that applies before syntactic analysis begins; 2) complete constituents, when two constituents of the same type appear in the parser's buffer; or 3) incomplete constituents, when the parser finds itself trying to complete a constituent of the same type as a constituent it has just completed. Whenever two such copies appear in such a configuration, and the first one ends with an editing signal, the first is expunged from further analysis. This editing system has been implemented as part of a deterministic parser, and tested on a wide range of sentences from transcribed speech. Further study of the self-correction system promises to provide insights into the units of production and the nature of linguistic categories.
Acknowledgements
My thanks to Tony Kroch, Mitch Marcus, and Ken Church for helpful comments on this work.
References
Fromkin, Victoria A., ed. 1980. Errors in Linguistic Performance: Slips of the Tongue, Ear, Pen and Hand. Academic Press: New York.
Granger, Richard H. 1982. Scruffy Text Understanding: Design and Implementation of 'Tolerant' Understanders. Proceedings of the 20th Annual Meeting of the ACL.
Hayes, Philip J. and George V. Mouradian. 1981. Flexible Parsing. American Journal of Computational Linguistics 7.4, 232-242.
Jefferson, Gail. 1974. Error correction as an interactional resource. Language in Society 2:181-199.
Kroch, Anthony and Donald Hindle. 1981. A quantitative study of the syntax of speech and writing. Final report to the National Institute of Education, grant 78-0169.
Labov, William. 1966. On the grammaticality of everyday speech. Paper presented at the Linguistic Society of America annual meeting.
Marcus, Mitchell P. 1980. A Theory of Syntactic Recognition for Natural Language. MIT Press: Cambridge, MA.
Thompson, Bozena H. 1980. A linguistic analysis of natural language communication with computers.
computational linguistics