Perception of Synthetic Speech Generated by Rule

Invited Paper
As the use of voice response systems employing synthetic speech becomes more widespread in consumer products, industrial and military applications, and aids for the handicapped, it will be necessary to develop reliable methods of comparing different synthesis systems and of assessing how human observers perceive and respond to the speech generated by these systems. The selection of a specific voice response system for a particular application depends on a wide variety of factors, only one of which is the inherent intelligibility of the speech generated by the synthesis routines. In this paper, we describe the results of several studies that applied measures of phoneme intelligibility, word recognition, and comprehension to assess the perception of synthetic speech. Several techniques were used to compare performance of different synthesis systems with natural speech and to learn more about how humans perceive synthetic speech generated by rule. Our findings suggest that the perception of synthetic speech depends on an interaction of several factors including the acoustic-phonetic properties of the speech signal, the requirements of the perceptual task, and the previous experience of the listener. Differences in perception between natural speech and high-quality synthetic speech appear to be related to the redundancy of the acoustic-phonetic information encoded in the speech signal.
In the not too distant past, voice output systems could be classified into two broad categories depending on the nature of the synthesis process. Speech-coding systems used a fixed set of parameters to reproduce a relatively limited vocabulary of utterances. These systems produced intelligible and acceptable speech [58], [59] at the cost of flexibility in terms of the range of utterances that could be produced. In contrast, synthetic speech produced by rule provided less intelligible and less natural sounding speech, but these systems had the capability of automatically converting unrestricted text in ASCII format into speech [2], [3]. Over the last few years, significant improvements in text-to-speech systems have begun to eliminate the advantages of simple coded-speech voice response systems over text-to-speech systems. Extensive research on improving the letter-to-sound rules and phonetic implementation rules used by these systems, as well as the techniques of diphone and demisyllable synthesis in text-to-speech systems, suggests that, in the near future, unrestricted text-to-speech voice response devices may produce highly intelligible and very natural sounding synthetic speech [23].

Manuscript received January 15, 1985; revised July 3, 1985. This research was supported, in part, under NIH Grant NS-12179 and, in part, under Contract AF-F 33615-83-K-0501 with the Air Force Systems Command, AFOSR, through the Aerospace Medical Research Laboratory, Wright-Patterson AFB, Ohio. Requests for reprints should be sent to the authors at the address below.

The authors are with the Speech Research Laboratory, Department of Psychology, Indiana University, Bloomington, IN 47405, USA.
As the quality of the speech generated by text-to-speech systems improves, it becomes necessary to be able to evaluate and compare the performance of different synthesis systems. The need for a systematic and reliable assessment of the capabilities of voice response devices becomes even more critical as the complexity of these systems increases. This is especially true when considering the advanced features that are now being offered by some of the newest systems such as DECtalk, Prose-2000, and Infovox, which provide capabilities for synthesis of different languages and generation of several different synthetic voices. It is also important, in its own right, to learn more about how human listeners perceive and understand synthetic speech and how performance with synthetic speech differs from natural speech.

If there existed a set of algorithms or a set of acoustic criteria that could be applied automatically to measure the quality of synthetic speech, there would be no question about describing the performance of a particular system or the effectiveness of new rules or synthesis techniques. Standards could be developed fairly easily and applied uniformly. Unfortunately, there is no existing method for automating the assessment of synthetic speech quality. The ultimate test of synthetic speech involves assessment and perceptual response by the human listener. Thus it is necessary to employ perceptual tests of synthetic speech under the conditions in which synthetic speech will be used. The perception of speech depends on the human listener as much as it does on the attributes of the acoustic signal itself and the system of rules used to generate the signal [42].
Although it is clear that the performance of systems that generate synthetic speech must be evaluated using objec-
tive perceptual tests, there have been only a handful of studies dealing with the intelligibility of synthetic speech over the years (e.g., [19], [38], [44]). And there have been even fewer discussions of the technical issues that surround the problem of measuring the intelligibility of synthetic speech (see [35], [a], [42]). Furthermore, it is important to
specify precisely which aspects of synthetic speech are being evaluated. On one hand, the perception and comprehension of synthetic speech can be measured using a variety of objective behavioral tests that provide precise and statistically reliable estimates of the performance of a particular voice response system in a specific condition. These tests investigate the transmission of linguistic information from the speech signal to the listener and address specific questions such as: 1) how accurately are synthetic phonemes and words recognized, 2) how well is the meaning of a synthetic utterance understood, and 3) how easy is it to perceive and understand synthetic speech. On the other hand, an equally important issue concerns the acceptability and naturalness of synthetic speech, and whether the listener prefers one type of speech output over another. Questions of listener preference cannot be addressed directly using objective performance measures such as the proportion of words correctly recognized or response latencies, but instead must be investigated more indirectly by asking the listener for his or her subjective impressions of the quality of synthetic speech, using questions designed to assess different dimensions of naturalness, acceptability, and preference [36].
In the Speech Research Laboratory at Indiana University, we have carried out a large number of studies over the last five years to learn more about the perception of synthetic speech generated automatically by rule using several text-to-speech systems (see [34], [37], [42], [44]). Strictly speaking, this work is not human factors research; that is, it is not designed to answer specific questions regarding the development and use of specific products or techniques. Rather, the goal of our research has been to provide more basic knowledge about the perception of synthetic speech under well-defined laboratory conditions. These research findings can then serve as a benchmark for subsequent human factors studies that may be motivated by more specific problems of using voice response systems for a particular application. In general, our research has focused on measuring the performance of human listeners who are required to perceive and respond to synthetic speech under a variety of task demands and experimental conditions. During the course of this work, we have also carried out several comparisons of the performance of human listeners on standardized perceptual tasks using synthetic speech generated by rule with a number of text-to-speech systems.
II. CONSTRAINTS ON PERFORMANCE OF HUMAN OBSERVERS
To provide a framework for interpreting the results of our research, we first consider a number of factors that are known to affect an observer's performance: 1) the specific demands imposed by a particular task, 2) the inherent limitations of the human information processing system, 3) the experience and training of the human listener, 4) the linguistic structure of the message set, and 5) the structure and quality of the speech signal.
A. Task Complexity
The first factor that constrains performance concerns the complexity of the tasks that engage an observer during the perception of speech. In some tasks, the response demands are relatively simple, such as deciding which of two known words was said. Other tasks are extremely complex, such as trying to recognize an unknown utterance from a virtually unlimited number of response alternatives while engaging in an activity that already requires attention. There is a substantial amount of research in the cognitive psychology and human factors literature demonstrating the powerful effects of perceptual set, instructions, subjective expectancies, cognitive load, and response set on performance in a variety of perceptual and cognitive tasks [63]. The amount of context and the degree of uncertainty in the task also affect an observer's performance in substantial ways [22]. Thus it is necessary to understand the requirements and demands of a particular task before drawing any strong inferences about an observer's behavior or performance.
B. Limitations on the Observer
The second factor influencing recognition of synthetic speech concerns the structural limitations on the human information processing system's ability to perceive, encode, store, and retrieve information. Because the nervous system cannot maintain all aspects of sensory stimulation (and therefore must integrate acoustic energy over time), very severe processing limitations have been found in the human observer's capacity to encode and store raw sensory data in memory. To overcome these capacity limitations, the listener must rapidly transform sensory input into more abstract neural codes for more stable storage in memory and subsequent processing operations. The bulk of the research in perception and cognitive processes over the last 25 years has identified human short-term memory (STM) as a major limitation on processing sensory input [50]. The amount of information that can be processed in and out of STM is severely limited by the listener's attentional state, past experience, and the quality of the original sensory input.
C. Experience and Training
The third factor concerns the ability of human observers to quickly learn effective cognitive and perceptual strategies to improve performance in almost any task. When given appropriate feedback and training, subjects can learn to classify novel stimuli, remember complex pattern sequences, and respond to rapidly changing stimulus patterns in different sensory modalities. Clearly, the flexibility of subjects in adapting to the specific demands of a task is an important constraint that must be considered and controlled in any attempt to evaluate the perception of synthetic speech by the human observer.
D. Message Set
The fourth factor relates to the structure of the message set; that is, the constraints on the number of possible messages and the organization and linguistic properties of the message set. A message set may consist of words that
are distinguished only by a single phoneme or may consist of words and phrases with very different lengths, stress patterns, and phonotactic structures. Use of this constraint by listeners depends on linguistic knowledge [27]. The choice and arrangement of speech sounds into words is constrained by the phonological rules of language; the arrangement of words in sentences is constrained by syntax; and finally, the meaning of individual words and the overall meaning of sentences in a text is constrained by the semantics and pragmatics of language. The contribution of these various levels of linguistic structure to perception will vary substantially from isolated words, to sentences, to passages of fluent continuous speech.
E. Signal Characteristics
The fifth factor refers to the acoustic-phonetic and prosodic structure of a synthetic utterance. This constraint refers to the veridicality of the acoustic properties of the synthetic speech signal compared to naturally produced speech. Speech signals may be thought of as the physical realization of a complex and hierarchically organized system of linguistic rules that map sounds onto meanings and meanings back onto sounds. At the lowest level in the system, the distinctive properties of the speech signal are constrained in substantial ways by vocal tract acoustics and articulation. The acoustic-phonetic structure of natural speech reflects these physical and contextual constraints; synthetic speech is an impoverished signal representing phonetic distinctions with only a limited subset of the acoustic properties used to convey phonetic information in natural speech. Furthermore, the acoustic properties used to represent segmental structure in synthetic speech are highly stylized and are insensitive to phonetic context when compared to natural speech.
There are basically three areas in which a text-to-speech system could produce errors that would impact the overall intelligibility of the speech: 1) the spelling-to-sound rules, 2) the computation and production of suprasegmental information, and 3) the phonetic implementation rules that convert the internal representation of phonemes and/or allophones into a speech waveform [2], [4]. In our previous research, we have found that phonetic implementation rules are a major factor in determining the segmental intelligibility of a voice response system [33]. In the perceptual studies described below, we have focused most of our attention on measures of segmental intelligibility, assuming that the letter-to-sound rules used by a particular text-to-speech system were applied correctly.
A. Phoneme Intelligibility
The task that has been used most often in previous studies evaluating synthetic speech, and is now accepted as the de facto standard measure of the segmental intelligibility of synthetic speech, is the Modified Rhyme Test ([14], [38]; however, see [a] for a different opinion). In the Modified Rhyme Test (MRT), subjects are required to identify a single word by choosing one of six alternative response words differing by a single phoneme in either initial or final position [18]. All the stimuli in the MRT are consonant-vowel-consonant (CVC) monosyllabic words; on half the trials, the responses share the vowel-consonant portion of the stimulus and, on the other half, the responses share the consonant-vowel portion. Thus the MRT provides a measure of the performance of listeners in identifying either the initial or final phoneme of a set of spoken words.
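To make the scoring concrete, the sketch below shows how closed-response MRT trials reduce to a simple proportion-correct computation. This is our illustration only: the trial records and word sets are hypothetical stand-ins, not the actual test materials of [18].

```python
# Illustrative scoring of closed-response MRT trials (hypothetical
# items, not the actual House et al. [18] word lists).

# Each trial: the word presented, its six response alternatives
# (sharing either the initial CV or the final VC), and the choice.
trials = [
    {"stimulus": "pin",  "alternatives": ["pin", "sin", "tin", "fin", "din", "win"],  "response": "pin"},
    {"stimulus": "lake", "alternatives": ["lake", "late", "lane", "lace", "lame", "laid"], "response": "late"},
    {"stimulus": "bat",  "alternatives": ["bat", "bad", "back", "ban", "bass", "bath"], "response": "bat"},
]

n_correct = sum(t["response"] == t["stimulus"] for t in trials)
print(f"Closed-response MRT: {100.0 * n_correct / len(trials):.1f}% correct")
```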
To date, we have evaluated natural speech and synthetic speech produced by five different text-to-speech systems: the Votrax Type-'N'-Talk, the Speech Plus Prose-2000, the MITalk-79 research system, Infovox, and DECtalk (see [15]). The major findings are summarized in Table 1.
Table 1  Percent Correct Performance Obtained for Modified Rhyme Test (MRT) Experiments Conducted at the Speech Research Laboratory (1979-1984)

System Tested (date)                      MRT Closed   MRT Open
Natural Speech* (11/79)                   99.4         97.2
MITalk-79 Research System* (6/79-9/79)    93.1         75.4
Prototype Prose-2000** (12/79)            87.6
Votrax Type-'N'-Talk*** (3/82)            67.2
DECtalk Paul v1.8+ (3/84)                 96.7         86.7
DECtalk Betty v1.8++ (8/84)               94.4         82.5
Current Working Prose+++ (9/84)
Prose-2000 v3.0 (3/85)                    94.3
Infovox (3/85)                            87.4

*Pisoni and Hunnicutt [44]
**Bernstein and Pisoni [5]
***Pisoni and Koen [45]
+Greene, Manous, and Pisoni [15]
++Greene, Manous, and Pisoni, 1984 (final report)
+++Manous, Greene, and Pisoni, 1984
Performance in the MRT for natural speech was the best at 99.4 percent correct. For DECtalk v1.8, we evaluated speech produced by "Paul" and "Betty," two of DECtalk's nine voices, and found different levels of performance for these voices: 96.7 percent of the words spoken by the Paul voice were identified correctly, while only 94.4 percent of Betty's words were identified correctly. The level of performance observed for the Paul voice comes the closest to natural speech and is considerably higher than performance for any of the other text-to-speech systems we have studied.

Performance on MITalk-produced speech was somewhat lower than either of the DECtalk v1.8 voices at 93.1 percent correct word identification. The prototype of the Prose-2000 produced speech that was identified at 87.6 percent correct; version 3.0 of the Prose-2000 has improved, with performance at 94.3 percent correct. The Infovox multilingual system produced English speech that was identified at 87.4 percent correct. Finally, the least intelligible synthetic speech was produced by the Votrax Type-'N'-Talk, with only 67.2 percent correct identification. These results, obtained under closely matched testing conditions in the same laboratory environment, show a wide range of variation
among currently available text-to-speech systems. In our view, these differences in performance directly reflect the amount of basic research that was carried out to develop the phonetic implementation rules of these different voice response systems.
In addition to the standard closed-response MRT, we have also used an open-response format version of the MRT. In this procedure, listeners are instructed to write down the word that they heard on each trial. This open-response format provides a measure of performance when constraints on the response set are minimized (all CVC words known to the listener, compared to the six alternative responses in the closed-response version). This procedure also provides information about the intelligibility of vowels that is not available in the closed-response set version of the MRT. A comparison of the closed- and open-response versions of the MRT for synthetic speech produced by different text-to-speech systems with natural speech indicates the degree to which listeners rely on response-set constraints. Some representative findings using the open-response MRT format are also shown in Table 1.
Performance on the open-response set MRT for natural speech was at 97.2 percent correct exact word identification, compared to 99.4 percent correct in the closed-response set task. Even when there are no strong constraints on the number of alternative responses for natural speech, performance is still better than for any of the text-to-speech systems with a constrained set of responses. For the MITalk-79 research system, performance in the open-set MRT task is, however, considerably worse at 75.4 percent correct. Similarly, DECtalk's Paul voice was identified at the 86.7 percent level; correct word identification for "Betty" was 82.5 percent correct. These results show a large interaction between intelligibility measured in the closed-response format MRT and the open-response format MRT. Although the rank ordering of intelligibility remains the same across the two forms of the MRT, it is clear that, as speech becomes less intelligible, listeners rely more heavily on response-set constraints to aid performance.
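The size of this interaction can be read directly off Table 1. The short computation below, using the closed- and open-response scores reported above, shows the closed-open gap widening as overall intelligibility falls; this gap serves as a rough index of reliance on response-set constraints.

```python
# Closed- vs open-response MRT scores from Table 1; the widening
# gap indexes reliance on response-set constraints.
scores = {  # system -> (closed %, open %)
    "Natural speech":     (99.4, 97.2),
    "DECtalk Paul v1.8":  (96.7, 86.7),
    "DECtalk Betty v1.8": (94.4, 82.5),
    "MITalk-79":          (93.1, 75.4),
}
for system, (closed, open_) in scores.items():
    print(f"{system:18s} closed-open gap: {closed - open_:4.1f} points")
```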
B. Word Recognition in Sentences
To examine the contribution of several linguistic constraints on performance, we compared word recognition in two types of sentence contexts. The first type of sentence context was syntactically correct and meaningful: the Harvard psychoacoustic sentences [13]. An example is given in (1) below:

Add salt before you fry the egg. (1)
The second type of sentence context was syntactically correct, but these sentences were semantically anomalous: the Haskins syntactic sentences [38]. These test sentences had the syntactic form of normal sentences, but they were nonsense. An example of this type of nonsense sentence is given in (2) below:

The old farm cost the blood. (2)
By comparing word recognition performance for these two classes of sentences, it was possible to determine the influence of sentence meaning and linguistic constraints on word recognition [15]. Table 2 shows percent correct word identification for meaningful and semantically anomalous sentences for natural speech and synthetic speech produced by MITalk-79, the Speech Plus Prose-2000 prototype, and DECtalk's Paul and Betty voices (v1.8).
Table 2  Percent Correct Word Recognition for Meaningful and Semantically Anomalous Sentence Contexts

Type of Speech                  Meaningful (%)   Anomalous (%)
Natural                         99.2             97.7
MITalk-79                       93.3             78.7
Prototype Prose-2000            83.7             64.5
DEC Paul v1.8                   95.3             86.8
DEC Betty v1.8                  90.5             75.1
Current Working Prose (9/84)    91.0
For natural and synthetic speech, word recognition was much better in meaningful sentences than in the semantically anomalous sentences. Furthermore, a comparison of correct word identification in these sentences reveals an interaction in performance, suggesting that listeners rely much more on semantic constraints as the speech becomes progressively less intelligible.
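For readers who want the scoring rule made explicit, the sketch below shows one plausible way of scoring keyword recognition in sentence transcriptions. It is our illustration: the exact-match criterion and the example transcriptions are assumptions, not necessarily the procedure used in the original studies.

```python
# Illustrative keyword scoring for sentence transcription
# (exact-match criterion and responses are hypothetical).

def keyword_score(keywords, transcription):
    """Proportion of target keywords reproduced in the transcription."""
    produced = set(transcription.lower().split())
    return sum(w in produced for w in keywords) / len(keywords)

meaningful = (["add", "salt", "fry", "egg"], "add salt before you fry the egg")
anomalous = (["old", "farm", "cost", "blood"], "the old farm caught the blood")

for label, (keys, heard) in (("meaningful", meaningful), ("anomalous", anomalous)):
    print(f"{label}: {100 * keyword_score(keys, heard):.0f}% keywords correct")
```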
C. Listening Comprehension
Spoken language understanding is a very complex cognitive process that involves the encoding of sensory information, retrieval of previously stored knowledge from long-term memory, and the subsequent interpretation and integration of various sources of knowledge available to a listener [26], [39]. Language comprehension, therefore, depends on a relatively large number of diverse and complex factors, many of which are still only poorly understood by cognitive psychologists at the present time. Measuring comprehension is difficult because of the interaction of several different knowledge sources in the comprehension process. This problem is made worse because there is no coherent theoretical model of language comprehension to guide the development of measurement procedures. Moreover, there are presently no theoretical models that can deal with the diverse strategies employed by listeners to mediate language understanding under a wide variety of listening conditions and task demands.
One of the factors that obviously plays an important role in listening comprehension is the quality of the initial input signal; that is, the intelligibility of the speech itself. But the acoustic-phonetic properties of the input signal are only one source of information used by listeners in speech perception and spoken language understanding. As we have seen from the results summarized in the previous sections, additional consideration must also be given to the contribution of higher levels of linguistic knowledge to perception and comprehension.

In our initial attempts to measure comprehension of synthetic speech, we wanted to obtain a gross estimate of how well listeners could understand the linguistic content of continuous, fluent speech produced by the MITalk-79 text-to-speech system (see [a], [44]). As far as we have been able to determine, little attention has actually been devoted to the problems surrounding comprehension of the linguistic content of synthetically produced speech, particularly passages of meaningful, fluent, continuous speech [21], [a].
To assess comprehension, we selected fifteen narrative
passages and an appropriate set of multiple-choice test
questions from several standardized adult reading comprehension tests [10], [20], [30], [55]. The passages were quite diverse, covering a wide range of topics, writing styles, and vocabulary. These passages were also selected to be interesting for subjects to listen to in the context of laboratory-based tests designed to assess language understanding. Since these test passages were chosen from several different types of reading tests, they varied in difficulty and style. This variation permitted us to evaluate the contribution of all of the individual components of a particular text-to-speech system to comprehension in one relatively gross measure. We assumed that the results of these comprehension tests would provide an initial benchmark against which the entire text-to-speech system could be evaluated with materials that would be comparable to those used in a more realistic application such as a reading machine for the blind or a database information retrieval system [1], [2].
In our initial study, we tested three groups of naive subjects with 20 subjects in each group [44]. One group of subjects listened to MITalk-79 versions of the passages, another listened to natural speech, while a third group read the passages silently. All three groups answered the same set of test questions immediately after each passage. In a subsequent study [5], a group of subjects listened to the prototype of the Speech Plus Prose-2000 (Speech Plus was then known as Telesensory Systems, Inc.). The same prose passages were used in this study as in the original study.
The comprehension results for all groups are summarized in Table 3.

Table 3  Percent Correct Performance on the Comprehension Tests [5], [a], [44]

                      1st Half       2nd Half       Total
                      (6 passages)   (6 passages)   (13 passages)
                      (%)            (%)            (%)
MITalk-79             64.1           74.8           70.3
Natural Speech        65.6           68.5           67.8
Prose-2000 (TSI)      60.9           67.3           65.2
Reading               76.1           77.2           77.2

Averaged over the last thirteen test passages, the
reading group showed a significant 7 percent advantage (p < 0.05) over the synthetic speech group and a 12 percent advantage over the Prose (TSI) group (p < 0.001). However, the differences in performance between the groups appeared to be localized primarily in the first half of the test. By the second half, performance for the groups listening to synthetic speech improved substantially, whereas performance for the reading group remained about the same. Although the scores for the natural speech group were slightly lower overall, no improvement was observed in their performance from the first half to the second half of the test.
The finding of improved performance in the second half of the test for subjects listening to synthetic speech is consistent with the earlier results from word recognition in sentences, which showed that recognition performance improves for synthetic speech after only a short period of exposure. These results suggest that the overall difference in performance between the groups is probably due to familiarity with the output of the synthesizer and not due to any inherent difference in the basic strategies used in comprehending or understanding the content of these passages.
D. Conclusions: Intelligibility and Comprehension
The results of the Modified Rhyme Test revealed relatively high levels of segmental intelligibility for speech generated by MITalk-79, Prose-2000, Infovox, and DECtalk. The results for the Votrax Type-'N'-Talk using this measure showed much lower levels of performance. The progression from MITalk-77 (the forerunner of the Prose-2000), to the MITalk-79 research system, to Infovox, to the Prose-2000, and finally to DECtalk shows the continual refinement of speech synthesis technology. With additional research and development, the speech generated by these high-quality text-to-speech systems may soon approach the almost perfect levels of intelligibility observed for natural speech under laboratory testing conditions.
The results from the two sentence tasks indicated that context is a powerful aid to recognition. When both semantic and syntactic information was available to subjects, higher levels of performance were obtained in recognizing words in sentences. However, when the use of semantic knowledge was reduced or eliminated, as in the Haskins sentences, subjects had, of necessity, to rely primarily on the acoustic-phonetic information in the signal and their knowledge of morphology. Clearly, the contribution of higher level sources of knowledge is responsible for the superior performance obtained on the Harvard sentences; in the absence of this knowledge, subjects' performance was considerably poorer.
Finally, the results of the listening comprehension tests reveal that subjects are able to correctly answer multiple-choice comprehension questions about the content of passages of fluent connected synthetic speech. After only a few minutes of exposure to the output of a speech synthesizer, comprehension improves substantially and eventually approximates the levels observed when subjects read the same passages of text or listen to naturally produced versions of the same materials. There are, however, a number of problems in measuring comprehension with the materials we have used. First, these materials were designed to measure reading comprehension, not listening comprehension. Thus for these tests, a reader was expected to re-read the material in order to answer some of the questions. The reader always has access to the passage; the listener cannot go back and hear some portion of the passage again. Second, the multiple-choice questions do not directly assess the perceptual processes used to encode the speech input. Moreover, these questions measure comprehension after the materials have been presented, thereby reflecting post-perceptual comprehension strategies and subject biases. Thus multiple-choice questions are not measures of the on-line, real-time cognitive processes used in comprehension but reflect the final product of comprehension.
IV. PERCEPTUAL ENCODING OF SYNTHETIC SPEECH
The results of the MRT and word identification studies of natural and synthetic speech clearly indicate that synthetic speech is less intelligible than natural speech. In addition, these studies demonstrate that, as synthetic speech becomes less intelligible, listeners rely more on linguistic
knowledge and response-set constraints to aid word identification. However, these studies do not account for the differences in perception between natural and synthetic speech; rather, they just demonstrate and describe some of these basic differences.
A. Lexical Decision and Naming Latencies
In order to begin to investigate differences in the perceptual processing of natural and synthetic speech, we carried out a series of experiments designed to measure the time listeners need to recognize words and pronounceable nonwords produced by a human talker and a text-to-speech system. In carrying out these studies, we wanted to know how long it takes a listener to recognize an isolated word, and how the process of word recognition might be affected by the quality of the acoustic-phonetic information in the signal. To measure the duration of the recognition process, we used a lexical decision task [41], [54]. Listeners were presented with a single word or a nonword stimulus item on each trial. Each listener was required to classify the item as either a "word" or a "nonword" as quickly and accurately as possible by pressing one of two buttons located on a response box that was interfaced to a minicomputer. Examples of the stimuli are shown in Table 4.
Table 4  Examples of Lexical Decision Stimuli

Words       Nonwords
PROMINENT   PRADAMENT
BAKED       BEPT
TINY        TADGY
GLASS       CEEP
PARENTS     PEEMERS
TOLD        TAVED
BLACK       BAEP
CONCERT     CAELIMPS
DARK        DUT
BABBLE      BURTLE
CRITIC      CRAENICK
BOUGHT      BUPPED
PAIN        POON
GORGEOUS    GAETLESS
COLORED     COOBERED
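The logic of the latency analysis can be sketched compactly. The summary below uses hypothetical response times, arranged only to mirror the direction of the effect reported next (synthetic items roughly 145 ms slower); it is not the original data or apparatus.

```python
# Summarizing lexical decision latencies by speech type
# (hypothetical response times, for illustration only).
from statistics import mean

# (speech_type, lexical_status, response_time_ms)
latencies = [
    ("natural", "word", 620),   ("natural", "nonword", 690),
    ("natural", "word", 640),   ("natural", "nonword", 710),
    ("synthetic", "word", 770), ("synthetic", "nonword", 830),
    ("synthetic", "word", 780), ("synthetic", "nonword", 850),
]

for speech in ("natural", "synthetic"):
    rts = [rt for s, _, rt in latencies if s == speech]
    print(f"{speech:9s} mean response time: {mean(rts):.0f} ms")
```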
Subjects responded significantly faster to natural words and nonwords than to the synthetic items: response times to the synthetic speech were 145 ms longer than response times to the natural speech. These findings demonstrate two important differences in perception between natural and synthetic speech. First, perception of synthetic speech requires more cognitive "effort" than the perception of natural speech. Second, because the differences in latency were observed for both words and nonwords alike, and therefore do not depend on the lexical status of the test item, the extra processing effort appears to be related to the process of extracting the acoustic-phonetic information from the signal and not the process of identifying words in the lexicon. In short, the pattern of results suggests that the perceptual processes used to encode synthetic speech require more cognitive "effort" or resources than the processes used to encode natural speech.
Similar results were obtained by Pisoni [42] in a naming task using natural and synthetic words and nonwords. As in the lexical decision experiment, subjects were much slower to name synthetic test items than natural test items. Moreover, this difference was again observed for both words and nonwords. The naming results demonstrate that the extra processing time needed for synthetic speech does not depend on the type of response made by the listener, since the results were comparable for both manual and vocal responses. Taken together, these two sets of findings demonstrate that early stages of encoding synthetic speech require more processing time than encoding natural speech. Several additional studies were carried out to determine the nature and extent of the encoding differences between natural and synthetic speech.
B. Consonant-Vowel (CV) Confusions
Several hypotheses can be proposed to account for the greater difficulty of encoding synthetic speech. One hypothesis that has been suggested recently is that synthetic speech is simply equivalent to "noisy" natural speech [8], [56]. That is, the acoustic-phonetic structure of synthetic speech is more difficult to encode than natural speech for the same reasons that natural speech presented in noise is hard to perceive: the acoustic cues to phonemes are obscured, masked, or physically degraded in some way by the masking noise. According to this view, synthetic speech is on the same continuum as natural speech, but it is degraded in comparison with natural speech. In contrast, an alternative hypothesis, and the one we prefer, is that synthetic speech is not like "noisy" or degraded natural speech at all, but instead may be thought of as "perceptually impoverished" relative to natural speech. By this account, synthetic speech is fundamentally different from natural speech in both degree and kind because many of the important criterial acoustic cues are either poorly represented or not represented at all.

Spoken language is structurally rich and redundant at all levels of linguistic analysis. In particular, natural speech is highly redundant at the level of acoustic-phonetic structure. Natural speech contains multiple acoustic cues for almost every phonetic distinction, and these cues change as a function of context, speaking rate, and talker. However, in synthesizing speech by rule, only a small subset of the possible cues is typically implemented in the phonetic implementation rules. As a result, some phonetic distinctions may be minimally cued, perhaps by only a single acoustic attribute. If all cues do not have equal importance in different phonetic contexts, a single cue may not be perceptually sufficient to convey a particular phonetic distinction in all utterances (see [12]). Moreover, the reliance on minimal sets of cues in generating synthetic speech could be disastrous for perception if a particular phonetic distinction is incorrectly synthesized or masked by environmental noise. Indeed, many of the errors we have found in our analyses of the MRT data suggest that this account is correct.

These two hypotheses concerning the structural relationships between synthetic and natural speech make different predictions about the types of errors and the distribution of perceptual confusions that should be observed with synthetic speech compared to natural speech. According to the
"noisy speech" hypothesis, synthetic speech is similar to natural speech that has been degraded by the addition of noise. Therefore, the perceptual confusions that occur with
synthetic speech should be very similar to those obtained
with natural speech heard in noise. By comparison, the "impoverished speech" hypothesis claims that the acoustic-phonetic structure of synthetic speech is not as rich or redundant in segmental cues as natural speech. According to this hypothesis, two patterns of confusion errors should occur in the perception of synthetic speech. When the acoustic cues used to specify a phonetic segment are not sufficiently distinctive, confusions should occur between minimally cued segments that are phonetically similar. This error pattern should be similar to the errors predicted by the noisy speech hypothesis, since perceptual confusions of natural speech in noise also depend on the acoustic-phonetic similarity of the segments [29], [62]. However, the two hypotheses may be distinguished by the presence of a second type of error that is only predicted by the impoverished speech hypothesis. If the minimal acoustic cues used to specify phonetic contrasts are incorrect or misleading as a result of poorly specified phonetic implementation rules, then confusions should occur that are not based on the nominal acoustic-phonetic similarity of the confused segments. Instead, these confusions should be entirely determined by the listener's perceptual interpretation of the misleading cues. Thus the pattern of confusions observed with synthetic speech should be phonetically quite different from the expected ones based on the acoustic-phonetic similarity of natural speech.
To investigate the predictions made by these two hypotheses, Nusbaum, Dedina, and Pisoni [32] examined the perceptual confusions that arise within a set of 48 natural and synthetic consonant-vowel (CV) syllables. These were constructed from the vowels /i, a, u/ and the consonants /b, d, g, p, t, k, n, m, r, l, w, j, s, f, z, v/. The natural CV syllables were produced by a male talker. The synthetic syllables were generated by three different text-to-speech systems: the Votrax Type-'N'-Talk, the Speech Plus Prose-2000 v2.1, and the Digital Equipment Corporation DECtalk v1.8. To assess the pattern of perceptual confusions that occur for natural speech, the natural syllables were presented to listeners at four signal-to-noise (S/N) ratios of +28, 0, -5, and -10 dB.
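The noise manipulation itself follows a standard formula: the noise is scaled so that the ratio of signal power to noise power, expressed in decibels, equals the target S/N. Below is a minimal sketch, assuming stationary noise and a digitized signal (the original study's exact procedure may have differed).

```python
# Mixing a signal with noise at a target S/N ratio:
# scale the noise so that 10*log10(P_signal / P_noise) = snr_db.
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 10_000)
signal = np.sin(2 * np.pi * 220 * t)        # stand-in for a CV syllable
noise = rng.standard_normal(t.size)
for snr in (28, 0, -5, -10):
    mixed = mix_at_snr(signal, noise, snr)  # degraded stimulus at each S/N
```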
When averaged over the three vowel contexts, the results showed that natural speech at +28 dB S/N was the most intelligible (96.6 percent correct), followed by DECtalk (92.2 percent correct), followed by the Prose-2000 (62.8 percent correct). The Type-'N'-Talk showed the worst performance (27.0 percent correct). Of special interest were the results of more detailed error analyses, which revealed that the distributions of perceptual confusions obtained for natural and synthetic speech were often quite different. For example, in the case of DECtalk, 100 percent of the errors made in identifying the segment /r/ were due to confusions with /b/, even though this type of error never occurred for natural speech at +28 dB S/N. Even at the poorest S/N (-10 dB), where the intelligibility of natural speech in noise was actually worse than DECtalk presented without noise (29.1 percent correct versus 92.2 percent correct), this type of error accounted for only 3 percent of the total errors observed for this segment.
In order to examine the segmental confusions more precisely, we compared the confusion matrices for a particular text-to-speech system with the confusion matrices for natural speech presented at a signal-to-noise ratio that resulted in comparable overall levels of identification performance. We compared the confusion matrices for the Prose-2000 with natural speech presented at 0 dB S/N, and the confusion matrices for Votrax with natural speech presented at -10 dB S/N. An examination of the proportion of the total errors contributed by each response class (stop, nasal, liquid/glide, fricative, other) indicated that, for natural speech, most of the errors in identifying stops were due to responses that were other stop consonants. In contrast, the errors found with the Prose-2000 appeared to be more evenly distributed among stop, liquid/glide, and fricative responses. In other words, more intrusions appeared from other manner classes in the errors observed with the Prose-2000 synthetic speech than for the natural speech presented in noise. Thus the different pattern of errors obtained for the Prose-2000 and natural speech suggests that the errors produced by the Prose-2000 may be "phonetic miscues" rather than true phonetic confusions.
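The error analysis amounts to tallying the off-diagonal cells of a confusion matrix by the manner class of the response. A sketch of that tally, with hypothetical stimulus-response pairs chosen only to illustrate the bookkeeping:

```python
# Tallying consonant confusions by manner class of the response
# (stimulus-response pairs are hypothetical).
from collections import Counter

MANNER = {
    "b": "stop", "d": "stop", "g": "stop", "p": "stop", "t": "stop", "k": "stop",
    "m": "nasal", "n": "nasal",
    "r": "liquid/glide", "l": "liquid/glide", "w": "liquid/glide", "j": "liquid/glide",
    "f": "fricative", "s": "fricative", "v": "fricative", "z": "fricative",
}

pairs = [("r", "b"), ("r", "b"), ("r", "l"), ("b", "p"), ("d", "t"), ("s", "f")]

errors = Counter(MANNER[resp] for stim, resp in pairs if stim != resp)
total = sum(errors.values())
for manner, n in sorted(errors.items()):
    print(f"{manner:13s} {100 * n / total:4.0f}% of errors")
```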
The comparison between natural speech at -10 dB S/N and Votrax speech indicated that the pattern of errors in identifying stops was more similar for these conditions. Indeed, the comparison of identification errors for natural speech at 0 dB and -10 dB S/N was quite similar to the comparison between Votrax and natural speech. Thus, at least for the perception of stop consonants, the confusions of Votrax speech seem to be based on the acoustic-phonetic similarity of the confused segments, as in noisy speech. However, it should be emphasized that the overall performance level for Votrax synthetic speech was quite low to begin with. Therefore, these errors could reflect similarities that occur when performance begins to approach chance levels.
A very different pattern of results was obtained for the errors that occurred in the perception of liquids and glides. The distribution of errors for Prose-2000 speech and natural speech revealed that similar confusions were made for liquids and glides for both types of speech. However, the results were quite different for the errors made with Votrax speech and natural speech for these phonemes. For liquids and glides, the largest number of errors for Votrax speech resulted from confusions with stop consonants, while for natural speech, relatively few stop responses were observed. Thus, for liquids and glides, errors in perception of Prose-2000 speech seem to be based on acoustic-phonetic similarity, while the errors for Votrax speech seem to be phonetic miscues.
In summary, based on these confusion analyses, it should be obvious that the predictions made by the noisy speech hypothesis are simply incorrect. Two different types of errors were observed in the perception of synthetic speech. Some consonant identification errors were based on the acoustic-phonetic similarity of the confused segments. Other errors follow a pattern that can only be explained as phonetic miscues; these are errors in which the acoustic cues used in synthesis specified the wrong segment in a particular context.
C. Gating and Signal Duration
The results of the consonant-vowel confusion experiment support the conclusion that the differences in perception between natural and synthetic speech are largely the result of differences in the acoustic-phonetic properties of
the signals. More recently, we have found further support for this account using the gating paradigm [16], [47] to investigate the perception of natural and synthetic words. In an experiment carried out recently by Manous and Pisoni [25], listeners were presented with short segments of either natural or synthetic words for identification. On the first trial using a particular word, the first 50 ms of the signal was presented for identification. On subsequent trials, the signal duration was increased in 50-ms steps, so that on the next trial 100 ms of the word was presented, on the following trial 150 ms, and so on, until the entire word was presented. Manous and Pisoni found that, on the average, natural words could be identified after 67 percent of a word was heard; for synthetic words, it was necessary for listeners to hear 75 percent of a word for correct word identification. These gating results demonstrate more directly that the acoustic-phonetic structure of synthetic words conveys less information (per unit of time) than the acoustic-phonetic structure of natural speech.
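The dependent measure in such a study can be expressed as an "isolation point": the shortest gate at which the word is correctly identified, as a percentage of the word's total duration. A sketch under the assumption of 50-ms gates (the response sequence below is hypothetical):

```python
# Isolation point in a gating task: first gate with a correct
# response, as a percentage of word duration (50-ms gates).

def isolation_point(word_ms, responses, target, gate_ms=50):
    """responses[i] is the listener's answer at gate (i + 1) * gate_ms."""
    for i, answer in enumerate(responses):
        if answer == target:
            return 100.0 * (i + 1) * gate_ms / word_ms
    return 100.0  # identified only when the whole word was heard

# A 400-ms word first identified at the 300-ms gate -> 75% heard.
answers = ["?", "?", "ba", "bad", "bad", "bat", "bat", "bat"]
print(isolation_point(400, answers, "bat"))
```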
D. Conclusions: Perceptual Encoding
Taken together, our results provide strong evidence that encoding the acoustic-phonetic structure of synthetic speech is more difficult and requires more cognitive effort and capacity than encoding natural speech. One source of support for this conclusion comes from the finding that recognition of words and nonwords requires more processing time for synthetic speech compared to natural speech. This result indicates that a major source of difficulty in recognition is the extraction of phonetic information and not word recognition, since the same result was obtained for both words and nonwords. This conclusion is supported further by the findings of the CV confusion study. This experiment demonstrated significant differences in perception of the acoustic-phonetic structure of synthetic and natural speech. Synthetic speech may be viewed as a phonetically impoverished signal compared to natural speech. This was demonstrated clearly in the gating experiment using natural and synthetic speech. The results obtained from this experiment suggest that synthetic speech requires more acoustic-phonetic information than natural speech to correctly identify isolated monosyllabic words.

Taken together, the overall pattern of findings suggests that the differences in processing time between natural and synthetic speech probably lie at the processing stages involved in the extraction of basic acoustic-phonetic information from the speech waveform; that is, in the early pattern recognition process itself rather than at the more cognitive levels involved in lexical search or retrieval of words from the mental lexicon (see [43]).
The results obtained in the lexical decision and naming tasks also demonstrate that, even with relatively high levels of performance accuracy, synthetic speech requires more cognitive processing time than natural speech to recognize words presented in isolation. In these studies, however, subjects were performing relatively simple and straightforward tasks. As we noted at the outset, the specific task demands of a perceptual experiment almost always affect the speed and accuracy of a listener's response. The next series of experiments we will describe was designed to impose additional cognitive demands on subjects besides those already incurred by the acoustic-phonetic properties of synthetic speech.
Recent work on human selective attention has suggested that cognitive processes are limited by the capacity of short-term or working memory [51]. Thus any perceptual process that imposes a load on short-term memory may interfere with decision making, perceptual processing, and other subsequent cognitive operations. If perception of synthetic speech imposes a greater demand on the capacity of short-term memory than perception of natural speech, then the use of synthetic speech in applications where other complex cognitive operations are required might produce serious problems in recognition of the message.
Several years ago, Luce, Feustel, and Pisoni [24] conducted a series of experiments that were designed to determine the effects of processing synthetic speech on short-term memory capacity. In one experiment, on each trial, subjects were given two different lists of items to remember. The first list consisted of a set of digits visually presented on a CRT screen. On some trials, no digits were presented; on other trials, either three or six digits were presented in the visual display. Following the visual list, subjects were presented with a spoken list of ten natural words or ten synthetic words. After the spoken list was presented, the subjects were instructed to write down all the visual digits in the order of presentation and all the words they could remember from the auditory list.

Across all three visual conditions (no list, three, or six digits), recall of the natural words was significantly better than recall of the synthetic words. In addition, recall of both the synthetic and natural words became worse as the size of the digit lists increased. In other words, increasing the number of digits held in short-term memory impaired recall of the spoken words. But the most important finding was the interaction between the type of speech presented (synthetic versus natural) and the number of digits presented (three versus six). This interaction was revealed by the number of subjects who could recall all the digits presented in correct order. As the size of the digit lists increased, significantly fewer subjects were able to recall all the digits for the synthetic words compared to the natural words. Thus perception of the synthetic speech impaired recall of the visually presented digits more, with increasing digit list size, than did natural speech. These results demonstrate that synthetic speech requires more short-term memory capacity than natural speech. As a result, it would be expected that synthetic speech should interfere much more with other cognitive processes because it imposes greater capacity demands on the human information processing system than natural speech.
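The interaction can be seen in a simple cross-tabulation of recall by speech type and digit preload. The cell values below are hypothetical, arranged only to mirror the direction of the effect reported by Luce et al. [24]:

```python
# Recall by speech type and visual digit preload (hypothetical
# values); the natural-synthetic gap widens with memory load.
recall = {  # (speech type, digits preloaded) -> proportion recalled
    ("natural", 0): 0.55, ("natural", 3): 0.50, ("natural", 6): 0.44,
    ("synthetic", 0): 0.47, ("synthetic", 3): 0.40, ("synthetic", 6): 0.30,
}
for load in (0, 3, 6):
    gap = recall[("natural", load)] - recall[("synthetic", load)]
    print(f"digit load {load}: natural minus synthetic recall = {gap:.2f}")
```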
To test this prediction, Luce et al. [24] carried out another experiment in which subjects were presented with lists of ten words to be memorized. The lists were either all synthetic or all natural words. The subjects were required to recall the words in the same order as the original presentation. As in the previous experiment, the natural words were recalled better overall than the synthetic words. However, a more detailed analysis revealed an interaction in recall
performance depending on the position of items in the list. The first synthetic words heard in the list were recalled much less accurately than the natural words at the beginning of the lists. This result demonstrated that, in the synthetic lists, the words heard later in each list interfered with active rehearsal of the words heard earlier in the list. This is precisely the result that would be expected if the perceptual encoding of the synthetic words placed greater processing demands on short-term memory [46].
The data on serial-ordered recall of lists of natural and synthetic speech support the conclusion from the lexical decision research that the processing of synthetic speech requires more effort than the perception of natural speech. The perceptual encoding of synthetic speech requires more cognitive capacity and may, in turn, affect other cognitive processes that require active attentional resources. Previous research on capacity limitations in speech perception demonstrated that paying attention to one spoken message seriously impairs the listener's ability to detect specific words in other spoken messages (e.g., [6], [57]). Moreover, several recent experiments have shown that attending to one message significantly impairs phoneme recognition in a second stream of speech [31]. Taken together, these studies indicate that speech perception requires active attention and cognitive capacity, even at the level of encoding phonemes. As a result, increased processing demands for encoding synthetic speech may place important perceptual and cognitive limitations on the use of voice response systems in high information load conditions or severe environments. This would be especially true in cases where a listener is expected to pay attention to several different sources of information at the same time.
VI. TRAINING AND EXPERIENCE WITH SYNTHETIC SPEECH
The human observer is a very flexible processor of information. With sufficient experience, practice, and specialized training, observers are able to overcome some of the limitations on performance we have observed in our previous studies. Indeed, several researchers (e.g., [7], [44]) have reported a rapid improvement in recognition of synthetic speech during the course of their experiments. These improvements appear to be the result of subjects learning to process the acoustic-phonetic structure of synthetic speech more effectively. However, it is also possible that the reported improvements in intelligibility of synthetic speech were simply due to an increased familiarity with the experimental procedures rather than a real improvement in the perceptual processing of the synthetic speech. In order to test these alternatives, Schwab, Nusbaum, and Pisoni [49] carried out an experiment to separate the effects of training on task performance from improvements in the recognition of synthetic speech.
Three groups of subjects were given a pre-test with synthetic speech on Day 1 and a post-test with synthetic speech on Day 10 of the experiment. The pre-test established baseline performance for the Votrax Type-'N'-Talk text-to-speech system; the post-test on Day 10 was used to determine if any improvements had occurred in recognition of the synthetic speech after training. The low-cost Votrax system was used primarily because of the poor quality of its segmental synthesis. Thus ceiling effects in performance would not obscure any effects of training, and there would be room for improvement to occur during the course of the experiment.
The three groups of subjects were treated differently on Days 2-9 One group received training with Votrax syn- thetic speech One group was trained with natural speech using the same words, sentences, and paragraphs as the group trained on synthetic speech This second group served
t o control for familiarity with the specific experimental tasks Finally, a third group received no training at all on Days 2-9
O n the pre-test and post-test days, the subjects were given the MRT, isolated phonetically balanced (PB) words, and sentences for transcription The word lists were taken from PB lists; the sentences consisted of both meaningful and semantically anomalous sentences used in our earlier work Subjects were given different materials to listen to on every day of the experiment During all the training sessions (i.e., Days 2-9), subjects were presented with spoken words and sentences, and received feedback indicating the iden- tity of the stimulus presented on each trial
The results showed that performance improved dramati- cally for only one group-the subjects that were trained with the Votrax synthetic speech At the end of training, the Votrax-trained group showed significantly higher levels of performance than either of the other two groups To take one example, performance in identifying isolated PB words improved for the Votrax-trained group from about 25 per- cent correct on the pre-test to almost 70-percent correct word recognition on the post-test Similar improvements were found for all the word identification tasks
The results of this training study suggest several important conclusions. First, the effects of training appear to be related to improving or modifying the encoding process used to recognize words. Clearly, subjects were not simply learning to perform the various tasks better, since the subjects trained on natural speech showed little or no improvement in performance. Moreover, training affected performance similarly with isolated words and words in sentences, and for closed- and open-response sets. The pattern of results strongly suggests that subjects in the group trained on synthetic speech were not memorizing individual test items, nor were they learning special strategies; that is, they did not learn to use linguistic knowledge or task constraints to improve their recognition performance. Rather, subjects learned something about the structural characteristics of this particular text-to-speech system that enabled them to perform better regardless of the task. This conclusion is further supported by the design of the training study. Improvements in performance were obtained with novel materials even though the subjects never heard the same words or sentences more than once during the entire experiment. In order to show improvements in performance, subjects must have acquired detailed information and knowledge about the rule system used to generate the synthetic speech. They could not have shown improvements in performance on the post-test if they had simply memorized individual words or sentences, since novel materials were used in this test too.
In addition to these findings, we also found that subjects retained the training even after six months with no further contact with the synthetic speech. Thus, it appears that
training produced a relatively stable and long-term change in the perceptual encoding processes used by subjects. Furthermore, it is likely that more extensive training would have produced even greater persistence of the training effects. If subjects had been trained to asymptotic levels of performance, the long-term effects of training might have been even more stable. The results of this study demonstrate that human listeners can modify their perceptual strategies in encoding synthetic speech and that substantial increases in performance can be realized in relatively short periods of time, even with poor-quality synthetic speech.
VII. DIRECTIONS FOR FUTURE RESEARCH

A. Research on Comprehension
Most of the research on synthetic speech produced by text-to-speech systems has, in the past, been concerned with the acoustic-phonetic output generated by these systems (see, however, [21], [MI], [49]). Researchers have focused attention on improving the segmental intelligibility of synthetic speech. At this point in time, the available perceptual data suggest that segmental intelligibility is quite good for some systems (DECtalk, Prose, Infovox) and, while it is not at the same level as natural speech, it may take a great deal of additional effort to achieve relatively small further gains. On the other hand, little research effort has been directed towards assessing listening comprehension in a more general sense. Our initial efforts used relatively gross and insensitive measures of comprehension, even though these measures revealed small but reliable differences in comprehension performance between natural and synthetic speech.
Additional research is needed to understand the precise role that practice and familiarity play in comprehension and understanding. As we noted earlier, performance in comprehension tasks improves with experience in listening to the synthetic speech. Additional research should be carried out on the nature of practice and familiarity effects and on how the subject's criteria and perceptual strategies are modified after listening to synthetic speech. There are still many questions to be answered: How much practice does a listener need? Does performance with synthetic speech reach the same levels as with natural speech? Does training reduce the capacity demands imposed by synthetic speech? These questions need to be studied in carefully designed laboratory experiments using more sophisticated and sensitive measures of perception and comprehension.
B. On-Line Measures of Linguistic Processing
In order to understand the moment-to-moment demands that occur while listening to fluent synthetic speech, we will need to use on-line measures that tap the real-time computational processes used by listeners to perceive and comprehend fluent speech. The use of phoneme- and word-monitoring tasks, which require listeners to respond while processing the speech input, may provide some insight into the covert processes listeners use to understand synthetic speech (see [11]). Other psycholinguistic tasks, such as mispronunciation detection [9], may also be useful. In these tasks, response latencies are used to measure cognitive processing.
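As a rough illustration of how latencies from a word-monitoring task could be summarized, the sketch below compares mean detection times for natural and synthetic passages. The trial structure and all of the numbers are invented for illustration; they are not data from any of the studies cited above.

# Hypothetical word-monitoring trials: each tuple records the
# passage condition, whether the target word was detected, and
# the response latency in milliseconds (None for misses).
trials = [
    ("natural",   True,  412), ("natural",   True,  389),
    ("natural",   False, None), ("synthetic", True,  534),
    ("synthetic", True,  598), ("synthetic", False, None),
]

def mean_latency(condition):
    # Average latency over correct detections in one condition;
    # misses carry no latency and are excluded from the mean.
    lats = [ms for cond, hit, ms in trials if cond == condition and hit]
    return sum(lats) / len(lats) if lats else float("nan")

for cond in ("natural", "synthetic"):
    print(f"{cond}: mean detection latency {mean_latency(cond):.0f} ms")

On this logic, systematically longer detection latencies for synthetic passages would point to a greater moment-to-moment processing load, even when accuracy is comparable.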
C. Habituation and Attention to Synthetic Speech
When listening to long passages of synthetic speech, one often experiences difficulty in maintaining focused attention on the linguistic content of the passage. While the results obtained in our listening comprehension tests indicated that subjects did, indeed, comprehend these passages quite well, we do not have any evidence that subjects paid full attention to the passages (see also [21]). We also have subjective reports from other experiments suggesting that subjects are “tuning in” and “fading out” as they listen to long passages of synthetic speech. Is synthetic speech more fatiguing to listen to than natural speech? Can a listener fully comprehend a passage when only part of the passage is processed? How is the listener's attention allocated in listening to synthetic speech compared, for example, to natural speech, and how does it change with other demands on processing capacity in short-term memory? These are all important questions that await further study.
D. Subjective Evaluation and Listener Preference
In addition to the quality of the synthetic speech signal itself, another consideration in the evaluation of synthetic speech concerns the user's preferences and biases. If an individual using a particular text-to-speech system cannot tolerate the sound of the speech or does not trust the information provided by the voice output device, the usefulness of this technology will be reduced. With this goal in mind, we have developed a questionnaire to assess subjective ratings of synthetic speech [36]. Some preliminary data have been collected using various types of stimulus materials and several synthesis systems. In general, we have found that listeners' subjective evaluations of their performance correlate well with objective measures of performance. We have also found that the degree to which subjects are willing to trust the information provided by synthetic speech is positively correlated with objective measures of performance. For the naive user, poor performance predicts low levels of belief in the messages, whereas high levels of accuracy predict a greater degree of confidence.
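The correlational claim here is straightforward to compute. The sketch below calculates a Pearson product-moment correlation between per-listener subjective ratings and objective recognition accuracy; the paired observations are invented for illustration and are not data collected with the questionnaire in [36].

import math

# Hypothetical per-listener data: a mean subjective rating on a
# 1-7 scale paired with word-recognition accuracy in percent.
ratings  = [2.1, 3.4, 4.0, 4.8, 5.5, 6.2]
accuracy = [31,  45,  52,  61,  74,  83]

def pearson_r(x, y):
    # Pearson product-moment correlation between two sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"r = {pearson_r(ratings, accuracy):.2f}")

A value of r near +1 for data like these is what the positive relation between subjective confidence and objective performance described above would look like in practice.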
E. Research on the Applications of Voice Output Technology
Finally, there are many unanswered questions related to the use of voice response systems in real-world applications. Additional research is needed on the use of synthetic speech in settings where it is already being used or could be used. Except for a few studies reporting the use of synthetic speech in military, business, and industrial settings (see, for example, [17], [52], [56], [61]), most of the reports concerning the use of synthetic speech describe a new or novel application but do not evaluate the usefulness, success, or failure of the application.
VIII. SUMMARY AND CONCLUSIONS

Evaluating the use of voice response systems employing synthetic speech is not just a matter of conducting standardized intelligibility tests. Different applications will impose different demands and constraints on observers. Thus