Perception of Synthetic Speech Generated by Rule

Invited Paper
As the use of voice response systems employing synthetic speech becomes more widespread in consumer products, industrial and military applications, and aids for the handicapped, it will be necessary to develop reliable methods of comparing different synthesis systems and of assessing how human observers perceive and respond to the speech generated by these systems. The selection of a specific voice response system for a particular application depends on a wide variety of factors, only one of which is the inherent intelligibility of the speech generated by the synthesis routines. In this paper, we describe the results of several studies that applied measures of phoneme intelligibility, word recognition, and comprehension to assess the perception of synthetic speech. Several techniques were used to compare performance of different synthesis systems with natural speech and to learn more about how humans perceive synthetic speech generated by rule. Our findings suggest that the perception of synthetic speech depends on an interaction of several factors including the acoustic-phonetic properties of the speech signal, the requirements of the perceptual task, and the previous experience of the listener. Differences in perception between natural speech and high-quality synthetic speech appear to be related to the redundancy of the acoustic-phonetic information encoded in the speech signal.
In the not too distant past, voice output systems could be classified into two broad categories depending on the nature of the synthesis process. Speech-coding systems used a fixed set of parameters to reproduce a relatively limited vocabulary of utterances. These systems produced intelligible and acceptable speech [58], [59] at the cost of flexibility in terms of the range of utterances that could be produced. In contrast, synthetic speech produced by rule provided less intelligible and less natural sounding speech, but these systems had the capability of automatically converting unrestricted text in ASCII format into speech [2], [3]. Over the last few years, significant improvements in text-to-speech systems have begun to eliminate the advantages of simple coded-speech voice response systems over text-to-speech systems. Extensive research on improving the letter-to-sound rules and phonetic implementation rules used by these systems, as well as the techniques of diphone and demisyllable synthesis in text-to-speech systems, suggests that, in the near future, unrestricted text-to-speech voice response devices may produce highly intelligible and very natural sounding synthetic speech [23].

Manuscript received January 15, 1985; revised July 3, 1985. This research was supported, in part, under NIH Grant NS-12179 and, in part, under Contract AF-F 33615-83-K-0501 with the Air Force Systems Command, AFOSR, through the Aerospace Medical Research Laboratory, Wright-Patterson AFB, Ohio. Requests for reprints should be sent to the authors at the address below.

The authors are with the Speech Research Laboratory, Department of Psychology, Indiana University, Bloomington, IN 47405, USA.
As the quality of the speech generated by text-to-speech systems improves, it becomes necessary to be able to evaluate and compare the performance of different synthesis systems. The need for a systematic and reliable assessment of the capabilities of voice response devices becomes even more critical as the complexity of these systems increases. This is especially true when considering the advanced features that are now being offered by some of the newest systems such as DECtalk, Prose-2000, and Infovox, which provide capabilities for synthesis of different languages and generation of several different synthetic voices. It is also important, in its own right, to learn more about how human listeners perceive and understand synthetic speech and how performance with synthetic speech differs from natural speech.

If there existed a set of algorithms or a set of acoustic criteria that could be applied automatically to measure the quality of synthetic speech, there would be no question about describing the performance of a particular system or the effectiveness of new rules or synthesis techniques. Standards could be developed fairly easily and applied uniformly. Unfortunately, there is no existing method for automating the assessment of synthetic speech quality. The ultimate test of synthetic speech involves assessment and perceptual response by the human listener. Thus it is necessary to employ perceptual tests of synthetic speech under the conditions in which synthetic speech will be used. The perception of speech depends on the human listener as much as it does on the attributes of the acoustic signal itself and the system of rules used to generate the signal [42].
Although it is clear that the performance of systems that generate synthetic speech must be evaluated using objec-
tive perceptual tests, there have been only a handful of studies dealing with the intelligibility of synthetic speech over the years (e.g., [19], [38], [44]). And there have been even fewer discussions of the technical issues that surround the problem of measuring the intelligibility of synthetic speech (see [35], [a], [42]). Furthermore, it is important to
specify precisely which aspects of synthetic speech are being evaluated. On one hand, the perception and comprehension of synthetic speech can be measured using a variety of objective behavioral tests that provide precise and statistically reliable estimates of the performance of a particular voice response system in a specific condition. These tests investigate the transmission of linguistic information from the speech signal to the listener and address specific questions such as: 1) how accurately are synthetic phonemes and words recognized, 2) how well is the meaning of a synthetic utterance understood, and 3) how easy is it to perceive and understand synthetic speech. On the other hand, an equally important issue concerns the acceptability and naturalness of synthetic speech, and whether the listener prefers one type of speech output over another. Questions of listener preference cannot be addressed directly using objective performance measures such as the proportion of words correctly recognized or response latencies, but instead must be investigated more indirectly by asking the listener for his or her subjective impressions of the quality of synthetic speech, using questions designed to assess different dimensions of naturalness, acceptability, and preference [36].
In the Speech Research Laboratory at Indiana University, we have carried out a large number of studies over the last five years to learn more about the perception of synthetic speech generated automatically by rule using several text-to-speech systems (see [34], [37], [42], [44]). Strictly speaking, this work is not human factors research; that is, it is not designed to answer specific questions regarding the development and use of specific products or techniques. Rather, the goal of our research has been to provide more basic knowledge about the perception of synthetic speech under well-defined laboratory conditions. These research findings can then serve as a benchmark for subsequent human factors studies that may be motivated by more specific problems of using voice response systems for a particular application. In general, our research has focused on measuring the performance of human listeners who are required to perceive and respond to synthetic speech under a variety of task demands and experimental conditions. During the course of this work, we have also carried out several comparisons of the performance of human listeners on standardized perceptual tasks using synthetic speech generated by rule with a number of text-to-speech systems.
II. CONSTRAINTS ON PERFORMANCE OF HUMAN OBSERVERS
To provide a framework for interpreting the results of our research, we first consider a number of factors that are known to affect an observer's performance: 1) the specific demands imposed by a particular task, 2) the inherent limitations of the human information processing system, 3) the experience and training of the human listener, 4) the linguistic structure of the message set, and 5) the structure and quality of the speech signal.
A. Task Complexity
The first factor that constrains performance concerns the complexity of the tasks that engage an observer during the perception of speech. In some tasks, the response demands are relatively simple, such as deciding which of two known words was said. Other tasks are extremely complex, such as trying to recognize an unknown utterance from a virtually unlimited number of response alternatives while engaging in an activity that already requires attention. There is a substantial amount of research in the cognitive psychology and human factors literature demonstrating the powerful effects of perceptual set, instructions, subjective expectancies, cognitive load, and response set on performance in a variety of perceptual and cognitive tasks [63]. The amount of context and the degree of uncertainty in the task also affect an observer's performance in substantial ways [22]. Thus it is necessary to understand the requirements and demands of a particular task before drawing any strong inferences about an observer's behavior or performance.
B. Limitations on the Observer
The second factor influencing recognition of synthetic speech concerns the structural limitations on the human information processing system's ability to perceive, encode, store, and retrieve information. Because the nervous system cannot maintain all aspects of sensory stimulation (and therefore must integrate acoustic energy over time), very severe processing limitations have been found in the human observer's capacity to encode and store raw sensory data in memory. To overcome these capacity limitations, the listener must rapidly transform sensory input into more abstract neural codes for more stable storage in memory and subsequent processing operations. The bulk of the research in perception and cognitive processes over the last 25 years has identified human short-term memory (STM) as a major limitation on processing sensory input [50]. The amount of information that can be processed in and out of STM is severely limited by the listener's attentional state, past experience, and the quality of the original sensory input.
C. Experience and Training
The third factor concerns the ability of human observers to quickly learn effective cognitive and perceptual strategies to improve performance in almost any task. When given appropriate feedback and training, subjects can learn to classify novel stimuli, remember complex pattern sequences, and respond to rapidly changing stimulus patterns in different sensory modalities. Clearly, the flexibility of subjects in adapting to the specific demands of a task is an important constraint that must be considered and controlled in any attempt to evaluate the perception of synthetic speech by the human observer.
D. Message Set
The fourth factor relates to the structure of the message set; that is, the constraints on the number of possible messages and the organization and linguistic properties of the message set. A message set may consist of words that
are distinguished only by a single phoneme or may consist of words and phrases with very different lengths, stress patterns, and phonotactic structures. Use of this constraint by listeners depends on linguistic knowledge [27]. The choice and arrangement of speech sounds into words is constrained by the phonological rules of language; the arrangement of words in sentences is constrained by syntax; and finally, the meaning of individual words and the overall meaning of sentences in a text is constrained by the semantics and pragmatics of language. The contribution of these various levels of linguistic structure to perception will vary substantially from isolated words, to sentences, to passages of fluent continuous speech.
E. Signal Characteristics
The fifth factor refers to the acoustic-phonetic and prosodic structure of a synthetic utterance. This constraint refers to the veridicality of the acoustic properties of the synthetic speech signal compared to naturally produced speech. Speech signals may be thought of as the physical realization of a complex and hierarchically organized system of linguistic rules that map sounds onto meanings and meanings back onto sounds. At the lowest level in the system, the distinctive properties of the speech signal are constrained in substantial ways by vocal tract acoustics and articulation. The acoustic-phonetic structure of natural speech reflects these physical and contextual constraints; synthetic speech is an impoverished signal representing phonetic distinctions with only a limited subset of the acoustic properties used to convey phonetic information in natural speech. Furthermore, the acoustic properties used to represent segmental structure in synthetic speech are highly stylized and are insensitive to phonetic context when compared to natural speech.
There are basically three areas in which a text-to-speech system could produce errors that would impact the overall intelligibility of the speech: 1) the spelling-to-sound rules, 2) the computation and production of suprasegmental information, and 3) the phonetic implementation rules that convert the internal representation of phonemes and/or allophones into a speech waveform [2], [4]. In our previous research, we have found that phonetic implementation rules are a major factor in determining the segmental intelligibility of a voice response system [33]. In the perceptual studies described below, we have focused most of our attention on measures of segmental intelligibility, assuming that the letter-to-sound rules used by a particular text-to-speech system were applied correctly.
A. Phoneme Intelligibility
The task that has been used most often in previous studies evaluating synthetic speech, and is now accepted as the de facto standard measure of the segmental intelligibility of synthetic speech, is the Modified Rhyme Test ([14], [38]; however, see [a] for a different opinion). In the Modified Rhyme Test (MRT), subjects are required to identify a single word by choosing one of six alternative response words differing by a single phoneme in either initial or final position [18]. All the stimuli in the MRT are consonant-vowel-consonant (CVC) monosyllabic words; on half the trials, the responses share the vowel-consonant portion of the stimulus and, on the other half, the responses share the consonant-vowel portion. Thus the MRT provides a measure of the performance of listeners in identifying either the initial or final phoneme of a set of spoken words.
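To make the scoring concrete, the sketch below shows how closed-response MRT trials reduce to a simple proportion-correct computation. This is our illustration only: the trial records and word sets are hypothetical stand-ins, not the actual test materials of [18].

```python
# Illustrative scoring of closed-response MRT trials (hypothetical
# items, not the actual House et al. [18] word lists).

# Each trial: the word presented, its six response alternatives
# (sharing either the initial CV or the final VC), and the choice.
trials = [
    {"stimulus": "pin",  "alternatives": ["pin", "sin", "tin", "fin", "din", "win"],  "response": "pin"},
    {"stimulus": "lake", "alternatives": ["lake", "late", "lane", "lace", "lame", "laid"], "response": "late"},
    {"stimulus": "bat",  "alternatives": ["bat", "bad", "back", "ban", "bass", "bath"], "response": "bat"},
]

n_correct = sum(t["response"] == t["stimulus"] for t in trials)
print(f"Closed-response MRT: {100.0 * n_correct / len(trials):.1f}% correct")
```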
To date, we have evaluated natural speech and synthetic speech produced by five different text-to-speech systems: the Votrax Type-'N'-Talk, the Speech Plus Prose-2000, the MITalk-79 research system, Infovox, and DECtalk (see [15]). The major findings are summarized in Table 1.
Table 1  Percent Correct Performance Obtained for Modified Rhyme Test (MRT) Experiments Conducted at the Speech Research Laboratory (1979-1984)

System Tested (date)                      MRT Closed   MRT Open
Natural Speech* (11/79)                   99.4         97.2
MITalk-79 Research System* (6/79-9/79)    93.1         75.4
Prototype Prose-2000** (12/79)            87.6
Votrax Type-'N'-Talk*** (3/82)            67.2
DECtalk Paul v1.8+ (3/84)                 96.7         86.7
DECtalk Betty v1.8++ (8/84)               94.4         82.5
Current Working Prose+++ (9/84)
Prose-2000 v3.0 (3/85)                    94.3
Infovox (3/85)                            87.4

*Pisoni and Hunnicutt [44]
**Bernstein and Pisoni [5]
***Pisoni and Koen [45]
+Greene, Manous, and Pisoni [15]
++Greene, Manous, and Pisoni, 1984 (final report)
+++Manous, Greene, and Pisoni, 1984
Performance in the MRT for natural speech was the best at 99.4 percent correct. For DECtalk v1.8, we evaluated speech produced by "Paul" and "Betty," two of DECtalk's nine voices, and found different levels of performance for these voices: 96.7 percent of the words spoken by the Paul voice were identified correctly, while only 94.4 percent of Betty's words were identified correctly. The level of performance observed for the Paul voice comes the closest to natural speech and is considerably higher than performance for any of the other text-to-speech systems we have studied.

Performance on MITalk-produced speech was somewhat lower than either of the DECtalk v1.8 voices at 93.1 percent correct word identification. The prototype of the Prose-2000 produced speech that was identified at 87.6 percent correct; version 3.0 of the Prose-2000 has improved, with performance at 94.3 percent correct. The Infovox multilingual system produced English speech that was identified at 87.4 percent correct. Finally, the least intelligible synthetic speech was produced by the Votrax Type-'N'-Talk, with only 67.2 percent correct identification. These results, obtained under closely matched testing conditions in the same laboratory environment, show a wide range of variation
among currently available text-to-speech systems. In our view, these differences in performance directly reflect the amount of basic research that was carried out to develop the phonetic implementation rules of these different voice response systems.
In addition to the standard closed-response MRT, we have also used an open-response format version of the MRT. In this procedure, listeners are instructed to write down the word that they heard on each trial. This open-response format provides a measure of performance when constraints on the response set are minimized (all CVC words known to the listener, compared to the six alternative responses in the closed-response version). This procedure also provides information about the intelligibility of vowels that is not available in the closed-response set version of the MRT. A comparison of the closed- and open-response versions of the MRT for synthetic speech produced by different text-to-speech systems with natural speech indicates the degree to which listeners rely on response-set constraints. Some representative findings using the open-response MRT format are also shown in Table 1.
Performance on the open-response set MRT for natural speech was at 97.2 percent correct exact word identification, compared to 99.4 percent correct in the closed-response set task. Even when there are no strong constraints on the number of alternative responses for natural speech, performance is still better than for any of the text-to-speech systems with a constrained set of responses. For the MITalk-79 research system, performance in the open-set MRT task is, however, considerably worse at 75.4 percent correct. Similarly, DECtalk's Paul voice was identified at the 86.7 percent level; correct word identification for "Betty" was 82.5 percent correct. These results show a large interaction between intelligibility measured in the closed-response format MRT and the open-response format MRT. Although the rank ordering of intelligibility remains the same across the two forms of the MRT, it is clear that, as speech becomes less intelligible, listeners rely more heavily on response-set constraints to aid performance.
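The size of this interaction can be read directly off Table 1. The short computation below, using the closed- and open-response scores reported above, shows the closed-open gap widening as overall intelligibility falls; this gap serves as a rough index of reliance on response-set constraints.

```python
# Closed- vs open-response MRT scores from Table 1; the widening
# gap indexes reliance on response-set constraints.
scores = {  # system -> (closed %, open %)
    "Natural speech":     (99.4, 97.2),
    "DECtalk Paul v1.8":  (96.7, 86.7),
    "DECtalk Betty v1.8": (94.4, 82.5),
    "MITalk-79":          (93.1, 75.4),
}
for system, (closed, open_) in scores.items():
    print(f"{system:18s} closed-open gap: {closed - open_:4.1f} points")
```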
B. Word Recognition in Sentences
To examine the contribution of several linguistic constraints on performance, we compared word recognition in two types of sentence contexts. The first type of sentence context was syntactically correct and meaningful: the Harvard psychoacoustic sentences [13]. An example is given in (1) below:

Add salt before you fry the egg. (1)
The second type of sentence context was syntactically correct, but these sentences were semantically anomalous: the Haskins syntactic sentences [38]. These test sentences had the syntactic form of normal sentences, but they were nonsense. An example of this type of nonsense sentence is given in (2) below:

The old farm cost the blood. (2)
By comparing word recognition performance for these two classes of sentences, it was possible to determine the influence of sentence meaning and linguistic constraints on word recognition [15]. Table 2 shows percent correct word identification for meaningful and semantically anomalous sentences for natural speech and synthetic speech produced by MITalk-79, the Speech Plus Prose-2000 prototype, and DECtalk's Paul and Betty voices (v1.8).
Table 2  Percent Correct Word Recognition for Meaningful and Semantically Anomalous Sentence Contexts

Type of Speech                  Meaningful (%)   Anomalous (%)
Natural                         99.2             97.7
MITalk-79                       93.3             78.7
Prototype Prose-2000            83.7             64.5
DEC Paul v1.8                   95.3             86.8
DEC Betty v1.8                  90.5             75.1
Current Working Prose (9/84)    91.0
For natural and synthetic speech, word recognition was much better in meaningful sentences than in the semantically anomalous sentences. Furthermore, a comparison of correct word identification in these sentences reveals an interaction in performance, suggesting that listeners rely much more on semantic constraints as the speech becomes progressively less intelligible.
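For readers who want the scoring rule made explicit, the sketch below shows one plausible way of scoring keyword recognition in sentence transcriptions. It is our illustration: the exact-match criterion and the example transcriptions are assumptions, not necessarily the procedure used in the original studies.

```python
# Illustrative keyword scoring for sentence transcription
# (exact-match criterion and responses are hypothetical).

def keyword_score(keywords, transcription):
    """Proportion of target keywords reproduced in the transcription."""
    produced = set(transcription.lower().split())
    return sum(w in produced for w in keywords) / len(keywords)

meaningful = (["add", "salt", "fry", "egg"], "add salt before you fry the egg")
anomalous = (["old", "farm", "cost", "blood"], "the old farm caught the blood")

for label, (keys, heard) in (("meaningful", meaningful), ("anomalous", anomalous)):
    print(f"{label}: {100 * keyword_score(keys, heard):.0f}% keywords correct")
```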
C. Listening Comprehension
Spoken language understanding is a very complex cognitive process that involves the encoding of sensory information, retrieval of previously stored knowledge from long-term memory, and the subsequent interpretation and integration of various sources of knowledge available to a listener [26], [39]. Language comprehension, therefore, depends on a relatively large number of diverse and complex factors, many of which are still only poorly understood by cognitive psychologists at the present time. Measuring comprehension is difficult because of the interaction of several different knowledge sources in the comprehension process. This problem is made worse because there is no coherent theoretical model of language comprehension to guide the development of measurement procedures. Moreover, there are presently no theoretical models that can deal with the diverse strategies employed by listeners to mediate language understanding under a wide variety of listening conditions and task demands.
One of the factors that obviously plays an important role in listening comprehension is the quality of the initial input signal; that is, the intelligibility of the speech itself. But the acoustic-phonetic properties of the input signal are only one source of information used by listeners in speech perception and spoken language understanding. As we have seen from the results summarized in the previous sections, additional consideration must also be given to the contribution of higher levels of linguistic knowledge to perception and comprehension.

In our initial attempts to measure comprehension of synthetic speech, we wanted to obtain a gross estimate of how well listeners could understand the linguistic content of continuous, fluent speech produced by the MITalk-79 text-to-speech system (see [a], [44]). As far as we have been able to determine, little attention has actually been devoted to the problems surrounding comprehension of the linguistic content of synthetically produced speech, particularly passages of meaningful, fluent, continuous speech [21], [a].
To assess comprehension, we selected fifteen narrative
passages and an appropriate set of multiple-choice test
questions from several standardized adult reading comprehension tests [10], [20], [30], [55]. The passages were quite diverse, covering a wide range of topics, writing styles, and vocabulary. These passages were also selected to be interesting for subjects to listen to in the context of laboratory-based tests designed to assess language understanding. Since these test passages were chosen from several different types of reading tests, they varied in difficulty and style. This variation permitted us to evaluate the contribution of all of the individual components of a particular text-to-speech system to comprehension in one relatively gross measure. We assumed that the results of these comprehension tests would provide an initial benchmark against which the entire text-to-speech system could be evaluated with materials that would be comparable to those used in a more realistic application such as a reading machine for the blind or a database information retrieval system [1], [2].
In our initial study, we tested three groups of naive subjects with 20 subjects in each group [44]. One group of subjects listened to MITalk-79 versions of the passages, another listened to natural speech, while a third group read the passages silently. All three groups answered the same set of test questions immediately after each passage. In a subsequent study [5], a group of subjects listened to the prototype of the Speech Plus Prose-2000 (Speech Plus was then known as Telesensory Systems, Inc.). The same prose passages were used in this study as in the original study.
The comprehension results for all groups are summarized in Table 3.

Table 3  Percent Correct Performance on the Comprehension Tests [5], [a], [44]

                      1st Half       2nd Half       Total
                      (6 passages)   (6 passages)   (13 passages)
                      (%)            (%)            (%)
MITalk-79             64.1           74.8           70.3
Natural Speech        65.6           68.5           67.8
Prose-2000 (TSI)      60.9           67.3           65.2
Reading               76.1           77.2           77.2

Averaged over the last thirteen test passages, the
reading group showed a significant 7 percent advantage (p < 0.05) over the synthetic speech group and a 12 percent advantage over the Prose (TSI) group (p < 0.001). However, the differences in performance between the groups appeared to be localized primarily in the first half of the test. By the second half, performance for the groups listening to synthetic speech improved substantially, whereas performance for the reading group remained about the same. Although the scores for the natural speech group were slightly lower overall, no improvement was observed in their performance from the first half to the second half of the test.
The finding of improved performance in the second half of the test for subjects listening to synthetic speech is consistent with the earlier results from word recognition in sentences, which showed that recognition performance improves for synthetic speech after only a short period of exposure. These results suggest that the overall difference in performance between the groups is probably due to familiarity with the output of the synthesizer and not due to any inherent difference in the basic strategies used in comprehending or understanding the content of these passages.
D. Conclusions: Intelligibility and Comprehension
The results of the Modified Rhyme Test revealed relatively high levels of segmental intelligibility for speech generated by MITalk-79, Prose-2000, Infovox, and DECtalk. The results for the Votrax Type-'N'-Talk using this measure showed much lower levels of performance. The progression from MITalk-77 (the forerunner of the Prose-2000), to the MITalk-79 research system, to Infovox, to the Prose-2000, and finally to DECtalk shows the continual refinement of speech synthesis technology. With additional research and development, the speech generated by these high-quality text-to-speech systems may soon approach the almost perfect levels of intelligibility observed for natural speech under laboratory testing conditions.
The results from the two sentence tasks indicated that context is a powerful aid to recognition. When both semantic and syntactic information was available to subjects, higher levels of performance were obtained in recognizing words in sentences. However, when the use of semantic knowledge was reduced or eliminated, as in the Haskins sentences, subjects had, of necessity, to rely primarily on the acoustic-phonetic information in the signal and their knowledge of morphology. Clearly, the contribution of higher level sources of knowledge is responsible for the superior performance obtained on the Harvard sentences; in the absence of this knowledge, subjects' performance was considerably poorer.
Finally, the results of the listening comprehension tests reveal that subjects are able to correctly answer multiple-choice comprehension questions about the content of passages of fluent connected synthetic speech. After only a few minutes of exposure to the output of a speech synthesizer, comprehension improves substantially and eventually approximates the levels observed when subjects read the same passages of text or listen to naturally produced versions of the same materials. There are, however, a number of problems in measuring comprehension with the materials we have used. First, these materials were designed to measure reading comprehension, not listening comprehension. Thus for these tests, a reader was expected to re-read the material in order to answer some of the questions. The reader always has access to the passage; the listener cannot go back and hear some portion of the passage again. Second, the multiple-choice questions do not directly assess the perceptual processes used to encode the speech input. Moreover, these questions measure comprehension after the materials have been presented, thereby reflecting post-perceptual comprehension strategies and subject biases. Thus multiple-choice questions are not measures of the on-line, real-time cognitive processes used in comprehension but reflect the final product of comprehension.
IV. PERCEPTUAL ENCODING OF SYNTHETIC SPEECH
The results of the MRT and word identification studies of natural and synthetic speech clearly indicate that synthetic speech is less intelligible than natural speech. In addition, these studies demonstrate that, as synthetic speech becomes less intelligible, listeners rely more on linguistic
knowledge and response-set constraints to aid word identification. However, these studies do not account for the differences in perception between natural and synthetic speech; rather, they just demonstrate and describe some of these basic differences.
A. Lexical Decision and Naming Latencies
In order to begin to investigate differences in the perceptual processing of natural and synthetic speech, we carried out a series of experiments designed to measure the time listeners need to recognize words and pronounceable nonwords produced by a human talker and a text-to-speech system. In carrying out these studies, we wanted to know how long it takes a listener to recognize an isolated word, and how the process of word recognition might be affected by the quality of the acoustic-phonetic information in the signal. To measure the duration of the recognition process, we used a lexical decision task [41], [54]. Listeners were presented with a single word or a nonword stimulus item on each trial. Each listener was required to classify the item as either a "word" or a "nonword" as quickly and accurately as possible by pressing one of two buttons located on a response box that was interfaced to a minicomputer. Examples of the stimuli are shown in Table 4.
Table 4  Examples of Lexical Decision Stimuli

Words       Nonwords
PROMINENT   PRADAMENT
BAKED       BEPT
TINY        TADGY
GLASS       CEEP
PARENTS     PEEMERS
TOLD        TAVED
BLACK       BAEP
CONCERT     CAELIMPS
DARK        DUT
BABBLE      BURTLE
CRITIC      CRAENICK
BOUGHT      BUPPED
PAIN        POON
GORGEOUS    GAETLESS
COLORED     COOBERED
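The logic of the latency analysis can be sketched compactly. The summary below uses hypothetical response times, arranged only to mirror the direction of the effect reported next (synthetic items roughly 145 ms slower); it is not the original data or apparatus.

```python
# Summarizing lexical decision latencies by speech type
# (hypothetical response times, for illustration only).
from statistics import mean

# (speech_type, lexical_status, response_time_ms)
latencies = [
    ("natural", "word", 620),   ("natural", "nonword", 690),
    ("natural", "word", 640),   ("natural", "nonword", 710),
    ("synthetic", "word", 770), ("synthetic", "nonword", 830),
    ("synthetic", "word", 780), ("synthetic", "nonword", 850),
]

for speech in ("natural", "synthetic"):
    rts = [rt for s, _, rt in latencies if s == speech]
    print(f"{speech:9s} mean response time: {mean(rts):.0f} ms")
```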
Subjects responded significantly faster to natural words and nonwords than to the synthetic items: response times to the synthetic speech were 145 ms longer than response times to the natural speech. These findings demonstrate two important differences in perception between natural and synthetic speech. First, perception of synthetic speech requires more cognitive "effort" than the perception of natural speech. Second, because the differences in latency were observed for both words and nonwords alike, and therefore do not depend on the lexical status of the test item, the extra processing effort appears to be related to the process of extracting the acoustic-phonetic information from the signal and not the process of identifying words in the lexicon. In short, the pattern of results suggests that the perceptual processes used to encode synthetic speech require more cognitive "effort" or resources than the processes used to encode natural speech.
Similar results were obtained by Pisoni [42] in a naming task using natural and synthetic words and nonwords. As in the lexical decision experiment, subjects were much slower to name synthetic test items than natural test items. Moreover, this difference was again observed for both words and nonwords. The naming results demonstrate that the extra processing time needed for synthetic speech does not depend on the type of response made by the listener, since the results were comparable for both manual and vocal responses. Taken together, these two sets of findings demonstrate that early stages of encoding synthetic speech require more processing time than encoding natural speech. Several additional studies were carried out to determine the nature and extent of the encoding differences between natural and synthetic speech.
B. Consonant-Vowel (CV) Confusions
Several hypotheses can be proposed to account for the greater difficulty of encoding synthetic speech. One hypothesis that has been suggested recently is that synthetic speech is simply equivalent to "noisy" natural speech [8], [56]. That is, the acoustic-phonetic structure of synthetic speech is more difficult to encode than natural speech for the same reasons that natural speech presented in noise is hard to perceive: the acoustic cues to phonemes are obscured, masked, or physically degraded in some way by the masking noise. According to this view, synthetic speech is on the same continuum as natural speech, but it is degraded in comparison with natural speech. In contrast, an alternative hypothesis, and the one we prefer, is that synthetic speech is not like "noisy" or degraded natural speech at all, but instead may be thought of as "perceptually impoverished" relative to natural speech. By this account, synthetic speech is fundamentally different from natural speech in both degree and kind because many of the important criterial acoustic cues are either poorly represented or not represented at all.

Spoken language is structurally rich and redundant at all levels of linguistic analysis. In particular, natural speech is highly redundant at the level of acoustic-phonetic structure. Natural speech contains multiple acoustic cues for almost every phonetic distinction, and these cues change as a function of context, speaking rate, and talker. However, in synthesizing speech by rule, only a small subset of the possible cues is typically implemented in the phonetic implementation rules. As a result, some phonetic distinctions may be minimally cued, perhaps by only a single acoustic attribute. If all cues do not have equal importance in different phonetic contexts, a single cue may not be perceptually sufficient to convey a particular phonetic distinction in all utterances (see [12]). Moreover, the reliance on minimal sets of cues in generating synthetic speech could be disastrous for perception if a particular phonetic distinction is incorrectly synthesized or masked by environmental noise. Indeed, many of the errors we have found in our analyses of the MRT data suggest that this account is correct.

These two hypotheses concerning the structural relationships between synthetic and natural speech make different predictions about the types of errors and the distribution of perceptual confusions that should be observed with synthetic speech compared to natural speech. According to the
"noisy speech" hypothesis, synthetic speech is similar to natural speech that has been degraded by the addition of noise. Therefore, the perceptual confusions that occur with
synthetic speech should be very similar to those obtained
with natural speech heard in noise. By comparison, the "impoverished speech" hypothesis claims that the acoustic-phonetic structure of synthetic speech is not as rich or redundant in segmental cues as natural speech. According to this hypothesis, two patterns of confusion errors should occur in the perception of synthetic speech. When the acoustic cues used to specify a phonetic segment are not sufficiently distinctive, confusions should occur between minimally cued segments that are phonetically similar. This error pattern should be similar to the errors predicted by the noisy speech hypothesis, since perceptual confusions of natural speech in noise also depend on the acoustic-phonetic similarity of the segments [29], [62]. However, the two hypotheses may be distinguished by the presence of a second type of error that is only predicted by the impoverished speech hypothesis. If the minimal acoustic cues used to specify phonetic contrasts are incorrect or misleading as a result of poorly specified phonetic implementation rules, then confusions should occur that are not based on the nominal acoustic-phonetic similarity of the confused segments. Instead, these confusions should be entirely determined by the listener's perceptual interpretation of the misleading cues. Thus the pattern of confusions observed with synthetic speech should be phonetically quite different from the expected ones based on the acoustic-phonetic similarity of natural speech.
To investigate the predictions made by these two hypotheses, Nusbaum, Dedina, and Pisoni [32] examined the perceptual confusions that arise within a set of 48 natural and synthetic consonant-vowel (CV) syllables. These were constructed from the vowels /i, a, u/ and the consonants /b, d, g, p, t, k, n, m, r, l, w, j, s, f, z, v/. The natural CV syllables were produced by a male talker. The synthetic syllables were generated by three different text-to-speech systems: the Votrax Type-'N'-Talk, the Speech Plus Prose-2000 v2.1, and the Digital Equipment Corporation DECtalk v1.8. To assess the pattern of perceptual confusions that occur for natural speech, the natural syllables were presented to listeners at four signal-to-noise (S/N) ratios of +28, 0, -5, and -10 dB.
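The noise manipulation itself follows a standard formula: the noise is scaled so that the ratio of signal power to noise power, expressed in decibels, equals the target S/N. Below is a minimal sketch, assuming stationary noise and a digitized signal (the original study's exact procedure may have differed).

```python
# Mixing a signal with noise at a target S/N ratio:
# scale the noise so that 10*log10(P_signal / P_noise) = snr_db.
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 10_000)
signal = np.sin(2 * np.pi * 220 * t)        # stand-in for a CV syllable
noise = rng.standard_normal(t.size)
for snr in (28, 0, -5, -10):
    mixed = mix_at_snr(signal, noise, snr)  # degraded stimulus at each S/N
```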
When averaged over the three vowel contexts, the results showed that natural speech at +28 dB S/N was the most intelligible (96.6 percent correct), followed by DECtalk (92.2 percent correct), followed by the Prose-2000 (62.8 percent correct). The Type-'N'-Talk showed the worst performance (27.0 percent correct). Of special interest were the results of more detailed error analyses, which revealed that the distributions of perceptual confusions obtained for natural and synthetic speech were often quite different. For example, in the case of DECtalk, 100 percent of the errors made in identifying the segment /r/ were due to confusions with /b/, even though this type of error never occurred for natural speech at +28 dB S/N. Even at the poorest S/N (-10 dB), where the intelligibility of natural speech in noise was actually worse than DECtalk presented without noise (29.1 percent correct versus 92.2 percent correct), this type of error accounted for only 3 percent of the total errors observed for this segment.
In order to examine the segmental confusions more precisely, we compared the confusion matrices for a particular text-to-speech system with the confusion matrices for natural speech presented at a signal-to-noise ratio that resulted in comparable overall levels of identification performance. We compared the confusion matrices for the Prose-2000 with natural speech presented at 0 dB S/N, and the confusion matrices for Votrax with natural speech presented at -10 dB S/N. An examination of the proportion of the total errors contributed by each response class (stop, nasal, liquid/glide, fricative, other) indicated that, for natural speech, most of the errors in identifying stops were due to responses that were other stop consonants. In contrast, the errors found with the Prose-2000 appeared to be more evenly distributed among stop, liquid/glide, and fricative responses. In other words, more intrusions appeared from other manner classes in the errors observed with the Prose-2000 synthetic speech than for the natural speech presented in noise. Thus the different pattern of errors obtained for the Prose-2000 and natural speech suggests that the errors produced by the Prose-2000 may be "phonetic miscues" rather than true phonetic confusions.
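The error analysis amounts to tallying the off-diagonal cells of a confusion matrix by the manner class of the response. A sketch of that tally, with hypothetical stimulus-response pairs chosen only to illustrate the bookkeeping:

```python
# Tallying consonant confusions by manner class of the response
# (stimulus-response pairs are hypothetical).
from collections import Counter

MANNER = {
    "b": "stop", "d": "stop", "g": "stop", "p": "stop", "t": "stop", "k": "stop",
    "m": "nasal", "n": "nasal",
    "r": "liquid/glide", "l": "liquid/glide", "w": "liquid/glide", "j": "liquid/glide",
    "f": "fricative", "s": "fricative", "v": "fricative", "z": "fricative",
}

pairs = [("r", "b"), ("r", "b"), ("r", "l"), ("b", "p"), ("d", "t"), ("s", "f")]

errors = Counter(MANNER[resp] for stim, resp in pairs if stim != resp)
total = sum(errors.values())
for manner, n in sorted(errors.items()):
    print(f"{manner:13s} {100 * n / total:4.0f}% of errors")
```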
The comparison between natural speech at -10 dB S/N and Votrax speech indicated that the pattern of errors in identifying stops was more similar for these conditions. Indeed, the comparison of identification errors for natural speech at 0 dB and -10 dB S/N was quite similar to the comparison between Votrax and natural speech. Thus, at least for the perception of stop consonants, the confusions of Votrax speech seem to be based on the acoustic-phonetic similarity of the confused segments, as in noisy speech. However, it should be emphasized that the overall performance level for Votrax synthetic speech was quite low to begin with. Therefore, these errors could reflect similarities that occur when performance begins to approach chance levels.
A very different pattern of results was obtained for the errors that occurred in the perception of liquids and glides. The distribution of errors for Prose-2000 speech and natural speech revealed that similar confusions were made for liquids and glides for both types of speech. However, the results were quite different for the errors made with Votrax speech and natural speech for these phonemes. For liquids and glides, the largest number of errors for Votrax speech resulted from confusions with stop consonants, while for natural speech, relatively few stop responses were observed. Thus, for liquids and glides, errors in perception of Prose-2000 speech seem to be based on acoustic-phonetic similarity, while the errors for Votrax speech seem to be phonetic miscues.
In summary, based on these confusion analyses, it should be obvious that the predictions made by the noisy speech hypothesis are simply incorrect. Two different types of errors were observed in the perception of synthetic speech. Some consonant identification errors were based on the acoustic-phonetic similarity of the confused segments. Other errors follow a pattern that can only be explained as phonetic miscues; these are errors in which the acoustic cues used in synthesis specified the wrong segment in a particular context.
C. Gating and Signal Duration
The results of the consonant-vowel confusion experiment support the conclusion that the differences in perception between natural and synthetic speech are largely the result of differences in the acoustic-phonetic properties of
the signals. More recently, we have found further support for this account using the gating paradigm [16], [47] to investigate the perception of natural and synthetic words. In an experiment carried out recently by Manous and Pisoni [25], listeners were presented with short segments of either natural or synthetic words for identification. On the first trial using a particular word, the first 50 ms of the signal was presented for identification. On subsequent trials, the signal duration was increased in 50-ms steps, so that on the next trial 100 ms of the word was presented, on the following trial 150 ms, and so on, until the entire word was presented. Manous and Pisoni found that, on the average, natural words could be identified after 67 percent of a word was heard; for synthetic words, it was necessary for listeners to hear 75 percent of a word for correct word identification. These gating results demonstrate more directly that the acoustic-phonetic structure of synthetic words conveys less information (per unit of time) than the acoustic-phonetic structure of natural speech.
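The dependent measure in such a study can be expressed as an "isolation point": the shortest gate at which the word is correctly identified, as a percentage of the word's total duration. A sketch under the assumption of 50-ms gates (the response sequence below is hypothetical):

```python
# Isolation point in a gating task: first gate with a correct
# response, as a percentage of word duration (50-ms gates).

def isolation_point(word_ms, responses, target, gate_ms=50):
    """responses[i] is the listener's answer at gate (i + 1) * gate_ms."""
    for i, answer in enumerate(responses):
        if answer == target:
            return 100.0 * (i + 1) * gate_ms / word_ms
    return 100.0  # identified only when the whole word was heard

# A 400-ms word first identified at the 300-ms gate -> 75% heard.
answers = ["?", "?", "ba", "bad", "bad", "bat", "bat", "bat"]
print(isolation_point(400, answers, "bat"))
```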
D. Conclusions: Perceptual Encoding
Taken together, our results provide strong evidence that encoding the acoustic-phonetic structure of synthetic speech is more difficult and requires more cognitive effort and capacity than encoding natural speech. One source of support for this conclusion comes from the finding that recognition of words and nonwords requires more processing time for synthetic speech compared to natural speech. This result indicates that a major source of difficulty in recognition is the extraction of phonetic information and not word recognition, since the same result was obtained for both words and nonwords. This conclusion is supported further by the findings of the CV confusion study. This experiment demonstrated significant differences in perception of the acoustic-phonetic structure of synthetic and natural speech. Synthetic speech may be viewed as a phonetically impoverished signal compared to natural speech. This was demonstrated clearly in the gating experiment using natural and synthetic speech. The results obtained from this experiment suggest that synthetic speech requires more acoustic-phonetic information than natural speech to correctly identify isolated monosyllabic words.

Taken together, the overall pattern of findings suggests that the differences in processing time between natural and synthetic speech probably lie at the processing stages involved in the extraction of basic acoustic-phonetic information from the speech waveform; that is, in the early pattern recognition process itself rather than at the more cognitive levels involved in lexical search or retrieval of words from the mental lexicon (see [43]).
The results obtained in the lexical decision and naming tasks also demonstrate that, even with relatively high levels of performance accuracy, synthetic speech requires more cognitive processing time than natural speech to recognize words presented in isolation. In these studies, however, subjects were performing relatively simple and straightforward tasks. As we noted at the outset, the specific task demands of a perceptual experiment almost always affect the speed and accuracy of a listener's response. The next series of experiments we will describe was designed to impose additional cognitive demands on subjects besides those already incurred by the acoustic-phonetic properties of synthetic speech.
Recent work on human selective attention has suggested that cognitive processes are limited by the capacity of short-term or working memory [51]. Thus any perceptual process that imposes a load on short-term memory may interfere with decision making, perceptual processing, and other subsequent cognitive operations. If perception of synthetic speech imposes a greater demand on the capacity of short-term memory than perception of natural speech, then the use of synthetic speech in applications where other complex cognitive operations are required might produce serious problems in recognition of the message.
Several years ago, Luce, Feustel, and Pisoni [24] conducted a series of experiments that were designed to determine the effects of processing synthetic speech on short-term memory capacity. In one experiment, on each trial, subjects were given two different lists of items to remember. The first list consisted of a set of digits visually presented on a CRT screen. On some trials, no digits were presented; on other trials, either three or six digits were presented in the visual display. Following the visual list, subjects were presented with a spoken list of ten natural words or ten synthetic words. After the spoken list was presented, the subjects were instructed to write down all the visual digits in the order of presentation and all the words they could remember from the auditory list.

Across all three visual conditions (no list, three, or six digits), recall of the natural words was significantly better than recall of the synthetic words. In addition, recall of both the synthetic and natural words became worse as the size of the digit lists increased. In other words, increasing the number of digits held in short-term memory impaired recall of the spoken words. But the most important finding was the interaction between the type of speech presented (synthetic versus natural) and the number of digits presented (three versus six). This interaction was revealed by the number of subjects who could recall all the digits presented in correct order. As the size of the digit lists increased, significantly fewer subjects were able to recall all the digits for the synthetic words compared to the natural words. Thus perception of the synthetic speech impaired recall of the visually presented digits more, with increasing digit list size, than did natural speech. These results demonstrate that synthetic speech requires more short-term memory capacity than natural speech. As a result, it would be expected that synthetic speech should interfere much more with other cognitive processes because it imposes greater capacity demands on the human information processing system than natural speech.
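The interaction can be seen in a simple cross-tabulation of recall by speech type and digit preload. The cell values below are hypothetical, arranged only to mirror the direction of the effect reported by Luce et al. [24]:

```python
# Recall by speech type and visual digit preload (hypothetical
# values); the natural-synthetic gap widens with memory load.
recall = {  # (speech type, digits preloaded) -> proportion recalled
    ("natural", 0): 0.55, ("natural", 3): 0.50, ("natural", 6): 0.44,
    ("synthetic", 0): 0.47, ("synthetic", 3): 0.40, ("synthetic", 6): 0.30,
}
for load in (0, 3, 6):
    gap = recall[("natural", load)] - recall[("synthetic", load)]
    print(f"digit load {load}: natural minus synthetic recall = {gap:.2f}")
```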
To test this prediction, Luce et al. [24] carried out another experiment in which subjects were presented with lists of ten words to be memorized. The lists were either all synthetic or all natural words. The subjects were required to recall the words in the same order as the original presentation. As in the previous experiment, the natural words were recalled better overall than the synthetic words. However, a more detailed analysis revealed an interaction in recall
performance depending on the position of items in the list. The first synthetic words heard in the list were recalled much less accurately than the natural words at the beginning of the lists. This result demonstrated that, in the synthetic lists, the words heard later in each list interfered with active rehearsal of the words heard earlier in the list. This is precisely the result that would be expected if the perceptual encoding of the synthetic words placed greater processing demands on short-term memory [46].
The data on serial-ordered recall of lists of natural and synthetic speech support the conclusion from the lexical decision research that the processing of synthetic speech requires more effort than the perception of natural speech. The perceptual encoding of synthetic speech requires more cognitive capacity and may, in turn, affect other cognitive processes that require active attentional resources. Previous research on capacity limitations in speech perception demonstrated that paying attention to one spoken message seriously impairs the listener's ability to detect specific words in other spoken messages (e.g., [6], [57]). Moreover, several recent experiments have shown that attending to one message significantly impairs phoneme recognition in a second stream of speech [31]. Taken together, these studies indicate that speech perception requires active attention and cognitive capacity, even at the level of encoding phonemes. As a result, increased processing demands for encoding synthetic speech may place important perceptual and cognitive limitations on the use of voice response systems in high information load conditions or severe environments. This would be especially true in cases where a listener is expected to pay attention to several different sources of information at the same time.
VI. TRAINING AND EXPERIENCE WITH SYNTHETIC SPEECH
The human observer is a very flexible processor of information. With sufficient experience, practice, and specialized training, observers are able to overcome some of the limitations on performance we have observed in our previous studies. Indeed, several researchers (e.g., [7], [44]) have reported a rapid improvement in recognition of synthetic speech during the course of their experiments. These improvements appear to be the result of subjects learning to process the acoustic-phonetic structure of synthetic speech more effectively. However, it is also possible that the reported improvements in intelligibility of synthetic speech were simply due to an increased familiarity with the experimental procedures rather than a real improvement in the perceptual processing of the synthetic speech. In order to test these alternatives, Schwab, Nusbaum, and Pisoni [49] carried out an experiment to separate the effects of training on task performance from improvements in the recognition of synthetic speech.
Three groups of subjects were given a pre-test with synthetic speech on Day 1 and a post-test with synthetic speech on Day 10 of the experiment. The pre-test established baseline performance for the Votrax Type-'N'-Talk text-to-speech system; the post-test on Day 10 was used to determine if any improvements had occurred in recognition of the synthetic speech after training. The low-cost Votrax system was used primarily because of the poor quality of its segmental synthesis. Thus ceiling effects in performance would not obscure any effects of training, and there would be room for improvement to occur during the course of the experiment.
The three groups of subjects were treated differently on Days 2-9 One group received training with Votrax syn- thetic speech One group was trained with natural speech using the same words, sentences, and paragraphs as the group trained on synthetic speech This second group served
t o control for familiarity with the specific experimental tasks Finally, a third group received no training at all on Days 2-9
O n the pre-test and post-test days, the subjects were given the MRT, isolated phonetically balanced (PB) words, and sentences for transcription The word lists were taken from PB lists; the sentences consisted of both meaningful and semantically anomalous sentences used in our earlier work Subjects were given different materials to listen to on every day of the experiment During all the training sessions (i.e., Days 2-9), subjects were presented with spoken words and sentences, and received feedback indicating the iden- tity of the stimulus presented on each trial
The results showed that performance improved dramati- cally for only one group-the subjects that were trained with the Votrax synthetic speech At the end of training, the Votrax-trained group showed significantly higher levels of performance than either of the other two groups To take one example, performance in identifying isolated PB words improved for the Votrax-trained group from about 25 per- cent correct on the pre-test to almost 70-percent correct word recognition on the post-test Similar improvements were found for all the word identification tasks
The results of this training study suggest several important conclusions. First, the effects of training appear to be related to improving or modifying the encoding process used to recognize words. Clearly, subjects were not simply learning to perform the various tasks better, since the subjects trained on natural speech showed little or no improvement in performance. Moreover, training affected performance similarly with isolated words and words in sentences, and for closed- and open-response sets. The pattern of results strongly suggests that subjects in the group trained on synthetic speech were not memorizing individual test items, nor were they learning special strategies; that is, they did not learn to use linguistic knowledge or task constraints to improve their recognition performance. Rather, subjects learned something about the structural characteristics of this particular text-to-speech system that enabled them to perform better regardless of the task. This conclusion is further supported by the design of the training study. Improvements in performance were obtained with novel materials even though the subjects never heard the same words or sentences more than once during the entire experiment. In order to show improvements in performance, subjects must have acquired detailed information and knowledge about the rule system used to generate the synthetic speech. They could not have shown improvements in performance on the post-test if they had simply memorized individual words or sentences, since novel materials were used in this test too.
In addition to these findings, we also found that subjects retained the training even after six months with no further contact with the synthetic speech. Thus, it appears that
training produced a relatively stable and long-term change in the perceptual encoding processes used by subjects. Furthermore, it is likely that more extensive training would have produced even greater persistence of the training effects. If subjects had been trained to asymptotic levels of performance, the long-term effects of training might have been even more stable. The results of this study demonstrate that human listeners can modify their perceptual strategies in encoding synthetic speech and that substantial increases in performance can be realized in relatively short periods of time, even with poor-quality synthetic speech.
VII. DIRECTIONS FOR FUTURE RESEARCH

A. Research on Comprehension
Most of the research on synthetic speech produced by text-to-speech systems has, in the past, been concerned with the acoustic-phonetic output generated by these systems (see, however, [21], [MI], [49]). Researchers have focused attention on improving the segmental intelligibility of synthetic speech. At this point in time, the available perceptual data suggest that segmental intelligibility is quite good for some systems (DECtalk, Prose, Infovox) and, while it is not at the same level as natural speech, it may take a great deal of additional effort to achieve relatively small further gains. On the other hand, little research effort has been directed towards assessing listening comprehension in a more general sense. Our initial efforts used relatively gross and insensitive measures of comprehension, even though these measures revealed small but reliable differences in comprehension performance between natural and synthetic speech.
Additional research is needed to understand the precise role that practice and familiarity play in comprehension and understanding. As we noted earlier, performance in comprehension tasks improves with experience in listening to the synthetic speech. Additional research should be carried out on the nature of practice and familiarity effects and on how the subject's criteria and perceptual strategies are modified after listening to synthetic speech. There are still many questions to be answered: How much practice does a listener need? Does performance with synthetic speech reach the same levels as with natural speech? Does training reduce the capacity demands imposed by synthetic speech? These questions need to be studied in carefully designed laboratory experiments using more sophisticated and sensitive measures of perception and comprehension.
B. On-Line Measures of Linguistic Processing
In order to understand the moment-to-moment demands that occur while listening to fluent synthetic speech, we will need to use on-line measures that tap the real-time computational processes used by listeners to perceive and comprehend fluent speech. The use of phoneme- and word-monitoring tasks, which require listeners to respond while processing the speech input, may provide some insight into the covert processes listeners use to understand synthetic speech (see [11]). Other psycholinguistic tasks, such as mispronunciation detection [9], may also be useful. In these tasks, response latencies are used to measure cognitive processing.
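As a rough illustration of how latencies from a word-monitoring task could be summarized, the sketch below compares mean detection times for natural and synthetic passages. The trial structure and all of the numbers are invented for illustration; they are not data from any of the studies cited above.

# Hypothetical word-monitoring trials: each tuple records the
# passage condition, whether the target word was detected, and
# the response latency in milliseconds (None for misses).
trials = [
    ("natural",   True,  412), ("natural",   True,  389),
    ("natural",   False, None), ("synthetic", True,  534),
    ("synthetic", True,  598), ("synthetic", False, None),
]

def mean_latency(condition):
    # Average latency over correct detections in one condition;
    # misses carry no latency and are excluded from the mean.
    lats = [ms for cond, hit, ms in trials if cond == condition and hit]
    return sum(lats) / len(lats) if lats else float("nan")

for cond in ("natural", "synthetic"):
    print(f"{cond}: mean detection latency {mean_latency(cond):.0f} ms")

On this logic, systematically longer detection latencies for synthetic passages would point to a greater moment-to-moment processing load, even when accuracy is comparable.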
C. Habituation and Attention to Synthetic Speech
When listening to long passages of synthetic speech, one often experiences difficulty in maintaining focused attention on the linguistic content of the passage. While the results obtained in our listening comprehension tests indicated that subjects did, indeed, comprehend these passages quite well, we do not have any evidence that subjects paid full attention to the passages (see also [21]). We also have subjective reports from other experiments suggesting that subjects are “tuning in” and “fading out” as they listen to long passages of synthetic speech. Is synthetic speech more fatiguing to listen to than natural speech? Can a listener fully comprehend a passage when only part of the passage is processed? How is the listener's attention allocated in listening to synthetic speech compared, for example, to natural speech, and how does it change with other demands on processing capacity in short-term memory? These are all important questions that await further study.
D. Subjective Evaluation and Listener Preference
In addition to the quality of the synthetic speech signal itself, another consideration in the evaluation of synthetic speech concerns the user's preferences and biases. If an individual using a particular text-to-speech system cannot tolerate the sound of the speech or does not trust the information provided by the voice output device, the usefulness of this technology will be reduced. With this goal in mind, we have developed a questionnaire to assess subjective ratings of synthetic speech [36]. Some preliminary data have been collected using various types of stimulus materials and several synthesis systems. In general, we have found that listeners' subjective evaluations of their performance correlate well with objective measures of performance. We have also found that the degree to which subjects are willing to trust the information provided by synthetic speech is positively correlated with objective measures of performance. For the naive user, poor performance predicts low levels of belief in the messages, whereas high levels of accuracy predict a greater degree of confidence.
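The correlational claim here is straightforward to compute. The sketch below calculates a Pearson product-moment correlation between per-listener subjective ratings and objective recognition accuracy; the paired observations are invented for illustration and are not data collected with the questionnaire in [36].

import math

# Hypothetical per-listener data: a mean subjective rating on a
# 1-7 scale paired with word-recognition accuracy in percent.
ratings  = [2.1, 3.4, 4.0, 4.8, 5.5, 6.2]
accuracy = [31,  45,  52,  61,  74,  83]

def pearson_r(x, y):
    # Pearson product-moment correlation between two sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"r = {pearson_r(ratings, accuracy):.2f}")

A value of r near +1 for data like these is what the positive relation between subjective confidence and objective performance described above would look like in practice.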
E. Research on the Applications of Voice Output Technology
Finally, there are many unanswered questions related to the use of voice response systems in real-world applications. Additional research is needed on the use of synthetic speech in settings where it is already being used or could be used. Except for a few studies reporting the use of synthetic speech in military, business, and industrial settings (see, for example, [17], [52], [56], [61]), most of the reports concerning the use of synthetic speech describe a new or novel application but do not evaluate the usefulness, success, or failure of the application.
VIII. SUMMARY AND CONCLUSIONS

Evaluating the use of voice response systems employing synthetic speech is not just a matter of conducting standardized intelligibility tests. Different applications will impose different demands and constraints on observers. Thus