
A Study on Prosody of Vietnamese Emotional Speech

Thi Duyen Ngo, The Duy Bui

Human Machine Interaction Laboratory, University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam

duyennt@vnu.edu.vn

Abstract

This paper describes analyses of the prosody of Vietnamese emotional speech, carried out to find the relations between prosodic variations and emotional states in Vietnamese speech. These relations were obtained by investigating the variations of prosodic features in Vietnamese emotional speech in comparison with the prosodic features of neutral speech. The analyses were performed on a multi-style emotional speech database consisting of Vietnamese sentences uttered in different styles. Specifically, four emotional styles were considered: happiness, sadness, cold anger, and hot anger. Speech data in the neutral style were also collected, and the prosodic differences of each style with respect to this neutral baseline were quantified. The acoustic features related to prosody that were investigated were fundamental frequency, power, and duration. From the analysis results, a set of prosodic variation coefficients was produced for each emotional style and for each speaker of the database. These coefficients will help bring emotion into Vietnamese synthesized speech, making it more natural.

Keywords: Vietnamese, Prosody, Acoustic Feature, Emotional Speech.

1 Introduction

Speech is one of the most convenient and important means that humans use to communicate with each other. Apparently, we do not use only linguistic meaning to convey our intentions and feelings; we also, consciously or unconsciously, inject our emotions into speech. Emotion plays an extremely important role in our communication. For this reason, [...] attempts, on the acoustic aspect, we need to have detailed knowledge of how acoustic characteristics of the voice are related to emotions.

The review of literature has shown that two types of acoustic cues have great influence on the emotional state in speech. One is related to prosody and the other to voice quality. A prosodic change in an utterance leads to a change in the perception of emotional speech [6]. Therefore, prosody is an important factor to investigate when looking for acoustic feature variations related to emotional states in speech. In addition to prosody, voice quality is another acoustic cue on which researchers have focused. In this study we focus on prosody analyses; voice quality will be examined in later work.

In this paper, we describe some analyses of the prosody of Vietnamese emotional speech, carried out to find the relations between prosodic variations and emotional states in Vietnamese speech. Specifically, a Vietnamese emotional speech database was recorded and analysed to verify the correlations and to quantify, for each emotional style, the prosodic feature variations with respect to the neutral style. The database consisted of Vietnamese sentences uttered in five different styles: neutral, happiness, sadness, cold anger, and hot anger. From the analysis results, a set of prosodic variation coefficients was produced for each emotional style and for each speaker of the database. These coefficients will help bring emotion into Vietnamese synthesized speech, making it more natural.

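The rest of the paper is organized as follows. Section 2 reviews related works. Section 3 describes the emotional speech database. Section 4 presents the prosodic feature extraction and the analysis results. Section 5 concludes the paper and outlines future work.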

2012 Fourth International Conference on Knowledge and Systems Engineering

2 Related Works

In speech, prosody is essentially a collection of factors that control the pitch, loudness, and rate of speaking. Variations of intonation, rhythm, and stress pattern belong to what we call the prosody of a sentence. Depending on the emotional state of the speaker, a sentence can be uttered with different prosodic characteristics. Therefore, the prosodic variations in an utterance have great influence on the emotions expressed in speech [6]. This is why prosody is an important factor to investigate when looking for acoustic feature variations related to emotional states in speech. On the acoustic side, the cues considered significant for prosody are largely extracted from fundamental frequency (F0), power, and duration.

Fundamental frequency (F0)

Physically, F0 reflects the pitch perceived by listeners. The F0 contour, which represents the change of F0 over time, provides information about the accent and intonation of an uttered sentence. Such information has great influence on the perception of emotional states in speech. Therefore, in the field of emotional speech research, F0 is the acoustic cue that has been studied most frequently and for the longest time. Erickson [2] presented a summary of previous studies that sought to find which types of acoustic cues were related to emotional states in speech. Most works found that the F0 contour had a deep effect on emotional states in speech, regardless of the data collection method and the language used.

Power

Power, which is determined by the volume of air flow sent out by the lungs, primarily reflects the loudness perceived by listeners. Like the F0 contour, the power envelope also affects emotional states in speech. Power can vary widely when the speaker is in different emotional states. The relationship between the power envelope and emotional states in speech has been reported in a number of studies (e.g., [3, 10, 14]).

Duration

Duration primarily reflects the time-related factors of sound that listeners perceive, such as pause length and total utterance length. The same word or sentence uttered with different lengths can be perceived differently. In the field of emotional speech research, several works have shown an effect of duration on emotional states in speech in different languages, namely Japanese, English, and Italian [3, 12, 13, 14].

As a monosyllabic and tonal language, Vietnamese has particular prosodic features in comparison with those of European (polysyllabic) languages. Vietnamese prosody is related to rhythm (word duration) between words in a word group or in a compound word, while intonation (raising or lowering the tone by augmenting or reducing the amplitude and/or the frequency of all words) has a global effect on the whole sentence. There is no need to change the intonation of a word in order to highlight it, because each word has its own meaning thanks to one of the six accents. Moreover, unlike in polysyllabic languages, there is no question of emphasizing a syllable in a Vietnamese word, because each word has only one syllable. It is not necessary to pronounce a word in a sentence more strongly than the others, except when the speaker wants to express a special intention (e.g., one can pronounce some words stronger and lower than the others in order to make them more important).

Up to now, there have been some works on the prosody of Vietnamese speech. Le [9] proposed and proved five hypotheses on the duration of Vietnamese speech, based on analysing 36 files of 20,815 words read by broadcasters from several distinctive regions of Vietnam. According to [15], the factors that affect the duration of a Vietnamese phonetic unit are the position, the pitch, and the structure of that unit. In [4], Vietnamese compounds and phrasal constructions were investigated for phonetic correlates of lexical stress; acoustic and perceptual characteristics of Vietnamese compound words and their phrasal counterparts were reported. In [7, 8], Le described some results of research on acoustic features of Vietnamese speech to support synthesising Vietnamese speech from text. Mac [11] presented a study on audio-visual prosodic attitudes in Vietnamese; it showed the relative contribution of audio, visual, and audio-visual information to attitude perception, and how native and non-native listeners recognize and confuse the attitudes. An analysis of speech prosody was also carried out to further validate the results of the perception experiments and to bring out some prosodic characteristics of Vietnamese social affects. Almost all of these studies have focused on Vietnamese neutral speech; very few have focused on Vietnamese emotional speech.

3 Emotional Speech Database

The emotional speech database used for investigating Vietnamese prosodic features consisted of Vietnamese utterances produced by two professional Vietnamese actors, one male and one female. The two actors were asked to produce utterances in five different styles. They uttered 19 sentences in four emotional styles: happiness, cold anger, sadness, and hot anger. In addition, they recorded the same 19 sentences in a neutral way. Consequently, each sentence has one utterance in each of the five styles, for both the male and the female voice. There is therefore a total of 190 utterances in the database, half for the male voice and half for the female voice.

The sentences were about 8 words long and well representative of the Vietnamese phonetic alphabet. Most of them were nonsense sentences with no semantic emotional content, and therefore could not influence the actors by provoking any particular emotional attitude. During the recording sessions, the actors had to simulate each of the emotional styles in turn, and a director was always present to control their pronunciation and prosody and to avoid emphatic performances. Signals were recorded in a soundproof room with a high-quality microphone and digital acquisition equipment, and waveforms were digitally acquired with the parameters specified in Table 1.

Table 1. Specifications of Voice Data

Sampling frequency    22050 Hz
Quantization          16 bit
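The size of the database follows directly from the recording design described in this section. As a quick check, the minimal sketch below enumerates one utterance per speaker, style, and sentence. The identifiers (`SPEAKERS`, `STYLES`, `utterances`) are illustrative names chosen here, not part of the original database.

```python
from itertools import product

# Recording design taken from the text: 2 actors, 5 styles, 19 sentences.
SPEAKERS = ["male", "female"]
STYLES = ["neutral", "happiness", "sadness", "cold anger", "hot anger"]
N_SENTENCES = 19

# One utterance per (speaker, style, sentence) combination.
utterances = list(product(SPEAKERS, STYLES, range(1, N_SENTENCES + 1)))

total = len(utterances)
per_speaker = {s: sum(1 for u in utterances if u[0] == s) for s in SPEAKERS}

print(total)        # 190
print(per_speaker)  # {'male': 95, 'female': 95}
```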

4 Prosodic Feature Extraction and Analysis Results

In this section, we describe the prosodic feature extraction phase and the analysis results. The acoustic features investigated were fundamental frequency (F0), power, and duration. The F0 contour and the power envelope were calculated using STRAIGHT [5] with an FFT length of 1024 points and a frame shift of 1 ms. The sampling frequency was 22050 Hz. Time durations were specified manually with the partial support of WaveSurfer [1]. A total of 9 acoustic parameters were calculated and analysed in order to find the relations between prosodic variations and emotional states in Vietnamese speech. These parameters are: highest pitch (HP), average pitch (AP), and pitch range (PR), derived from F0; power range (PWR), average power (APW), and maximum power (HPW), derived from the power envelope; and total length (TL), consonant length (CL), and mean of pause lengths (MPAU), derived from duration. For each of these parameters and each emotional style, the mean value of the variation coefficients with respect to the baseline (neutral style) was calculated. These values are reported and discussed in the next subsections.
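The paper does not spell out the formula behind the variation coefficients. One plausible reading, shown in the sketch below, is the relative change of a parameter with respect to the neutral utterance of the same sentence, averaged over sentences and expressed in percent. The function names and the sample values are hypothetical and serve only as illustration.

```python
from statistics import mean

def variation_percent(emotional, neutral):
    """Relative change of one parameter with respect to its neutral value, in percent."""
    if neutral == 0:
        raise ValueError("neutral value must be non-zero")
    return 100.0 * (emotional - neutral) / neutral

def mean_variation(emotional_values, neutral_values):
    """Mean variation over sentences; values are paired by sentence index."""
    if len(emotional_values) != len(neutral_values):
        raise ValueError("both styles must cover the same sentences")
    return mean(variation_percent(e, n)
                for e, n in zip(emotional_values, neutral_values))

# Hypothetical average-pitch values (Hz) for three sentences, not measured data.
neutral_ap = [120.0, 125.0, 110.0]
hot_angry_ap = [138.0, 150.0, 121.0]

print(round(mean_variation(hot_angry_ap, neutral_ap), 2))
```

A positive result means the parameter is higher in the emotional style than in the neutral one, which matches the sign convention of the tables in this section.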

4.1 Fundamental Frequency - F0

For each utterance, the F0 information was first extracted using STRAIGHT [5]. From this information, some acoustic parameters related to the F0 contour were then measured. These parameters were highest pitch (HP), average pitch (AP), and pitch range (PR). The mean variation values of these parameters for both the male and the female voice are reported in Table 2. The analysis results showed that three of the four emotional styles, namely happiness, cold anger, and hot anger, had increased values with respect to the neutral case, for all parameters and for both speakers. Among them, the hot angry style had the largest variations in the F0 contour; all three F0 parameters had their biggest increases in this style. In the sad style, on the other hand, the F0 parameters varied in a different way: AP and HP decreased while PR increased with respect to the neutral case, for both the male and the female voice. Differences were also found between the two speakers' voices. With the female voice, the increases of all three parameters in the happy style were larger than those in the cold angry style. With the male voice, however, AP and HP in the happy style had bigger increases than in the cold angry style, while PR had a smaller increase. Another difference was that in the sad style, the decreases of AP and HP for the female voice were much smaller than those for the male voice; these two parameters of the female voice decreased almost imperceptibly. By contrast, the increase of PR in the sad style was considerably bigger for the female voice than for the male voice. These differences were due to the fact that the speakers expressed emotions in different ways and with different intensities.

Table 2. Mean variations of F0 parameters for four emotional styles with respect to the neutral one.

            happy     sad       cold angry   hot angry
Male
  HP        7.70%     -3.11%    6.00%        15.90%
  AP        6.88%     -3.36%    5.34%        16.51%
  PR        33.14%    19.10%    39.97%       41.90%
Female
  HP        10.56%    -0.47%    7.63%        14.42%
  AP        7.25%     -0.35%    5.66%        13.01%
  PR        49.35%    31.26%    41.65%       56.89%
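To make the three F0 parameters concrete, the sketch below derives them from a frame-wise F0 contour. It assumes that unvoiced frames are marked with 0 and are excluded, and that pitch range is the difference between the highest and lowest voiced F0 values. The paper does not state its exact definitions, and the contour values are invented for illustration.

```python
def f0_parameters(f0_contour):
    """Return highest pitch (HP), average pitch (AP) and pitch range (PR) in Hz.

    Assumption: frames with F0 <= 0 are unvoiced and are ignored.
    """
    voiced = [f for f in f0_contour if f > 0]
    if not voiced:
        raise ValueError("contour contains no voiced frames")
    hp = max(voiced)
    ap = sum(voiced) / len(voiced)
    pr = hp - min(voiced)  # assumed definition: max minus min
    return {"HP": hp, "AP": ap, "PR": pr}

# Invented contour (Hz), one value per frame; 0 marks unvoiced frames.
contour = [0, 0, 110, 120, 130, 140, 0, 125, 115, 0]
print(f0_parameters(contour))
```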

4.2 Power

The power envelope was measured in a way similar to that for the F0 contour. Power information was first extracted using STRAIGHT [5], and then acoustic parameters related to the power envelope were calculated. The acoustic parameters considered were maximum power (HPW), average power (APW), and power range (PWR). Table 3 presents the mean variation values of these parameters for both the male and the female voice. With respect to the neutral style, all three parameters increased in the happy, cold angry, and hot angry styles, while they decreased in the sad one. The high-activation styles were characterized by bigger variation values, and significant power peaks sometimes occurred in the final parts of the sentences too. As with the F0 contour, there were some differences in variation values between the two speakers' voices. For example, with the male voice, APW and HPW had bigger increases in the happy style than in the hot angry style. By contrast, with the female voice, these two parameters had smaller increases in the happy style than in the hot angry style. The reasons for these differences are the same as those for the differences in the F0 analysis results.

Table 3. Mean variations of power parameters for four emotional styles with respect to the neutral one.

            happy     sad       cold angry   hot angry
Male
  APW       20.81%    -16.38%   4.38%        11.48%
  HPW       18.87%    -6.95%    8.53%        14.60%
  PWR       8.38%     -6.42%    14.37%       20.45%
Female
  APW       28.82%    -13.15%   38.82%       48.58%
  HPW       14.84%    -8.95%    25.98%       34.31%
  PWR       11.32%    -7.97%    18.11%       25.70%
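The study obtains the power envelope from STRAIGHT. As a simpler stand-in, and not the authors' method, the sketch below computes a short-time mean-square power envelope over non-overlapping frames and derives the three power parameters from it. The frame length, the use of linear power rather than dB, and the definition of power range as maximum minus minimum are all assumptions.

```python
import math

def power_envelope(samples, frame_len):
    """Mean-square power of consecutive non-overlapping frames (linear scale)."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def power_parameters(envelope):
    """Return average power (APW), maximum power (HPW) and power range (PWR)."""
    if not envelope:
        raise ValueError("envelope is empty")
    hpw = max(envelope)
    return {
        "APW": sum(envelope) / len(envelope),
        "HPW": hpw,
        "PWR": hpw - min(envelope),  # assumed definition: max minus min
    }

# Synthetic test signal: a 100 Hz sine at 22050 Hz whose amplitude doubles halfway.
fs = 22050
signal = [
    (0.5 if n < fs // 2 else 1.0) * math.sin(2 * math.pi * 100 * n / fs)
    for n in range(fs)
]
env = power_envelope(signal, frame_len=2205)  # 100 ms frames
print(power_parameters(env))
```

On this synthetic signal the louder second half yields frames with four times the power of the first half, since power grows with the square of the amplitude.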

4.3 Time Duration

For each utterance, the time segmentation information was first measured manually. The measurement included phoneme number, time (ms), and vowel. The durations of all phonemes, both consonants and vowels, as well as pauses, were specified manually with the partial support of WaveSurfer [1]. Figure 1 illustrates an example of time segmentation. In the table, the first row indicates the phonemes; the second row gives the order of the phonemes, with -1 noted before the first phoneme; the third row indicates the start time of the next phoneme; the fourth row shows whether each phoneme is a consonant or a vowel (1: vowel, 2: consonant). Based on this table of time segmentation, the following duration parameters were measured: mean of pause lengths (MPAU), total length (TL), and consonant length (CL). The mean variation values of these parameters are reported in Table 4 for both the male and the female voice. The total utterance length increased in all four emotional styles, while the consonant length increased in the sad style and decreased in the other styles. For the mean of pause lengths, there was a difference between the male voice and the female voice: in the cold angry style, this parameter increased for the male voice but decreased for the female voice. The reason for this difference is that the two speakers expressed the cold angry emotion in different ways.

Figure 1. An example of time segmentation.

Table 4. Mean variations of duration parameters for four emotional styles with respect to the neutral one.

            happy     sad       cold angry   hot angry
Male
  MPAU      -26.57%   50.81%    57.83%       62.83%
  CL        -16.25%   15.49%    -17.82%      -24.57%
  TL        16.09%    19.62%    14.03%       19.26%
Female
  MPAU      -18.01%   49.11%    -20.49%      53.03%
  CL        -13.65%   13.29%    -15.80%      -23.44%
  TL        11.65%    20.32%    11.47%       25.78%
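The duration parameters can be derived mechanically from a segmentation table like the one described in this subsection. The sketch below assumes each segment is stored as a label, an end time in ms (the start time of the next segment), and a type code, with 1 for vowel and 2 for consonant as in the text. The code 0 for a pause, the tuple layout, and the sample segments are assumptions made for illustration.

```python
VOWEL, CONSONANT, PAUSE = 1, 2, 0  # 1 and 2 follow the text; 0 is an assumed pause code

def duration_parameters(segments, start_ms=0.0):
    """Compute total length (TL), consonant length (CL) and mean pause length (MPAU), in ms.

    segments: list of (label, end_ms, kind) in temporal order, where end_ms is
    the start time of the next segment.
    """
    consonant_total = 0.0
    pauses = []
    previous_end = start_ms
    for label, end_ms, kind in segments:
        length = end_ms - previous_end
        if length < 0:
            raise ValueError(f"segment {label!r} ends before it starts")
        if kind == CONSONANT:
            consonant_total += length
        elif kind == PAUSE:
            pauses.append(length)
        previous_end = end_ms
    return {
        "TL": previous_end - start_ms,
        "CL": consonant_total,
        "MPAU": sum(pauses) / len(pauses) if pauses else 0.0,
    }

# Invented segmentation, not taken from the database.
example = [
    ("t", 60.0, CONSONANT),
    ("oi", 210.0, VOWEL),
    ("<pause>", 290.0, PAUSE),
    ("d", 340.0, CONSONANT),
    ("i", 480.0, VOWEL),
    ("<pause>", 600.0, PAUSE),
    ("h", 650.0, CONSONANT),
    ("oc", 820.0, VOWEL),
]
print(duration_parameters(example))
```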

5 Conclusion and Future Works

This paper has presented the results of some analyses of the prosody of Vietnamese emotional speech. The analyses were performed on an emotional speech database consisting of Vietnamese sentences uttered in five different styles. The relations between prosodic variations and emotional states in Vietnamese speech were obtained by investigating the variations of prosodic features of Vietnamese emotional speech in comparison with those of neutral speech. Acoustic parameters related to fundamental frequency, power, and duration were measured and analysed for all utterances in the database. From the analysis results, a set of prosodic variation coefficients was produced for each emotional style and for each speaker of the database. These coefficients will help bring emotion into Vietnamese synthesized speech, making it more natural. Further studies are necessary to find the relations between acoustic spectrum variations and emotional states in Vietnamese speech. In the future, we will carry out this work and then use the results to construct a Vietnamese emotional speech synthesis system.
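As a rough illustration of how such coefficients might feed a synthesizer, the sketch below scales neutral prosodic targets by the mean variations reported for the male voice in the hot angry style (Tables 2 to 4). The multiplicative rule, the function name, and the neutral target values are assumptions; the paper does not describe a synthesis procedure.

```python
# Mean variations for the male voice, hot angry style, copied from Tables 2 to 4 (percent).
HOT_ANGRY_MALE = {
    "HP": 15.90, "AP": 16.51, "PR": 41.90,
    "APW": 11.48, "HPW": 14.60, "PWR": 20.45,
    "MPAU": 62.83, "CL": -24.57, "TL": 19.26,
}

def apply_variations(neutral_params, variations_percent):
    """Scale each neutral parameter by its variation coefficient (assumed multiplicative)."""
    return {
        name: value * (1.0 + variations_percent[name] / 100.0)
        for name, value in neutral_params.items()
    }

# Hypothetical neutral targets for one utterance (Hz and ms), not measured data.
neutral = {"AP": 120.0, "HP": 160.0, "TL": 2000.0}
targets = apply_variations(neutral, HOT_ANGRY_MALE)
print({k: round(v, 2) for k, v in targets.items()})
```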

6 Acknowledgement

This work is supported by the project Towards a Model of an "Intelligent Office Environment", No. QGTD.10.23.

References

[1] WaveSurfer. http://www.speech.kth.se/wavesurfer/index.html

[2] D. Erickson. Expressive speech: Production, perception and application to speech synthesis. Acoust. Sci. & Tech., 26:317–325, 2005.

[3] G. L. Huttar. Relations between prosodic variables and emotions in normal American English utterances. Journal of Speech and Hearing Research, 11:481–487.

[4] J. Ingram and T. Nguyen. Stress, tone and word prosody in Vietnamese compounds. Proceedings of the 11th Australian International Conference on Speech Science & Technology, pages 193–198, 2006.

[5] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27:187–207, 1999.

[6] R. D. Kent and C. Read. Acoustic Analysis of Speech. San Diego: Singular Publishing Group, 1992.

[7] H. M. Le and K. H. Le. Analysis and synthesis for duration feature of Vietnamese. The 6th National Conference in Information Technology, Thainguyen, Vietnam, 2003.

[8] H. M. Le and T. N. Quach. Some results in phonetic analysis to Vietnamese text-to-speech synthesis based on rules. Journal on Information and Communication Technology, 2006.

[9] T. H. Le, A. V. Nguyen, V. H. Truong, V. H. Bui, and D. Le. A study on Vietnamese prosody. New Challenges for Intelligent Information and Database Systems, 351:63–73, 2011.

[10] L. Leinonen. Expression of emotional-motivational connotations with a one-word utterance. J. Acoust. Soc. Am., 102:1853–1863, 1997.

[11] D. K. Mac, E. Castelli, V. Aubergé, and A. Rilliard. How Vietnamese attitudes can be recognized and confused: Cross-cultural perception and speech prosody analysis. International Conference on Asian Language Processing, pages 220–223, 2011.

[12] K. Maekawa. Phonetic and phonological characteristics of paralinguistic information in spoken Japanese. Proc. Int. Conf. Spoken Language Processing, pages 635–638, 1998.

[13] M. D. Pell. Influence of emotion and focus location on prosody in matched statements and questions. J. Acoust. Soc. Am., 109:1668–1680, 2001.

[14] K. R. Scherer, R. Banse, H. G. Wallbott, and T. Goldbeck. Vocal cues in emotion encoding and decoding. Motivation and Emotion, 15:123–148, 1991.

[15] D. D. Tran, E. Castelli, J.-F. Serignat, and V. B. Le. Analysis and modeling of syllable duration for Vietnamese speech synthesis. O-COCOSDA, 2007.
