A Study on Prosody of Vietnamese Emotional Speech
Thi Duyen Ngo, The Duy Bui Human Machine Interaction Laboratory University of Engineering and Technology Vietnam National University, Hanoi, Vietnam
duyennt@vnu.edu.vn
Abstract
This paper describes analyses of the prosody of Vietnamese emotional speech, carried out to find the relations between prosodic variations and emotional states in Vietnamese speech. These relations were obtained by investigating the variations of prosodic features in Vietnamese emotional speech in comparison with the prosodic features of neutral speech. The analyses were performed on a multi-style emotional speech database consisting of Vietnamese sentences uttered in different styles. Specifically, four emotional styles were considered: happiness, sadness, cold anger, and hot anger. Speech data in the neutral style were also collected, and the prosodic differences of each style with respect to this neutral baseline were quantified. The acoustic features related to prosody that were investigated were fundamental frequency, power, and duration. Based on the analysis results, a set of prosodic variation coefficients was produced for each emotional style and for each speaker of the database. These coefficients will help bring emotions into Vietnamese synthesized speech, making it more natural.
Keywords: Vietnamese, Prosody, Acoustic Feature, Emotional Speech.
1 Introduction
Speech is one of the most convenient and important ways that humans use to communicate with each other. Apparently, we do not use only linguistic meaning to convey our intentions and feelings, but also consciously or unconsciously inject our emotions into speech. Emotion plays an extremely important role in our communication. For this reason, [...] attempts, on the acoustic aspect, we need to have detailed knowledge of how acoustic characteristics of the voice are related to emotions.
The review of the literature has shown that two types of acoustic cues have a great influence on the emotional state conveyed in speech. One is related to prosody and the other to voice quality. A prosodic change in an utterance leads to a change in the perception of emotional speech [6]. Therefore, prosody is an important factor to investigate when looking for acoustic feature variations related to emotional states in speech. In addition to prosody, voice quality is another acoustic cue on which researchers have focused. In this study, we focus on prosody analyses; voice quality will be examined in later work.
In this paper, we describe some analyses of the prosody of Vietnamese emotional speech, carried out to find the relations between prosodic variations and emotional states in Vietnamese speech. Specifically, a Vietnamese emotional speech database was recorded and analysed to verify the correlations and to quantify, for the emotional styles, the prosodic feature variations with respect to the neutral situation. The database consisted of Vietnamese sentences uttered in five different styles: neutral, happiness, sadness, cold anger, and hot anger. Based on the analysis results, a set of prosodic variation coefficients was produced for each emotional style and for each speaker of the database. These coefficients will help bring emotions into Vietnamese synthesized speech, making it more natural.
The rest of the paper is organized as follows. Section 2 [...]
2012 Fourth International Conference on Knowledge and Systems Engineering
2 Related Works
In speech, prosody is essentially a collection of factors that control the pitch, loudness, and rate of speaking. The variations of intonation, rhythm, and stress pattern belong to what we call the prosody of a sentence. Depending on the emotional state of the speaker, a sentence can be uttered with different prosodic characteristics. Therefore, the prosodic variations in an utterance have a great influence on the emotions expressed in speech [6]. This is why prosody is an important factor to investigate when looking for acoustic feature variations related to emotional states in speech. On the acoustic side, the cues considered significant for prosody are largely extracted from fundamental frequency (F0), power, and duration.
Fundamental frequency (F0)
Physically, F0 reflects the pitch that is perceived by listeners. The F0 contour, which represents the change of F0 over time, provides information about the accent and intonation of an utterance. Such information has a great influence on the perception of emotional states in speech. Therefore, in the field of emotional speech research, F0 is the acoustic cue that has been studied earliest and most frequently. Erickson [2] has presented a summary of previous research on which types of acoustic cues are related to emotional states in speech. Most works found that the F0 contour had a deep effect on emotional states in speech, regardless of the data collection method and the language used.
Power
Power, which is determined by the volume of air flow sent out by the lungs, primarily reflects the loudness that is perceived by listeners. Like the F0 contour, the power envelope also affects emotional states in speech. The power can vary widely when the speaker is in different emotional states. The relationship between the power envelope and emotional states in speech has been reported in a number of studies (e.g. [3, 10, 14]).
Duration
Duration primarily reflects the time-related factors of sound that listeners perceive, such as pause length and total utterance length. The same word or sentence uttered with different lengths can be perceived differently. In the field of emotional speech research, several works have shown an effect of duration on emotional states in speech in different languages, namely Japanese, English, and Italian [3, 12, 13, 14].
As a monosyllabic and tonal language, Vietnamese has particular prosodic features in comparison with those of European (polysyllabic) languages. Vietnamese prosody is related to rhythm (word duration) between words in a word group or in a compound word, while intonation (raising or lowering the tone by augmenting or reducing the amplitude and/or the frequency of all words) has a global effect on the whole sentence. There is no need to change the intonation of a word in order to highlight it, because each word has its own meaning thanks to one of the six accents. Moreover, unlike in polysyllabic languages, there is no question of emphasizing a syllable in a Vietnamese word, because each word has only one syllable. It is not necessary to pronounce a word in a sentence more strongly than the others, except when the speaker wants to express a special intention (e.g. one can pronounce some words more strongly and lower than the others in order to make them more important).
Up to now, there have been some works on the prosody of Vietnamese speech. Le [9] proposed and proved five hypotheses about duration in Vietnamese speech, based on an analysis of 36 files containing 20,815 words read by broadcasters from several distinct regions of Vietnam. According to [15], the factors that affect the duration of a Vietnamese phonetic unit are the position, the pitch, and the structure of that unit. In [4], Vietnamese compounds and phrasal constructions were investigated for phonetic correlates of lexical stress; acoustic and perceptual characteristics of Vietnamese compound words and their phrasal counterparts were reported. In [7, 8], Le described some results of research on acoustic features of Vietnamese speech to support the synthesis of Vietnamese speech from text. Mac [11] presented a study on audio-visual prosodic attitudes in Vietnamese; it showed the relative contribution of audio, visual, and audio-visual information to attitude perception, and how native and non-native listeners recognize and confuse the attitudes. An analysis of speech prosody was also carried out in order to further validate the results of the perception experiments and to bring out some prosodic characteristics of Vietnamese social affects. Almost all of this research has focused on Vietnamese neutral speech; very few studies have focused on Vietnamese emotional speech.
3 Emotional Speech Database
The emotional speech database used for investigating Vietnamese prosodic features consisted of Vietnamese utterances produced by two professional Vietnamese actors, one male and one female. The two actors were asked to produce utterances in five different styles. They uttered 19 sentences in four emotional styles: happiness, cold anger, sadness, and hot anger. In addition, they recorded the same 19 sentences in a neutral way. Consequently, each sentence has one utterance in each of the five styles, for both the male and the female voice. There is therefore a total of 190 utterances in the database (19 sentences × 5 styles × 2 speakers), half for the male voice and half for the female voice.
Sentences were about 8 words long and well representative of the Vietnamese phonetic alphabet. Most of them were nonsensical and had no semantic emotional content, so they could not influence the actors by provoking any particular emotional attitude. During the recording sessions, the actors had to simulate each of the emotional styles in turn, and a director was always present to control their pronunciation and their prosody in order to avoid emphatic performances. Signals were recorded in a sound-proof room with a high-quality microphone and digital acquisition equipment, and waveforms were digitally acquired with the parameters specified in Table 1.
Table 1 Specifications of Voice Data
Sampling frequency    22050 Hz
Quantization          16 bit
4 Prosodic Feature Extraction and Analysis Results
In this section, we describe the prosodic feature extraction phase and the analysis results. The acoustic features investigated were fundamental frequency (F0), power, and duration. The F0 contour and the power envelope were calculated using STRAIGHT [5] with an FFT length of 1024 points and a frame rate of 1 ms. The sampling frequency was 22050 Hz. Time durations were specified manually with partial support from WaveSurfer [1]. A total of 9 acoustic parameters were calculated and analysed in order to find the relations between prosodic variations and emotional states in Vietnamese speech. Three parameters involve F0: highest pitch (HP), average pitch (AP), and pitch range (PR). Three involve the power envelope: power range (PWR), average power (APW), and maximum power (HPW). Three involve duration: total length (TL), consonant length (CL), and mean of pause lengths (MPAU). For each parameter and each emotional style, the mean value of the variation coefficients with respect to the baseline (neutral style) was calculated. These values are reported and discussed in the following subsections.
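The paper does not give an explicit formula for the variation coefficients. A minimal sketch, assuming each coefficient is the relative change of a parameter with respect to its neutral value for the same sentence, expressed in percent and then averaged over sentences, could look as follows. The function names and the sample values are hypothetical and are not taken from the database.

```python
from statistics import mean


def variation_percent(neutral_value, emotional_value):
    """Relative change of one parameter with respect to the neutral baseline, in percent.

    Assumption: the coefficient is (emotional - neutral) / neutral * 100.
    """
    if neutral_value == 0:
        raise ValueError("neutral baseline must be non-zero")
    return 100.0 * (emotional_value - neutral_value) / neutral_value


def mean_variation(neutral_values, emotional_values):
    """Mean variation over sentences; both lists are ordered by sentence."""
    if len(neutral_values) != len(emotional_values):
        raise ValueError("each sentence needs one neutral and one emotional value")
    return mean(
        variation_percent(n, e) for n, e in zip(neutral_values, emotional_values)
    )


# Hypothetical average-pitch values (Hz) for three sentences of one speaker.
neutral_ap = [200.0, 210.0, 190.0]
hot_angry_ap = [230.0, 245.0, 220.0]
print(f"AP variation: {mean_variation(neutral_ap, hot_angry_ap):.2f}%")
```

Pairing each emotional utterance with the neutral utterance of the same sentence keeps sentence-specific effects out of the coefficient; averaging the raw parameter values first and comparing the two means would be an equally plausible reading of the text.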
4.1 Fundamental Frequency - F0
For each utterance, the F0 information was first extracted using STRAIGHT [5]. From this information, several acoustic parameters related to the F0 contour were then measured: highest pitch (HP), average pitch (AP), and pitch range (PR). The mean variation values of these parameters for both the male and the female voice are reported in Table 2. The analysis results show that three of the four emotional styles, namely happiness, cold anger, and hot anger, had increased values with respect to the neutral case, for all parameters and for both speakers. Among them, the hot angry style had the largest variations in the F0 contour; all three F0 parameters had their largest increases in this style. In the sad style, on the other hand, the F0 parameters varied differently: AP and HP decreased while PR increased with respect to the neutral case, for both the male and the female voice. The results also differed between the two speakers' voices. For the female voice, the increases of all three parameters in the happy style were larger than those in the cold angry style. For the male voice, however, AP and HP increased more in the happy style than in the cold angry style, while PR increased less. Another difference was that, in the sad style, the decreases of AP and HP for the female voice were much smaller than those for the male voice; for the female voice these two parameters decreased almost imperceptibly. Conversely, the increase of PR in the sad style was considerably larger for the female voice than for the male voice. These differences arose because the speakers expressed emotions in different ways and with different intensities.

Table 2 Mean variations of F0 parameters for four emotional styles with respect to the neutral one.
          happy      sad        cold angry   hot angry
Male
HP        7.70%      -3.11%     6.00%        15.90%
AP        6.88%      -3.36%     5.34%        16.51%
PR        33.14%     19.10%     39.97%       41.90%
Female
HP        10.56%     -0.47%     7.63%        14.42%
AP        7.25%      -0.35%     5.66%        13.01%
PR        49.35%     31.26%     41.65%       56.89%
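The three F0 parameters of this subsection can be sketched in a few lines. This is only an illustration under stated assumptions: the F0 contour is taken to be a list of per-frame values in Hz in which unvoiced frames are stored as 0, and pitch range is assumed to be the difference between the highest and the lowest voiced F0. The paper does not define either detail, and the function name `f0_parameters` is hypothetical.

```python
def f0_parameters(f0_contour):
    """Compute HP, AP and PR from a per-frame F0 contour in Hz.

    Assumptions: frames with F0 <= 0 are unvoiced and are ignored;
    pitch range is max(F0) - min(F0) over the voiced frames.
    """
    voiced = [f for f in f0_contour if f > 0]
    if not voiced:
        raise ValueError("contour contains no voiced frames")
    highest = max(voiced)
    return {
        "HP": highest,                     # highest pitch
        "AP": sum(voiced) / len(voiced),   # average pitch
        "PR": highest - min(voiced),       # pitch range
    }


# Hypothetical contour: two unvoiced frames around three voiced ones.
print(f0_parameters([0.0, 100.0, 150.0, 200.0, 0.0]))
```

Excluding unvoiced frames matters mainly for AP and PR: if the zeros were kept, the average would be pulled down and the range would always equal the highest pitch.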
4.2 Power
The power envelope was measured in a way similar to the F0 contour. Power information was first extracted using STRAIGHT [5], and then the acoustic parameters related to the power envelope were calculated: maximum power (HPW), average power (APW), and power range (PWR). Table 3 presents the mean variation values of these parameters for both the male and the female voice. With respect to the neutral style, all three parameters increased in the happy, cold angry, and hot angry styles, while they decreased in the sad style. The high-activation styles were characterized by larger variation values, and significant power peaks sometimes also occurred in the final parts of the sentences. As with the F0 contour, there were some differences in variation values between the two speakers' voices. For example, for the male voice, APW and HPW increased more in the happy style than in the hot angry style. By contrast, for the female voice, these two parameters increased less in the happy style than in the hot angry style. The reasons for these differences are the same as those for the differences in the F0 analysis results.

Table 3 Mean variations of power parameters for four emotional styles with respect to the neutral one.
          happy      sad        cold angry   hot angry
Male
APW       20.81%     -16.38%    4.38%        11.48%
HPW       18.87%     -6.95%     8.53%        14.60%
PWR       8.38%      -6.42%     14.37%       20.45%
Female
APW       28.82%     -13.15%    38.82%       48.58%
HPW       14.84%     -8.95%     25.98%       34.31%
PWR       11.32%     -7.97%     18.11%       25.70%
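The comparisons drawn in this subsection can be re-checked directly from the values in Table 3. The sketch below copies those values into a dictionary and classifies each style by the sign of its variations. The data layout and the helper names `value` and `direction` are illustrative choices, not part of the original analysis.

```python
STYLES = ("happy", "sad", "cold angry", "hot angry")

# Mean variations in percent, copied from Table 3 (column order follows STYLES).
TABLE3 = {
    "male": {
        "APW": (20.81, -16.38, 4.38, 11.48),
        "HPW": (18.87, -6.95, 8.53, 14.60),
        "PWR": (8.38, -6.42, 14.37, 20.45),
    },
    "female": {
        "APW": (28.82, -13.15, 38.82, 48.58),
        "HPW": (14.84, -8.95, 25.98, 34.31),
        "PWR": (11.32, -7.97, 18.11, 25.70),
    },
}


def value(speaker, parameter, style):
    """Look up one cell of Table 3."""
    return TABLE3[speaker][parameter][STYLES.index(style)]


def direction(speaker, style):
    """Return 'increase', 'decrease' or 'mixed' for the three power parameters."""
    values = [value(speaker, p, style) for p in TABLE3[speaker]]
    if all(v > 0 for v in values):
        return "increase"
    if all(v < 0 for v in values):
        return "decrease"
    return "mixed"


for speaker in TABLE3:
    for style in STYLES:
        print(speaker, style, direction(speaker, style))
```

The same layout also makes the speaker-specific comparison easy to state: `value("male", "APW", "happy")` exceeds `value("male", "APW", "hot angry")`, while the order is reversed for the female voice.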
4.3 Time Duration
For each utterance, the time segmentation was first measured manually. The measurement included the phoneme number, the time (ms), and the vowel/consonant type. The durations of all phonemes (both consonants and vowels) and of pauses were specified manually with partial support from WaveSurfer [1]. Figure 1 illustrates an example of time segmentation. In the table shown there, the first row gives the phonemes; the second row gives the order of the phonemes, with -1 marking the position before the first phoneme; the third row gives the start time of the next phoneme; and the fourth row shows whether each phoneme is a vowel (1) or a consonant (2). Based on this table of time segmentation, the following duration parameters were measured: mean of pause lengths (MPAU), total length (TL), and consonant length (CL). The mean variation values of these parameters are reported in Table 4, for both the male and the female voice. Total utterance length increased in all four emotional styles, while consonant length increased in the sad style and decreased in the other styles. For the mean of pause lengths, there was a difference between the male and the female voice: in the cold angry style, this parameter increased for the male voice but decreased for the female voice. The reason for this difference is that the two speakers expressed the cold angry emotion in different ways.

Figure 1 An example of time segmentation.

Table 4 Mean variations of duration parameters for four emotional styles with respect to the neutral one.
          happy      sad        cold angry   hot angry
Male
MPAU      -26.57%    50.81%     57.83%       62.83%
CL        -16.25%    15.49%     -17.82%      -24.57%
TL        16.09%     19.62%     14.03%       19.26%
Female
MPAU      -18.01%    49.11%     -20.49%      53.03%
CL        -13.65%    13.29%     -15.80%      -23.44%
TL        11.65%     20.32%     11.47%       25.78%
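As a rough illustration of how the three duration parameters might be derived from a time segmentation, the sketch below uses a simplified, hypothetical segment layout of (label, start in ms, end in ms, kind) rather than the exact table format of Figure 1. It also assumes that TL spans from the first segment start to the last segment end, that CL is the summed duration of all consonants, and that MPAU is the mean duration of the pause segments; the paper does not spell out these definitions.

```python
def duration_parameters(segments):
    """Compute TL, CL and MPAU from a time-ordered list of segments.

    Each segment is (label, start_ms, end_ms, kind), with kind being
    "vowel", "consonant" or "pause". This layout is an assumption.
    """
    if not segments:
        raise ValueError("segmentation is empty")
    total_length = segments[-1][2] - segments[0][1]
    consonant_length = sum(
        end - start for _, start, end, kind in segments if kind == "consonant"
    )
    pauses = [end - start for _, start, end, kind in segments if kind == "pause"]
    mean_pause = sum(pauses) / len(pauses) if pauses else 0.0
    return {"TL": total_length, "CL": consonant_length, "MPAU": mean_pause}


# Hypothetical segmentation of a two-syllable utterance with one pause.
example = [
    ("b", 0, 50, "consonant"),
    ("a", 50, 200, "vowel"),
    ("", 200, 300, "pause"),
    ("m", 300, 360, "consonant"),
    ("i", 360, 500, "vowel"),
]
print(duration_parameters(example))
```

Returning 0.0 for MPAU when an utterance has no pause is a convenience of this sketch; such utterances would need separate handling before a relative variation against the neutral style is computed.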
5 Conclusion and future works
This paper has presented the results of some analyses of the prosody of Vietnamese emotional speech. The analyses were performed on an emotional speech database consisting of Vietnamese sentences uttered in five different styles. The relations between prosodic variations and emotional states in Vietnamese speech were obtained by investigating the variations of prosodic features of Vietnamese emotional speech in comparison with those of neutral speech. Acoustic parameters related to fundamental frequency, power, and duration were measured and analysed for all utterances in the database. Based on the analysis results, a set of prosodic variation coefficients was produced for each emotional style and for each speaker of the database. These coefficients will help bring emotions into Vietnamese synthesized speech, making it more natural. Further studies are necessary to find the relations between acoustic spectrum variations and emotional states in Vietnamese speech. In the future, we will carry out this work and then use the results to construct a Vietnamese emotional speech synthesis system.
6 Acknowledgement
This work is supported by the project Towards a Model of an "Intelligent Office Environment", No. QGTD.10.23.
References
[1] WaveSurfer. http://www.speech.kth.se/wavesurfer/index.html
[2] D. Erickson. Expressive speech: Production, perception and application to speech synthesis. Acoust. Sci. & Tech., 26:317–325, 2005.
[3] G. L. Huttar. Relations between prosodic variables and emotions in normal American English utterances. Journal of Speech and Hearing Research, 11:481–487.
[4] J. Ingram and T. Nguyen. Stress, tone and word prosody in Vietnamese compounds. Proceedings of the 11th Australian International Conference on Speech Science & Technology, pages 193–198, 2006.
[5] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné. Restructuring speech representations using a pitch adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Communication, 27:187–207, 1999.
[6] R. D. Kent and C. Read. Acoustic Analysis of Speech. San Diego: Singular Publishing Group, 1992.
[7] H. M. Le and K. H. Le. Analysis and synthesis for duration feature of Vietnamese. The 6th National Conference in Information Technology, Thainguyen, Vietnam, 2003.
[8] H. M. Le and T. N. Quach. Some results in phonetic analysis to Vietnamese text-to-speech synthesis based on rules. Journal on Information and Communication Technology, 2006.
[9] T. H. Le, A. V. Nguyen, V. H. Truong, V. H. Bui, and D. Le. A study on Vietnamese prosody. New Challenges for Intelligent Information and Database Systems, 351:63–73, 2011.
[10] L. Leinonen. Expression of emotional-motivational connotations with a one-word utterance. J. Acoust. Soc. Am., 102:1853–1863, 1997.
[11] D. K. Mac, E. Castelli, V. Aubergé, and A. Rilliard. How Vietnamese attitudes can be recognized and confused: Cross-cultural perception and speech prosody analysis. International Conference on Asian Language Processing, pages 220–223, 2011.
[12] K. Maekawa. Phonetic and phonological characteristics of paralinguistic information in spoken Japanese. Proc. Int. Conf. Spoken Language Processing, pages 635–638, 1998.
[13] M. D. Pell. Influence of emotion and focus location on prosody in matched statements and questions. J. Acoust. Soc. Am., 109:1668–1680, 2001.
[14] K. R. Scherer, R. Banse, H. G. Wallbott, and T. Goldbeck. Vocal cues in emotion encoding and decoding. Motivation and Emotion, 15:123–148, 1991.
[15] D. D. Tran, E. Castelli, J.-F. Serignat, and V. B. Le. Analysis and modeling of syllable duration for Vietnamese speech synthesis. O-COCOSDA, 2007.