MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Nguyen Thi Lan
GLOTTALIZED TONES IN VIETNAMESE EXPRESSIVE SPEECH
MASTER THESIS OF SCIENCE
INFORMATION TECHNOLOGY
SUPERVISOR:
Dr Tran Do Dat
Hanoi 2015
COMMITMENT
I commit myself to be the person who was responsible for conducting this study. All reference figures were extracted with clear derivation. The presented results are truthful and have not been published in any other person's work.
NGUYEN Thi Lan
ACKNOWLEDGEMENT
This is the second time that I sit here, at Hanoi University of Science and Technology, with a great honor to write these grateful words to the people who have been supporting me since the first moment I entered the university. The first acknowledgement was written in my graduation thesis 2.5 years ago, and today this one awakes a special emotion in me.
I wish to thank all my professors and colleagues at the School of Information and Communication Technology and MICA International Research Institute, who have helped me with generous support. Their advice and the knowledge they imparted to me are gratefully appreciated, inspiring me a lot to finish this thesis.
Special thanks to my supervisor Dr Tran Do Dat and colleagues of the Speech Communication Department, MICA Institute, including Dr Do Thi Ngoc Diep, Nguyen Thi Thu Trang, Nguyen Tuan Ninh, Tran Thi Anh Xuan, Dr Nguyen Viet Son, Dr Nguyen Cong Phuong, Nguyen Dục Anh and Nguyen Tie Thanh, for the advice and encouragement they gave me, and especially to Dr Mac Dang Khoa and Dr Alexis Michaud for their thorough review and invaluable suggestions.
Further thanks go to the two thesis reviewers, Assoc Prof Truong Ninh Thuan (VNU) and Dr Vu Thi Luong Giang (SOICT, HUST), for their valuable comments, which helped the thesis's presentation become much better.
Special thanks to my family and friends, who always stand by me, lifting me up when I was down. Without them, my life would be nonsense!
NGUYEN Thị Lan
CONTENTS
COMMITMENT
Chapter 3 ANALYSING VARIATION IN REALIZATION OF GLOTTALIZED
3.3 Proposals for a full-scale study and statistical [...] based on EGG
APPENDIX B: FIGURES OF AVERAGED F0 & Oq CONTOURS OF EACH SPEAKER WITH STANDARD DEVIATION FOR THE USED ATTITUDES
LIST OF ABBREVIATIONS
SFP: Sentence-final particle
EGG: Electroglottography
DEGG: The derivative of the electroglottography signal
IPA: International Phonetic Association
DECPA: Derivative-Electroglottographic Closure Peak Amplitude
X-SAMPA: The Extended Speech Assessment Methods Phonetic Alphabet
F0: Fundamental frequency
LIST OF TABLES
Table 1-3 Phonetic characteristics of Vietnamese initial consonants
Table 1-4 Phonetic characteristics of Vietnamese final consonants
Table 1-5 Phonetic characteristics of Vietnamese vowels/diphthongs
Table 1-6 Summarized description of 8 tones of Vietnamese
Table 2-1 Intended attitudes
Table 3-1 Statistics of Mechanism I-A/Pressed Voice/Mechanism I-B of tone 6a
Table 3-2 Statistics of Mechanism I-A/Creaky Voice/Mechanism I-B of tone 3 with
Table 3-3 Statistics of Mechanism I-A/Pressed Voice/Mechanism I-B of tone 3 with
LIST OF FIGURES
Figure 1-1 Schematic diagram of Hanoi Vietnamese tones (Michaud, 2004a)
Figure 2-1 Speaker F7 (left) and M10 (right) in the recording booth
Figure 2-2 Sentence and Syllable Level Annotation with SoundForge (above)
Figure 3-1 Visualization of closing instant synchronized with EGG (above) and DEGG (below) signals (Henrich, 2001)
Figure 3-2 Visualization of opening instant synchronized with EGG (above) and DEGG (below) signals (Henrich, 2001)
Figure 3-3 Example of EGG and DEGG signals with indication of glottis closure
Figure 3-4 Two realizations of glottalization on SFP /a6a/ with two attitudes (a)
Figure 3-5 Average curves of F0 and Oq over 6 tokens of /a6a/, speaker M7
Figure 3-6 Two realizations of glottalization on SFP /da3/ of two attitudes, (a): declarative/neutral; (b): irritation. Speaker M6
Figure 3-7 Average curves of F0 and Oq over 6 tokens of /da3/, speaker M6
Figure 3-8 Determining mechanisms of voice based on DECPA and F0 parameters (each point of F0&Oq contour corresponds with a cycle on DEGG signal)
Figure 3-9 Determining the duration of pressed voice based on local dipping of Oq (each point of F0&Oq contour corresponds with a cycle on DEGG signal)
Figure 3-10 The tool for detection integrated three analysis modules
Figure 3-11 Some visually illustrative figures of creaky voice from the detection
Figure 3-1 Averaged Oq contours of SFP /a6a/ of 10 male speakers: surprise (left)
Figure 3-15 Averaged F0 contours of SFP /da3/ of 10 male speakers: irritation and
Figure 3-16 Averaged Oq contours of SFP /da3/ of 10 male speakers: declaration
Figure 3-17 Averaged Oq contours of SFP /da3/ of 10 male speakers: irritation
Figure 3-18 Proposed model for combination of speaker attitude, voice quality and glottalized tone in Vietnamese expressive speech processing
INTRODUCTION
Nowadays, using speech in human-machine interaction is gradually becoming the major trend, which promises to replace traditional communication methods such as the mouse, keyboard and screen. However, a high-quality human-machine interaction system that can completely behave as a human being is currently still beyond our reach. One of the primary reasons is the lack of advanced techniques that enable precise processing (either synthesis or recognition) of the expression of human utterances.
Expression, in other words, refers to the attitudinal or emotional aspects of someone's speech, which can convey much linguistic information. In this perspective, the attitudinal aspects of speaker utterances, also called speaker attitudes, are of no small importance. If speaker attitudes play such an important role in the interactions between humans, they need to be taken into account in the interaction between humans and machines (Picard, 1997). Attitudinal information in a spoken utterance can be lexically encoded but can also be conveyed by intonation, including modifications of voice quality (Seibert, 2003).
However, the modification of those features in Vietnamese is quite complex because of the interplay between intonation and tones; the complexity increases further when dealing with the glottalized tones, which are tone ngã and tone nặng. Furthermore, in expressive speech, how this interplay is expressed, what its realization will be, and with which mechanisms, are several among the many questions set out.
Among the eight tones in Vietnamese, tone ngã and tone nặng are considered the most complicated since they are accompanied by a glottalization phenomenon. In most cases, with simpler tones, the interaction between intonation and tone can simply be described by changes in fundamental frequency, intensity or duration parameters, whereas with these two glottalized tones, these parameters are not sufficient, since their glottalization can vary a lot depending on context. Many studies have tried to approach this, but they seem to avoid the most complicated aspect, which is the glottalization phenomenon in Vietnamese.
Therefore, towards application in Vietnamese speech processing, the ultimate objective is to provide sufficient detail of the interplay between glottalized tones and intonation for both automatic speech recognition systems and text-to-speech systems in encoding and decoding attitudinal information in speakers' utterances.
Specific contents of the thesis are as follows.
Chapter 1 presents an overview of phonetics and phonology, tone and the expression of attitudes in Vietnamese, as well as existing issues that need to be dealt with and the thesis's approach.
Chapters 2 and 3 show the proposed methods for data acquisition and analysis, which was based on EGG and DEGG signals, in order to clarify the interaction mechanism between glottalized tones and expressive speech intonation.
Finally, Chapter 4 gives some conclusions and perspectives for expanding the study to cover a wider range of speaker attitudes and more tones in Vietnamese.
The obtained results include:
✓ Thesis Report
✓ Attitudinal Corpus: recorded with 10 males and 10 females
✓ Method and tool for detection and quantification of Creaky and Pressed Voice in Surprise/Irritation/Declaration Attitude
✓ 1 International Conference Paper: INTERSPEECH 2013
✓ 1 National Journal Paper: Journal of Science & Technology of Technical Universities in Vietnam, 101 (2014)
Chapter 1 OVERVIEW
Similar to any other language, Vietnamese has a rich system of consonants and vowels together with various rules for forming meaningful words. However, one of the special characteristics which make it even more attractive in the eyes of researchers is that it has a complex lexical tone system. So, why is it evaluated to be complex, and why was the study of its tone system chosen as the major point of the thesis? Furthermore, the author also conducted research on expressive speech and emphasizes that the relationship between tonal realization and attitudinal expression in Vietnamese should be taken seriously; is this a unique point that distinguishes Vietnamese from other languages? In this part, a brief introduction will be presented to give a clear view of Vietnamese phonetics and phonology. Additionally, the section on raising issues will clarify the questions above as well as our interests.
1.1 Background knowledge
1.1.1 Vietnamese phonetics and phonology
There have been many works studying the Vietnamese phonological system over the years, such as (Doan, 1977), (Nguyen, Edmondson & Jerold, 1998), (Hwa-Froelich, Hodson, & Edwards, 2002), (Nguyen, Carre, & Castelli, 2008), (Michaud & André-Georges, 2010) and (Hajek, 2008). Among these, there exist different concepts in establishing the Vietnamese phonological system, but in general, the list of consonants and vowels in Vietnamese can be summarized respectively as in Table 1-1 and Table 1-2, in both the IPA-symbol system and the X-SAMPA-symbol system (Doan, 1977).
Where:
Ail if initial followed by consonant, 8 or nothing
2: final only for this phoneme
3: final except after u, o, ô
4: ngh- initial only (before i, e, ê); ng- initial except before i, e, ê
5: gh- initial before i, e, ê; g- initial except before i, e, ê
6: initial except before i, e, ê, y; final after u, o, ô
Table 1-2 Vietnamese vowels/diphthongs
or several phonemes just follow certain vowels. Besides, there are only 9 long vowels, 4 short vowels and 3 diphthongs, which are combinations of single vowels.
Table 1-3 and Table 1-4 describe the phonetic characteristics of these consonants. In these tables, the format to represent phonemes is "IPA-symbol (X-SAMPA-symbol)", where the (X-SAMPA-symbol) part is omitted if it is the same as the IPA-symbol. For the two variants of /ŋ/ and /k/ as final consonants after /u ɔ o/, /ŋm/ is a labial-velar nasal while /kp/ is a voiceless labial-velar plosive (Hajek, 2008) (Doan, 1977).
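The "IPA-symbol (X-SAMPA-symbol)" labelling convention described above can be sketched in a few lines of Python. This is only an illustration of the convention, not part of the thesis tooling: the function name `phoneme_label` and the example pairs are assumptions, chosen from standard IPA/X-SAMPA correspondences.

```python
def phoneme_label(ipa: str, xsampa: str) -> str:
    """Format a phoneme as 'IPA (X-SAMPA)'.

    The X-SAMPA part is omitted when it is identical to the IPA symbol,
    following the convention stated for Table 1-3 and Table 1-4.
    """
    if ipa == xsampa:
        return ipa
    return f"{ipa} ({xsampa})"


# Illustrative pairs only (hypothetical selection, standard correspondences).
EXAMPLES = [("t", "t"), ("ŋ", "N"), ("ɲ", "J"), ("ɔ", "O")]

if __name__ == "__main__":
    for ipa, xsampa in EXAMPLES:
        print(phoneme_label(ipa, xsampa))
```

Keeping the two symbol sets side by side in one label makes a table readable for phoneticians while remaining usable as plain ASCII in annotation files.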
Table 1-3 Phonetic characteristics of Vietnamese initial consonants
Green bold consonants: do not exist in the Northern dialect. Besides, for this dialect:
- ch- /c/ and tr- /ʈ/ are pronounced alike
- d-, gi- /z/ and r- /ʐ/ are pronounced alike
- x- /s/ and s- /ʂ/ are pronounced alike
Table 1-4 Phonetic characteristics of Vietnamese final consonants
Table 1-5 presents the phonetic characteristics of the 16 vowels and diphthongs in Vietnamese. Similar to other languages, they are distinguished from each other based on which part of the tongue is involved (front, central, back) and how high the tongue is when the sound is produced (high, mid, low).
Table 1-5 Phonetic characteristics of Vietnamese vowels/diphthongs
Above is a brief introduction to Vietnamese phonetics and phonology; the next section will present one of the problems that is always a challenge to anyone who wants to approach Vietnamese: the Vietnamese tone system.
1.1.2 The phonetic characteristics of complex lexical tone system in
Vietnamese
Vietnamese is a tonal language; that is, the meaning of each word depends on the "tone" in which it is pronounced. Many other languages also use tones, such as Mandarin and Thai. However, it can be said that the Vietnamese tone system is relatively complex in comparison with the others, since it has a six-tone paradigm for sonorant-final syllables and a two-tone paradigm for obstruent-final syllables (Michaud, 2004a). The experiment in (Michaud, 2004a) warrants the conclusion that the rising (5b) and drop (6b) tones (i.e. the tones of syllables ending in /p/, /t/ or /k/, the checked syllables) are not glottalized, either in final or non-final position. Therefore, it could be said that there are 8 different tones in the Vietnamese language. The work on oral flow (Michaud, Vu, Angelique, & Bernard, 2006) brings out a clear difference between these two sets of rhymes: tone 6a (drop tone in unchecked syllables) has low oral airflow; tones 5b and 6b have relatively high oral airflow, getting close to the range of breathy voice.
Table 1-6 Summarized description of 8 tones of Vietnamese
2 Huyền Falling Low Slightly Falling a Laxness, breathy
Specifically, a phonetically detailed description of each tone, summarized from (Thompson, 1987), (Mixdorff, Nguyen, Fujisaki, & Luong, 2003), (Nguyễn, 1997) and (Michaud, 2004a), is as follows:
Tone 1, the level tone ("ngang"), is modal and sometimes lax, and its contour is nearly level in non-final syllables not accompanied by heavy stress, although even in these cases it probably trails downward slightly.
Tone 2, the falling tone ("huyền"), is lax, starts quite low and trails downward toward the bottom of the voice range. It is often accompanied by a kind of breathy voicing (voiceless + modal), reminiscent of a sigh. For some speakers it is even lax to the point of breathiness, with somewhat lowered subglottal air pressure.
Tone 3, the broken tone ("ngã"), is also high and rising, the F0 contour being similar to that of tone 5, but it is accompanied by a rasping voice quality (strong creaky voice starting toward the middle of the vowel, which then lessens as the end of the syllable is approached) occasioned by tense glottal stricture. In careful speech such syllables are sometimes interrupted completely by a glottal stop (or a rapid series of glottal stops). Its trajectory therefore sometimes shows a characteristic break in the voicing at about half of the total duration of the syllable. Many speakers begin the vowel with modal voice, followed by strong creaky voice starting toward the middle of the vowel.
Tone 4, the curve tone ("hỏi"), is tense and drops rather abruptly. It starts with modal voice phonation, which moves increasingly toward tense voice with accompanying harsh voice (although the harsh voice seems to vary according to speaker). In final syllables, and especially in citation forms, this is followed by a sweeping rise at the end, and for this reason it is often called the 'dipping' tone. However, non-final syllables seem only to have a brief level portion at the end, and this is exceedingly elusive in rapid speech. Although tone 4 is usually described as a low falling and then rising tone, not all Vietnamese speakers have the rising part. Curve and broken tones are both tense, but their tension is not alike and is not distributed across the syllable in the same way.
Tone 5a, the rising tone ("sắc"), is high and rising (perhaps nearly level in rapid speech) and tense. Phonetically, tone 5a is produced with modal voice.
Tone 6a, the drop tone ("nặng"), is also tense: it starts somewhat lower than tone 4. Syllables bearing tone 6a have the same rasping voice quality as tone 3, drop very sharply and are almost immediately cut off by a strong glottal stop. Tone 6a is much shorter than other tones, with a tendency to go lower.
As for tones 5b and 6b, the orthography identifies tone 5b with tone 5a as sắc and tone 6b with tone 6a as nặng, which are the names that the tones carry in present-day Vietnamese orthography. However, tones 5b and 6b are not glottalized, either in final or non-final position (Michaud, 2004a). Tone 6a is characterized by a gesture of strong constriction that is distinct from creaky voice; tone 6b drops more sharply than tone 2, but it is never accompanied by the
Figure 1-1 Schematic diagram of Hanoi Vietnamese tones (Michaud, 2004a)
This section has shown the features of Vietnamese tones that need to be taken into account when approaching the language. The next section will discuss expressive speech across languages in general.
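The eight-tone inventory described in this section can be summarized as a small data structure. The sketch below is only a reading aid under stated assumptions: the class name `Tone` and its field names are hypothetical, and the `glottalized` flag records only what the text states explicitly (tones 3 and 6a are glottalized; tones 5b and 6b are not).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tone:
    label: str         # numbering used in this chapter
    name: str          # orthographic tone name
    checked: bool      # True for syllables ending in /p/, /t/ or /k/
    glottalized: bool  # True only for tones described as glottalized


TONES = [
    Tone("1", "ngang", False, False),
    Tone("2", "huyền", False, False),
    Tone("3", "ngã", False, True),
    Tone("4", "hỏi", False, False),
    Tone("5a", "sắc", False, False),
    Tone("6a", "nặng", False, True),
    Tone("5b", "sắc", True, False),   # same orthographic name as 5a
    Tone("6b", "nặng", True, False),  # same orthographic name as 6a
]

unchecked = [t.label for t in TONES if not t.checked]
checked = [t.label for t in TONES if t.checked]
glottalized = [t.label for t in TONES if t.glottalized]

if __name__ == "__main__":
    print("sonorant-final paradigm:", unchecked)
    print("obstruent-final paradigm:", checked)
    print("glottalized tones:", glottalized)
```

Separating the orthographic name from the tone label makes explicit that the orthography has only six names, while the analysis distinguishes eight tones.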
1.2 Glottalized tones in the context of expressive speech: raising issues
Glottalization is a challenge for speech processing: it disrupts F0 estimation (making it unclear how F0 should be measured), which raises problems for averaging and for building a model. Specifically, most current models of speech synthesis and recognition systems do not take the control of glottalization into account due to its complexity.
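To make the F0 estimation problem concrete, the following minimal sketch computes one F0 value per glottal cycle from a list of glottal closure instants and compares a regular (modal-like) sequence with an irregular (creaky-like) one. All numbers are made-up illustrative values, not measurements from this study, and the function names `cycle_f0` and `relative_spread` are hypothetical.

```python
from statistics import mean, pstdev


def cycle_f0(closure_instants):
    """Per-cycle F0 in Hz from successive glottal closure instants (seconds)."""
    return [1.0 / (t1 - t0)
            for t0, t1 in zip(closure_instants, closure_instants[1:])]


def relative_spread(values):
    """Population standard deviation divided by the mean (dimensionless)."""
    return pstdev(values) / mean(values)


# Made-up closure instants: regular 8 ms cycles vs. irregular cycles.
modal_instants = [0.000, 0.008, 0.016, 0.024, 0.032]
creaky_instants = [0.000, 0.008, 0.025, 0.031, 0.052]

modal_f0 = cycle_f0(modal_instants)
creaky_f0 = cycle_f0(creaky_instants)

if __name__ == "__main__":
    print("modal-like F0 per cycle:", [round(f, 1) for f in modal_f0])
    print("creaky-like F0 per cycle:", [round(f, 1) for f in creaky_f0])
    print("relative spread (modal-like):", round(relative_spread(modal_f0), 3))
    print("relative spread (creaky-like):", round(relative_spread(creaky_f0), 3))
```

When cycle durations are irregular, the per-cycle values jump widely from one cycle to the next, so a single averaged F0 value says little about the token; this is the sense in which glottalized stretches are hard to average or model with F0 alone.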
In languages such as English, the issue may appear secondary, as glottalization is not phonological in the standard variety. Glottalization is a characteristic of certain sociolects: creak in "drawl"; 'glottaling' of /t/, which is becoming increasingly common in familiar speech, used to be stigmatized as "working-class"/vulgar (Fabricius & Anne, 2002). Among national languages of Europe, only Danish possesses phonological glottalization (stød) (Fischer-Jørgensen & Eli, 1989). There exist languages in which glottalization is controlled in greater phonological detail, for instance languages of the Mon-Khmer family, but these languages are relatively less well-studied, and given the present state of the documentation, the study of the fine phonetic detail of these phenomena in discourse is seldom perceived as a priority by linguists (DiCanio & Christian, 2009).
Hanoi Vietnamese has a key role to play here: it has extremely rich glottalization phenomena; and as the official standard of a country with about 90 million inhabitants, it receives increasing attention from specialists of speech technology. A salient aspect of the Hanoi Vietnamese tone system is the use of phonation-type characteristics (Nguyen et al., 1998) (Brunelle, Nguyen, & Nguyen, 2010) (Kirby, 2010) (Brunelle, 2009a), absent from other dialects (Tran, 1969). Hanoi Vietnamese makes use of glottalization as part of the lexical specification of some of its lexical tones. In particular, tones 6a and 3 are glottalized. Tone 3 (also referred to by its orthographic label, ngã, or the English descriptor 'broken tone') is a rising tone with a strong glottalization in its first half. Tone 6a (orthographic nặng, 'drop tone') starts on a middle pitch and usually falls dramatically because of a strong glottalization in its second half. It has been reported that glottal constriction for tone 6a is consistently present both in a 'neutral' context and in an 'emphatic/impatient' context (Michaud & Vu, 2004).
Glottalization in Vietnamese is not only a distinctive characteristic of tone: fine details in its phonetic realization can convey intonational information. Vietnamese has salient intonational phenomena (Tran & Castelli, 2008). The surface realization of tones depends greatly on intonation: phrasing, prominence, and the expression of attitudes and emotions. Therefore, it appeared worthwhile to investigate how speaker attitude affects the realization of glottalization, a phonetic dimension which is cross-linguistically known to convey "paralinguistic" information (Fonagy, 1983) (Gobl & Ni Chasaide, 2003). Specifically, the research issue is: how fine-grained details in the phonetic realization of glottalized tones convey attitudinal information in Vietnamese expressive speech.
This is a challenge for speech processing: models such as Fujisaki's (Mixdorff et al., 2003), which focus exclusively on F0, would require substantial additions before they can handle such phenomena. New-generation speech processing for Vietnamese will require facing the challenge of synthesis/fine tuning of phonation types.
1.3 The scope of the thesis
In view of the context set out above, the goal of the present study is to investigate the phonetic characteristics of glottalized tones in Vietnamese expressive speech, focusing on sentence-final particles. Due to limitations of the present study, applications in speech processing will not be attempted. The aim of the present study is to provide a sufficiently detailed analysis of production data to pave the way for fresh work on the synthesis and recognition of attitudes in Vietnamese in future.
More precisely, we concentrate on studying tone 3 and tone 6a with three attitudes: Declaration, Surprise and Irritation, since they are the most clearly perceived (Mac, 2009). The objective is to answer the question of how these attitudes can change the realization of glottalization on these two tones and the use of its special voice qualities. Even so, the process of building the speech corpus will not be limited to these objects only, so that it can serve further research as well.
1.4 Conclusion
This chapter has presented an overview of phonetics and phonology as well as the phonetic characteristics of the lexical tone system in Vietnamese. After that, the existing issues and the author's interests in glottalized tones and expressive speech were given as the main point of the thesis. In the next chapter, the author proposes an approach using expressive morphemes called sentence-final particles as the objects for studying glottalization in the interaction between lexical tone function and attitudinal function. That chapter will present the construction of our corpus for this research.
Chapter 2 BUILDING VIETNAMESE ATTITUDINAL SPEECH CORPUS FOR SENTENCE-FINAL PARTICLES
As discussed in the last chapter, this chapter will focus on the construction of the speech corpus which serves the investigation of the interplay between glottalized tones and attitudinal expression in Vietnamese. Several special SFPs which carry both lexical tones and attitudinal information were used to construct target sentences which concentrate on basic speaker attitudes and glottalized tones.
There already exists a corpus designed for the study of social attitudes in Vietnamese (Mac et al., 2009), but it does not contain SFPs. We therefore decided to record new data. Speech data acquisition is an underestimated challenge (Niebuhr and Michaud), especially when attempting to capture such elusive aspects of speech as attitudinal information. Special attention was therefore paid to the elaboration of materials and recording procedures.
In particular, the research was divided into two phases and, corresponding to these two phases, two different corpora were built. The first phase conducted a pilot study with a small corpus and four speakers to initially explore hypotheses on SFPs, glottalized tones and speaker attitudes. After that, the second phase, with a larger corpus recorded with 20 speakers, expanded on the results obtained in the pilot study. Specifically, in the scope of the thesis, we aimed to demonstrate the qualitative observation results by concentrating on analyses of tone 3, tone 6a, the three studied attitudes and male speakers; the rest of the built corpus was reserved for further research. This chapter will present both of these corpora.
2.1 Method of using expressive morphemes carrying lexical tones —
Sentence-final particles
Languages differ in the means that they offer for the expression of attitudes and emotions. In English, intonation is known to fulfill a considerable range of functions, including subtle nuances related to attitudes and emotions. Japanese and Cantonese are famous examples of languages that possess morphemes which have been described as performing functions that intonation does in a language such as English (Chan & Marjorie, 1999). For instance, in Cantonese, the particle (aFl? is used as an illustration. This particle is suffixed to a declarative sentence to convert the sentence into a question of disbelief or surprise (Wu 2008, p 24) or a "query to the truth of something" (Kwok 1984, p 88).
The particles specifically called sentence-final particles (hereafter SFPs) constitute a marginal class of expressive words indicating speech act types, evidential/epistemic nuances, and affective/emotional colouring. There are about ten SFPs in Mandarin, thirty in Cantonese (Kwok & Helen, 1984), and about the same number in Vietnamese (Tran, 2010); SFPs are ubiquitous in casual, conversational speech. SFPs "often carry much of the meaning and function that intonation does in non-tone languages" (Chan & Marjorie, 1998); the relationship is not simply one of functional equivalence between intonation and SFPs, however, since SFPs also carry intonational information: sentence-level intonational phenomena are known to cluster on SFPs. One and the same SFP can take on different nuances (creating different sense-effects) depending on the intonational realization of the SFP itself (the 'tune' that it carries) and of the sentence as a whole.
In Vietnamese, where they clearly have a tone of their own, SFPs provide an exemplary illustration of the superposition of tone and intonation. An important proportion of sentence-level intonation, conveying sentence mode and attitudes, is concentrated at the end of the utterance, on the SFP(s) (Do, Tran, & Georges, 1998). This superposition affects F0 (Nguyen & Tran, 2012), but also phonation types. The purpose of the present study is to investigate how speaker attitude affects the realization of glottalization for the two glottalized tones 6a and 3 (orthographic nặng and ngã) carried by SFPs. A pilot study (Nguyen, Michaud, Tran, & Mac, 2013) suggests that glottalization is phased earlier for surprise than for declaration, and that irritation also tends to be reflected in earlier glottalization, but with an added glottal stop/constriction at the end of the SFP. The present study builds on a more extensive empirical basis, relying on materials that have been constructed to be accompanied by a wider range of attitudes in Vietnamese.
2.2 Designing sample corpus
Due to the great amount of carry-over tonal co-articulation in Vietnamese (Brunelle, 2009b), the tones of SFPs are strongly influenced by those of preceding syllables (Nguyen & Tran, 2012). The study therefore aimed to devise a sentence using only syllables carrying tone 1 (ngang), which is phonetically the simplest: a level, non-low tone. This resulted in sentence (1), used in our pilot study (Nguyen et al., 2013):
(1) Lam lên công ty
proper_name to_go_up workplace/company
This sentence was then associated with the SFPs ạ [IPA: /a6a/], conveying politeness, and đã [IPA: /da3/], conveying tense-aspect-modality information. This yields (2) Lam lên công ty ạ and (3) Lam lên công ty đã. Finally, sentences (1-3) were placed inside dialogues, which were precisely contextualized. The attitudes under study are (1) politeness, associated lexically to the SFP ạ, and (2) declaration, irritation, and surprise, elicited by context. The general context is as follows: Lam, Minh and An are three friends who have just moved into a shared flat; today is Saturday, a day when they neither go to class nor to their workplace; but Lam is suddenly requested to go to work for urgent business.
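The design constraint on the carrier sentence (only tone 1 syllables before the SFP) can be checked mechanically from the orthography, because Vietnamese spelling marks every tone except ngang with a diacritic. The sketch below is an illustration under stated assumptions, not a tool used in the study: the function names are hypothetical, tone names are written without diacritics for simplicity, and the sentences are given in the orthography reconstructed from the text. It relies on Unicode canonical decomposition (NFD), which separates tone marks from vowel-quality marks such as the circumflex.

```python
import unicodedata

# Combining marks that encode the five marked tones in Vietnamese orthography.
TONE_MARKS = {
    "\u0300": "huyen",  # combining grave accent
    "\u0309": "hoi",    # combining hook above
    "\u0303": "nga",    # combining tilde
    "\u0301": "sac",    # combining acute accent
    "\u0323": "nang",   # combining dot below
}


def orthographic_tone(syllable: str) -> str:
    """Return the orthographic tone name of one written syllable."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "ngang"  # no tone mark: level tone (tone 1)


def all_ngang(sentence: str) -> bool:
    """True if every syllable of the sentence is written with tone 1."""
    return all(orthographic_tone(s) == "ngang" for s in sentence.split())


if __name__ == "__main__":
    for sentence in ["Lam lên công ty", "Lam lên công ty ạ", "Lam lên công ty đã"]:
        tones = [orthographic_tone(s) for s in sentence.split()]
        print(sentence, "->", tones, "| carrier all tone 1:", all_ngang(sentence))
```

Note that the orthography cannot distinguish tone 5a from 5b or tone 6a from 6b; that distinction depends on whether the syllable ends in /p/, /t/ or /k/.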
However, it turned out that the SFP ạ [IPA: /a6a/] sometimes tended to coalesce with the preceding syllable, ty ([IPA: /ti1/]), in carrier sentence (1). In hyper-articulated speech, the SFP /a6a/ begins with a glottal onset (empty-onset filler), which sets it off from what precedes. However, the onset of this syllable is one of the parameters that strongly varies depending on context, including cases where there is no detectable initial glottalization on the acoustic signal, resulting in segmentation difficulties. This detracts from the precision of measurements.
As a consequence, a slightly different set of materials was devised for the full-scale study. The details are set out below.
In the target sentence, a given name was chosen as grammatical subject. Among the wealth of Vietnamese given names, Ba, meaning 'three', i.e. 'third child', was chosen for two main reasons: first, its vowel /a/ allows phonetic comparison of the vowel /a/ in Ba with that in ạ and đã; second, the phonetic simplicity of its tone: tone 1, ngang, a non-low tone that is relatively level.
Table 2-1 presents the speech materials. Labels for the intended attitudes follow the terminology proposed by (Mac, Aubergé, Rilliard, & Castelli, 2009), which distinguishes 16 attitudes, and which treats sentence mode (declarative, interrogative or imperative) as part of speaker attitude.
Table 2-1 Intended attitudes
Where: 'Sentence' = contextualized sentence, DEC = Declaration, INT = Interrogation, SUR = Surprise, OBV = Obviousness, IRR = Irritation, POL = Politeness, AUT = Authority, SAR = Sarcastic Irony. Slots in grey indicate combinations judged implausible. Politeness (POL) is conveyed semantically by the SFP /a6a/.
The data are not fully symmetrical because of the semantics of SFPs. Attitudes of sarcastic irony and authority are antagonistic with the respect (acknowledgment of the addressee's seniority) conveyed by the SFP /a6a/; likewise, surprise and interrogation are antagonistic with the assertiveness conveyed by the SFP /da3/, hence four empty slots in Table 2-1.
Besides, two other SFPs were also used because we wanted to demonstrate that the final glottalization of /a6a/ and the medial glottalization of /da3/ are due to their lexical tone, and not to intonational factors; we confirmed this point by using SFPs with tones that do not involve glottal constriction. In order to cover the same range of attitudes as for the SFPs /a6a/ and /da3/, two different SFPs had to be used: hả, carrying tone 4, is compatible with the expression of interrogation and surprise; and mà, carrying tone 2, was recorded with the other four attitudes.
Four specific target sentences are as follows:
1. Ba đi học ạ
2. Ba đi học đã
3. Ba đi học mà
4. Ba đi học hả
After that, the target sentence without SFP, accompanied by declarative attitude, and the 4 sentences accompanied by their respective attitudes as indicated in Table 2-1 were set in 17 specifically suitable contexts to facilitate recording with speakers and to ensure the naturalness of utterances. For example, context number 2, with SFP ạ and interrogative attitude: Ba đi học ạ? (Does Ba go to school?), was expressed in the context where a classmate of Ba's comes and asks one of Ba's older relatives just to get some extra information. So, in that case, the classmate should show their respect and politeness. Context number 3, with SFP ạ and surprise, intends the situation when another classmate of Ba's comes to meet Ba because he thought that Ba was still staying at home. Then, the answer from Ba's brother or sister, which asserted that Ba had gone to school, brought him a big surprise.
Another example, concerning contexts number 8 and 10, with irritation and sarcastic irony respectively, can be given to illustrate. Context number 8 expresses a situation in which Ba is dragged away by some bad guys while he needs to go to school right away; an assertion together with irritation may then be the best choice to show his strong refusal. In context number 10, Ba tells his roommate that he needs to go to school immediately, perhaps for some English classes, so the discussion between him and the roommate should be temporarily stopped. After that, the thought that Ba could not be such a hard-working student who goes to school even at the weekend leads the roommate to tease Ba by repeating his utterance with a tone of sarcastic irony.
The above instances illustrate our context-based method of data collection; 17 suitable contexts were created to fit the selected attitudes and SFPs (see Appendix A).
2.3 The progress of building the sample corpus
2.3.1 Elicitation method and speakers
In the pilot study, two different approaches to data collection were used. The first aimed at maximal ecological validity, eliciting the intended attitude through contextualization, from two speakers who were unaware of the purpose of the study. (Their speaker codes, assigned as part of a larger database, are M4 and M5, respectively.) The second aimed at maximal clarity in contrasting different attitudes: two speech scientists (M6 and M7) who were aware of the purpose of the study deliberately expressed the intended attitude as identified by the labels in Table 2-1. M4 and M5 are aged 24; they have university education in software engineering. M6 and M7 are aged 26 and 31, respectively. They were born in Hanoi, and are permanent residents there, apart from a total of 2 years in France for M7. All four can speak some English, and M7 is also fluent in French.
For the full-scale study, it was possible to recruit a sizeable group of speakers (10 female and 10 male) from the Hanoi Academy of Theatre and Cinema, where well-known Vietnamese actors are trained, which ensured consistency in age group. Moreover, they are all from the Department of Spoken Theatre, so their ability to express different attitudes orally is their main asset. No group of speakers is ‘ideal’, and this choice raises concerns of naturalness: there is a risk that the speakers are reproducing stereotyped patterns for the expression of attitudes, that is, patterns designed for the stage that do not correspond to patterns found in ordinary speech. Great care was taken to verify perceptually that the intended attitudes were recognized by persons outside the narrow circle of the performing arts, using science and technology students as subjects for the perception tests. The information concerning the speakers is summarized below.
Table 2-2 List of speakers
15 | F10 | Female | 21 | Quang Ninh | English
16 | F11 | Female | 21 | Hai Duong | English
18 | F13 | Female | 19 | Hanoi | English
20 | F15 | Female | 22 | Hanoi | English
2.3.2 Recording conditions
The recordings were conducted in the MICA Institute's sound-treated booth. The participants received information about electroglottography (EGG) and its full innocuousness (Fabre, 1957; Baken, 1992). Then they were given time to familiarize themselves with the scripts of the dialogues. Questions were answered through discussion of the context. After this, the speakers read the dialogues three times, then swapped roles and read them another three times. They were instructed to read ‘like actors’, an indirect way to elicit a vivid, expressive dialogue.
Figure 2-1 Speaker F7 (left) and M10 (right) in the recording booth
2.3.3 Post-processing and annotation
The recording time for each speaker was between 25 and 35 minutes; in total, 10 hours of speech from 20 speakers were collected. The electroglottographic signal from an EG2-PC (for one of the speakers) and the audio (from one microphone for each speaker) were recorded as three synchronized WAV files (44,100 Hz, 24-bit). After recording, among the repeated samples of each target sentence for each speaker, the 6 best samples, in which the speaker produced the most natural voice, were extracted and annotated in SoundForge (sentence level) and in Praat (syllable level), as shown in Figure 2-2.
Figure 2-2 Sentence and Syllable Level Annotation with SoundForge (above) and Praat (below) of the corpus
The recordings made for the pilot study are available online as part of the MICA Institute's AuCo project; long-term archiving and online availability are guaranteed through the Pangloss Collection (see http://lacito.vjf.cnrs.fr/archivage/languages/Vietnamese_en.htm). We plan to make the materials of the full study available online in future, ideally by the time that the results of the full study are published.
2.4 Conclusion
This chapter has presented the whole process of building the corpus, with a detailed explanation of the methods used for choosing the sample material and the group of speakers, as well as the overall recording procedure. To the best of our knowledge, this is the first expressive speech corpus that can be used to study the variation of glottalization of tones with speaker attitudes in Vietnamese. In addition, the 20 speakers are truly professional, with a good ability to convey attitudinal information.
Furthermore, we did not limit ourselves to building a small corpus that covers only the tones ngã and nặng with 3 attitudes (Surprise, Declaration and Irritation), as required by the objectives we aimed for. Instead, to make the most of the professional speakers, an additional sample corpus that extends to various tones and attitudes was built as well. In this way, our corpus can be used for further research on other aspects of Vietnamese expressive speech processing, including cross-gender studies, since it has an equal number of male and female speakers.
Within the scope of the thesis, a sufficient part of the corpus was exploited to introduce a new approach to this issue. The next chapter presents the observations and illustrative figures resulting from thorough analyses; in particular, the discussion at the end of the chapter provides insight into how a model can be generated from these analysis results.
Chapter 3 ANALYSING VARIATION IN REALIZATION OF GLOTTALIZED TONES BY VARIOUS ATTITUDES
As mentioned in Chapter 2, two separate corpora were collected with two separate groups of speakers. The first one was recorded with 4 male speakers, in contexts covering 4 attitudes only, and the other was recorded with 10 male and 10 female speakers and involves a wider range of attitudes. The first corpus was used for a pilot study to make a preliminary investigation and raise hypotheses about the interplay of glottalized tones and attitudes in Vietnamese. This study played a very important role in leading to the building of the second corpus, which is intended for a full-scale study. This chapter presents both of these studies in full.
3.1 Analysis Method
In order to analyze the EGG signal that was recorded simultaneously with the acoustic signals, the method used is based on the derivative of the EGG signal (Henrich, d’Alessandro, Castellengo, & Doval, 2004), which allows for the measurement of cycle length (and hence F0), glottal open quotient (Oq) (Vu-Ngoc, d’Alessandro, & Michaud, 2008), and a parameter called DECPA: Derivative-Electroglottographic Closure Peak Amplitude (Michaud, 2004b). Specifically, the EGG signal monitors the changes in vocal fold contact area. It rises sharply when the glottis closes, reaches a maximum, then slowly decreases until the point where the vocal folds separate along their upper rim, at which point the EGG signal decreases most rapidly. The derivative of the EGG signal (DEGG) typically has a positive peak at glottis closure and a negative peak at the opening. Figure 3-1 and Figure 3-2 visualize a closing and an opening instant in vocal fold contact area, synchronized with the EGG and DEGG signals.
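To make the principle concrete, the following is a minimal Python sketch of how cycle length (hence F0) and Oq can be read off the DEGG peaks. It is an illustration only, not the analysis tool used in this thesis: the synthetic signal, the function names (synthetic_egg, degg, find_peaks, f0_and_oq) and the relative peak threshold of 0.3 are all assumptions introduced here, and real EGG recordings would need additional handling of noise and of imprecise or double peaks.

```python
import math


def synthetic_egg(f0, oq, fs, duration, closing_width=0.01, opening_width=0.03):
    """Toy EGG-like signal (an assumption for illustration, not real data).

    Contact rises sharply at closure and falls less sharply at opening.
    Intended for moderate Oq values (roughly 0.3 to 0.8).
    """
    closing_phase = 0.1                          # closure instant within the cycle
    opening_phase = closing_phase + (1.0 - oq)   # closed phase lasts (1 - Oq) of the cycle
    samples = []
    for n in range(int(duration * fs)):
        phase = (n * f0 / fs) % 1.0
        rise = 0.5 * math.tanh((phase - closing_phase) / closing_width)
        fall = 0.5 * math.tanh((phase - opening_phase) / opening_width)
        samples.append(rise - fall)
    return samples


def degg(egg, fs):
    """First difference of the EGG signal, scaled to units per second."""
    return [(egg[i + 1] - egg[i]) * fs for i in range(len(egg) - 1)]


def find_peaks(x, threshold):
    """Indices of local maxima of x that exceed the threshold."""
    return [i for i in range(1, len(x) - 1)
            if x[i] > threshold and x[i] >= x[i - 1] and x[i] > x[i + 1]]


def f0_and_oq(egg, fs, rel_threshold=0.3):
    """Return one (F0 in Hz, Oq) pair per glottal cycle.

    A cycle runs from one closure peak (positive DEGG peak) to the next;
    the open phase runs from the opening peak (negative DEGG peak) to the
    next closure.
    """
    d = degg(egg, fs)
    neg = [-v for v in d]
    closures = find_peaks(d, rel_threshold * max(d))
    openings = find_peaks(neg, rel_threshold * max(neg))
    cycles = []
    for c0, c1 in zip(closures, closures[1:]):
        inside = [o for o in openings if c0 < o < c1]
        if not inside:
            continue  # no detectable opening peak in this cycle
        opening = max(inside, key=lambda i: neg[i])  # strongest opening peak
        period = c1 - c0                             # cycle length in samples
        cycles.append((fs / period, (c1 - opening) / period))
    return cycles


egg = synthetic_egg(f0=150.0, oq=0.5, fs=44100, duration=0.05)
for f0, oq in f0_and_oq(egg, fs=44100):
    print(f"F0 = {f0:.1f} Hz, Oq = {oq:.2f}")
```

On this synthetic signal the estimates should come out close to the 150 Hz and 0.5 used to generate it. The amplitude of the DEGG signal at each detected closure index corresponds to what the text calls DECPA.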
Figure 3-1 Visualization of closing instant synchronized with EGG (above) and DEGG (below)
signals (Henrich, 2001)