Studying the phonetic characteristics of glottalized tones in Vietnamese expressive speech


MINISTRY OF EDUCATION AND TRAINING HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY


Nguyen Thi Lan

STUDYING THE PHONETIC CHARACTERISTICS OF GLOTTALIZED TONES IN VIETNAMESE EXPRESSIVE SPEECH

Department: INFORMATION TECHNOLOGY


COMMITMENT

I declare that I am the person responsible for conducting this study. All referenced figures are cited with clear sources. The presented results are truthful and have not been published in any other person’s work.

NGUYEN Thi Lan


ACKNOWLEDGEMENT

This is the second time that I sit here, at Hanoi University of Science and Technology, with the great honor of writing these words of gratitude to the people who have supported me since the first moment I entered the university. The first acknowledgement was written in my graduation thesis 2.5 years ago, and today this one awakens a special emotion in me.

I wish to thank all my professors and colleagues at the School of Information and Communication Technology and the MICA International Research Institute, who have helped me with their generous support. The advice and knowledge they imparted to me are gratefully appreciated and greatly inspired me to finish this thesis.

Special thanks go to my supervisor, Dr. Tran Do Dat, and to my colleagues at the Speech Communication Department, MICA Institute, including Dr. Do Thi Ngoc Diep, Nguyen Thi Thu Trang, Nguyen Tuan Ninh, Tran Thi Anh Xuan, Dr. Nguyen Viet Son, Dr. Nguyen Cong Phuong, Nguyen Duc Anh and Nguyen Tien Thanh, for their advice and encouragement, and especially to Dr. Mac Dang Khoa and Dr. Alexis Michaud for their thorough review and invaluable suggestions. I also thank the two thesis reviewers, Assoc. Prof. Truong Ninh Thuan (VNU) and Dr. Vu Thi Huong Giang (SOICT, HUST), for their valuable comments, which greatly improved the presentation of the thesis.

Special thanks to my family and friends, who always stand by me and lift me up when I am down. Without them, my life would be meaningless!

NGUYEN Thi Lan


CONTENTS

COMMITMENT

ACKNOWLEDGEMENT

LIST OF ABBREVIATIONS

LIST OF TABLES

LIST OF FIGURES

INTRODUCTION

Chapter 1 OVERVIEW

1.1 Background knowledge

1.1.1 Vietnamese phonetics and phonology

1.1.2 The phonetic characteristics of complex lexical tone system in Vietnamese

1.2 Glottalized tones in the context of expressive speech: raising issues

1.3 The scope of the thesis

1.4 Conclusion

Chapter 2 BUILDING VIETNAMESE ATTITUDINAL SPEECH CORPUS FOR SENTENCE-FINAL PARTICLES

2.1 Method of using expressive morphemes carrying lexical tones - Sentence-final particles

2.2 Designing sample corpus

2.3 The progress of building the sample corpus

2.3.1 Elicitation method and speakers

2.3.2 Recording conditions

2.3.3 Post-processing and annotation

2.4 Conclusion

Chapter 3 ANALYSING VARIATION IN REALIZATION OF GLOTTALIZED TONES BY VARIOUS ATTITUDES

3.1 Analysis Method

3.2 A pilot data analysis: important new discoveries and an insight into the use of expressive morphemes and glottalized tones

3.2.1 Comparison of attitudes: Surprise and Declaration

3.2.2 Comparison of attitudes: Irritation and Declaration

3.3 Proposals for a full-scale study and statistical analyses of phonation-types based on EGG and DEGG signal

3.3.1 Observation about the irregularities of DECPA

3.3.2 Building a tool for detection of creaky and pressed voice in Matlab

3.4 A full-scale study: detailed analysis results

3.4.1 Surprise and Declaration

3.4.2 Irritation and Declaration

3.5 Discussion

3.6 Conclusion

CONCLUSION AND PERSPECTIVES

REFERENCES

PUBLICATIONS

APPENDIX A: CONTEXTS OF EACH COLLECTED SENTENCE

APPENDIX B: FIGURES OF AVERAGED F0&Oq CONTOURS OF EACH SPEAKER WITH STANDARD DEVIATION FOR THE USED ATTITUDES


LIST OF ABBREVIATIONS

SFP – Sentence-final particle

EGG – Electroglottography

DEGG – The derivative of the electroglottography signal

IPA – International Phonetic Alphabet

DECPA – Derivative-Electroglottographic Closure Peak Amplitude

X-SAMPA – The Extended Speech Assessment Methods Phonetic Alphabet

F0 – Fundamental frequency


LIST OF TABLES

Table 1-1 Vietnamese consonants

Table 1-2 Vietnamese vowels/diphthongs

Table 1-3 Phonetic characteristics of Vietnamese initial consonants

Table 1-4 Phonetic characteristics of Vietnamese final consonants

Table 1-5 Phonetic characteristics of Vietnamese vowels/diphthongs

Table 1-6 Summarized description of 8 tones of Vietnamese

Table 2-1 Intended attitudes

Table 2-2 List of speakers

Table 3-1 Statistics of Mechanism I-A/Pressed Voice/Mechanism I-B of tone 6a with attitude surprise

Table 3-2 Statistics of Mechanism I-A/Creaky Voice/Mechanism I-B of tone 3 with attitude declaration

Table 3-3 Statistics of Mechanism I-A/Pressed Voice/Mechanism I-B of tone 3 with attitude irritation


LIST OF FIGURES

Figure 1-1 Schematic diagram of Hanoi Vietnamese tones (Michaud, 2004a)

Figure 2-1 Speaker F7 (left) and M10 (right) in the recording booth

Figure 2-2 Sentence and Syllable Level Annotation with SoundForge (above) and Praat (below) of the corpus

Figure 3-1 Visualization of closing instant synchronized with EGG (above) and DEGG (below) signals (Henrich, 2001)

Figure 3-2 Visualization of opening instant synchronized with EGG (above) and DEGG (below) signals (Henrich, 2001)

Figure 3-3 Example of EGG and DEGG signals with indication of glottis closure and opening

Figure 3-4 Two realizations of glottalization on SFP /a6a/ with two attitudes (a): declarative/neutral; (b): surprise. Speaker M7

Figure 3-5 Average curves of F0 and Oq over 6 tokens of /a6a/, speaker M7

Figure 3-6 Two realizations of glottalization on SFP /ɗa3/ of two attitudes (a): declarative/neutral; (b): irritation. Speaker M6

Figure 3-7 Average curves of F0 and Oq over 6 tokens of /ɗa3/, speaker M6

Figure 3-8 Determining mechanisms of voice based on DECPA and F0 parameters (each point of F0&Oq contour corresponds with a cycle on DEGG signal)

Figure 3-9 Determining the duration of pressed voice based on local dipping of Oq (each point of F0&Oq contour corresponds with a cycle on DEGG signal)

Figure 3-10 The tool for detection integrated three analysis modules

Figure 3-11 Some visually illustrative figures of creaky voice from the detection tool

Figure 3-12 Some visually illustrative figures of pressed voice from the detection tool

Figure 3-13 Averaged F0 contours of SFP /a6a/ of 10 male speakers: surprise and declaration

Figure 3-14 Averaged Oq contours of SFP /a6a/ of 10 male speakers: surprise (left) and declaration (right)

Figure 3-15 Averaged F0 contours of SFP /ɗa3/ of 10 male speakers: irritation and declaration

Figure 3-16 Averaged Oq contours of SFP /ɗa3/ of 10 male speakers: declaration

Figure 3-17 Averaged Oq contours of SFP /ɗa3/ of 10 male speakers: irritation

Figure 3-18 Proposed model for combination of speaker attitude, voice quality and glottalized tone in Vietnamese expressive speech processing


INTRODUCTION

Nowadays, speech is gradually becoming a major modality of human-machine interaction, promising to replace traditional means of communication such as the mouse, keyboard and screen. However, a high-quality human-machine interaction system that behaves entirely like a human being is still beyond our reach. One of the primary reasons is the lack of advanced techniques for precisely processing (either synthesizing or recognizing) the expressiveness of human utterances.

Expression, in other words, refers to the attitudinal or emotional aspects of speech, which can convey much linguistic information. In this perspective, the attitudinal aspects of speaker utterances, also called speaker attitudes, are of no small importance. If speaker attitudes play such an important role in interactions between humans, they need to be taken into account in interactions between humans and machines (Picard, 1997). Attitudinal information in a spoken utterance can be lexically encoded but can also be conveyed by intonation, including modifications of voice quality (Seibert, 2003).

However, the modification of these features in Vietnamese is quite complex because of the interplay between intonation and tones; the complexity increases further with the glottalized tones, namely tone ngã and tone nặng. Furthermore, how this interplay is expressed in expressive speech, what its realization is, and through which mechanisms, are among the many open questions.

Among the eight tones of Vietnamese, tone ngã and tone nặng are considered the most complicated because they are accompanied by glottalization. In most cases, for the simpler tones, the interaction between intonation and tone can be described by changes in fundamental frequency, intensity or duration, whereas for these two glottalized tones these parameters are not sufficient, since glottalization can vary considerably depending on context. Many studies have approached this question, but they seem to avoid its most complicated aspect, namely glottalization in Vietnamese.

Therefore, with a view to applications in Vietnamese speech processing, the ultimate objective is to provide a sufficiently detailed account of the interplay between glottalized tones and intonation for both automatic speech recognition and text-to-speech systems to encode and decode attitudinal information in speakers’ utterances. The contents of the thesis are as follows:

Chapter 1 presents an overview of Vietnamese phonetics and phonology, tones and the expression of attitudes, as well as the open issues that need to be dealt with and the approach of the thesis.

Chapters 2 and 3 present the proposed methods for data acquisition and analysis, which are based on EGG and DEGG signals, in order to clarify the mechanism of interaction between glottalized tones and expressive speech intonation.

Finally, Chapter 4 gives some conclusions and perspectives for expanding the study to cover a wider range of speaker attitudes and more Vietnamese tones.

The obtained results include:

- Thesis report
- Attitudinal corpus: recorded with 10 male and 10 female speakers
- Method and tool for the detection and quantification of creaky and pressed voice in the Surprise/Irritation/Declaration attitudes
- 1 international conference paper: INTERSPEECH 2013
- 1 national journal paper: Journal of Science & Technology of Technical Universities in Vietnam, 101 (2014)


Chapter 1 OVERVIEW

Like any other language, Vietnamese has a rich system of consonants and vowels, together with various rules for forming meaningful words. However, one characteristic that makes it especially attractive to researchers is its complex lexical tone system. Why is this system considered complex, and why was the study of its tones chosen as the focus of the thesis? Furthermore, the author also studied expressive speech and emphasizes that the relationship between tonal realization and attitudinal expression in Vietnamese should be taken seriously; is this a point that distinguishes Vietnamese from other languages? This chapter gives a brief introduction to Vietnamese phonetics and phonology. The section on raising issues then clarifies the questions above as well as our interests.

1.1 Background knowledge

1.1.1 Vietnamese phonetics and phonology

Many works have studied the Vietnamese phonological system over the years, such as (Doan, 1977), (Nguyen, Edmondson, & Jerold, 1998), (Hwa-Froelich, Hodson, & Edwards, 2002), (Nguyen, Carre, & Castelli, 2008), (Michaud & André-Georges, 2010) and (Hajek, 2008). These works differ in how they establish the Vietnamese phonological system, but in general the consonants and vowels of Vietnamese can be summarized as in Table 1-1 and Table 1-2 respectively, in both IPA and X-SAMPA notation (Doan, 1977):


: initial before i, e, ê, y

Table 1-1 Vietnamese consonants


Table 1-2 Vietnamese vowels/diphthongs

Specifically, there are 24 consonant phonemes in total across initial and final positions; some occur only initially or only finally, others occur in both positions, and several occur only after certain vowels. In addition, there are 9 long vowels, 4 short vowels and 3 diphthongs, which are combinations of single vowels.

Table 1-3 and Table 1-4 describe the phonetic characteristics of these consonants. In these tables, phonemes are written in the format “IPA symbol (X-SAMPA symbol)”, where the X-SAMPA part is omitted if it is identical to the IPA symbol. For the two variants of final /ŋ/ and /k/ after /u ɔ o/, /ŋm/ is a labial-velar nasal while /kp/ is a voiceless labial-velar plosive (Hajek, 2008) (Doan, 1977).
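The labelling convention used in these tables can be sketched in a few lines of Python. This is only an illustration: the function name `format_phoneme` and the small symbol subset are assumptions made for the example, not part of the thesis materials; the IPA/X-SAMPA pairs shown follow standard X-SAMPA.

```python
# Illustrative IPA -> X-SAMPA pairs (a small assumed subset, not the full inventory).
IPA_TO_XSAMPA = {
    "m": "m",
    "t": "t",
    "k": "k",
    "ɗ": "d_<",  # voiced alveolar implosive
    "ɲ": "J",    # palatal nasal
    "ŋ": "N",    # velar nasal
    "ʂ": "s`",   # voiceless retroflex fricative
    "ɔ": "O",    # open-mid back rounded vowel
}


def format_phoneme(ipa: str) -> str:
    """Return 'IPA (X-SAMPA)', or just 'IPA' when both symbols are identical."""
    xsampa = IPA_TO_XSAMPA[ipa]
    return ipa if xsampa == ipa else f"{ipa} ({xsampa})"


if __name__ == "__main__":
    for symbol in ("t", "ɗ", "ŋ"):
        print(format_phoneme(symbol))
```

Keeping the mapping in a single dictionary means that the same table can serve both for display and for converting annotations between the two notations.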


Table 1-3 Phonetic characteristics of Vietnamese initial consonants

Green bold consonants: do not exist in the Northern dialect. In addition, in this dialect:

- ch- /c/ and tr- /ʈ/ are pronounced alike

- d-, gi- /z/ and r- /ʐ/ are pronounced alike

- x- /s/ and s- /ʂ/ are pronounced alike

Table 1-4 Phonetic characteristics of Vietnamese final consonants

Table 1-5 presents the phonetic characteristics of the 16 vowels and diphthongs of Vietnamese. As in other languages, they are distinguished by which part of the tongue is involved (front, central, back) and by how high the tongue is when the sound is produced (high, mid, low).


Table 1-5 Phonetic characteristics of Vietnamese vowels/diphthongs

The above is a brief introduction to Vietnamese phonetics and phonology. The next section presents one of the problems that always challenges anyone who wants to approach Vietnamese: the Vietnamese tone system.

1.1.2 The phonetic characteristics of complex lexical tone system in Vietnamese

Vietnamese is a tonal language; that is, the meaning of each word depends on the "tone" in which it is pronounced. Many other languages also use tones, such as Mandarin and Thai. However, the Vietnamese tone system is relatively complex in comparison, since it has a six-tone paradigm for sonorant-final syllables and a two-tone paradigm for obstruent-final syllables (Michaud, 2004a). The experiment reported there warrants the conclusion that the rising (5b) and drop (6b) tones (i.e. the tones of syllables ending in /p/, /t/ or /k/, called checked syllables) are not glottalized, either in final or non-final position. Therefore, it can be said that there are 8 different tones in Vietnamese. Work on oral airflow (Michaud, Vu, Angelique, & Bernard, 2006) brings out a clear difference between these two sets of rhymes: tone 6a (drop tone in unchecked syllables) has low oral airflow, whereas tones 5b and 6b have relatively high oral airflow, close to the range of breathy voice.


Table 1-6 Summarized description of 8 tones of Vietnamese

Specifically, a phonetically detailed description of each tone, summarized from (Thompson, 1987)(Mixdorff, Nguyen, Fujisaki, & Luong, 2003)(Nguyễn, 1997)(Michaud, 2004a), is as follows:

Tone 1 – level tone (“ngang”) is modal and sometimes lax, and its contour is nearly level in non-final syllables not accompanied by heavy stress, although even in these cases it probably trails downward slightly.

Tone 2 – falling tone (“huyền”) is lax, starts quite low and trails downward toward the bottom of the voice range. It is often accompanied by a kind of breathy voicing (voiceless + modal), reminiscent of a sigh. For some speakers it is even lax to the point of breathiness, with somewhat lowered subglottal air pressure.

Tone 3 – broken tone (“ngã”) is also high and rising, the F0 contour being similar to that of tone 5, but it is accompanied by a rasping voice quality (strong creaky voice starting toward the middle of the vowel and lessening as the end of the syllable is approached) occasioned by tense glottal stricture. In careful speech such syllables are sometimes interrupted completely by a glottal stop (or a rapid series of glottal stops). Its trajectory therefore sometimes shows a characteristic break in voicing at about half of the total duration of the syllable. Many speakers begin the vowel with modal voice, followed by strong creaky voice starting toward the middle of the vowel.

Tone 4 – curve tone (“hỏi”) is tense and drops rather abruptly. It starts with modal voice phonation, which moves increasingly toward tense voice with accompanying harsh voice (although the harsh voice seems to vary according to speaker). In final syllables, and especially in citation forms, this is followed by a sweeping rise at the end, and for this reason it is often called the ‘dipping’ tone. However, non-final syllables seem only to have a brief level portion at the end, and this is exceedingly elusive in rapid speech. Although tone 4 is usually described as a low falling and then rising tone, not all Vietnamese speakers have the rising part. Curve and broken tones are both tense, but their tension is not alike and is not distributed across the syllable in the same way.

Tone 5a – rising tone (“sắc”) is high and rising (perhaps nearly level in rapid speech) and tense. Phonetically, tone 5a is produced with modal voice.

Tone 6a – drop tone (“nặng”) is also tense; it starts somewhat lower than tone 4. Syllables bearing tone 6a have the same rasping voice quality as tone 3, drop very sharply and are almost immediately cut off by a strong glottal stop. Tone 6a is much shorter than other tones, with a tendency to go lower.

As for tones 5b and 6b, the orthography identifies tone 5b with tone 5a as sắc and tone 6b with tone 6a as nặng; these are the names that the tones carry in present-day Vietnamese orthography. However, tones 5b and 6b are not glottalized, either in final or non-final position (Michaud, 2004a). Tone 6a is characterized by a gesture of strong constriction that is distinct from creaky voice; tone 6b drops more sharply than tone 2, but it is never accompanied by the breathy quality of tone 2.
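The tone inventory described above (a six-tone paradigm for sonorant-final syllables, a two-tone paradigm for syllables ending in /p/, /t/ or /k/, and two glottalized tones) can be summarized as a small lookup. The sketch below is a hedged illustration only: the names `possible_tones` and `is_glottalized` are hypothetical, and the syllable's final consonant is assumed to be given as a plain string.

```python
from typing import List, Optional

# Tone labels follow the numbering used in this chapter.
SONORANT_FINAL_TONES = ["1", "2", "3", "4", "5a", "6a"]  # six-tone paradigm
OBSTRUENT_FINAL_TONES = ["5b", "6b"]                     # two-tone paradigm
GLOTTALIZED_TONES = {"3", "6a"}                          # ngã and nặng (unchecked)
OBSTRUENT_FINALS = {"p", "t", "k"}                       # checked syllables


def possible_tones(final: Optional[str]) -> List[str]:
    """Tones available to a syllable, given its final consonant (None = open syllable)."""
    if final in OBSTRUENT_FINALS:
        return list(OBSTRUENT_FINAL_TONES)
    return list(SONORANT_FINAL_TONES)


def is_glottalized(tone: str) -> bool:
    """True for the two tones described as glottalized (3 and 6a)."""
    return tone in GLOTTALIZED_TONES


if __name__ == "__main__":
    print(possible_tones(None))   # open syllable: six-tone paradigm
    print(possible_tones("t"))    # checked syllable: two-tone paradigm
    print(len(SONORANT_FINAL_TONES) + len(OBSTRUENT_FINAL_TONES))  # 8 tones in total
```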

Figure 1-1 Schematic diagram of Hanoi Vietnamese tones (Michaud, 2004a)

This section has presented the features of Vietnamese tones that need to be taken into account when approaching the language. The next section discusses expressive speech more generally, across languages.

1.2 Glottalized tones in the context of expressive speech: raising issues

Glottalization is a challenge for speech processing: it disrupts F0 estimation (making it unclear how F0 should be measured), which raises problems for averaging and for building models. Specifically, most current speech synthesis and recognition systems do not take the control of glottalization into account because of its complexity.

In languages such as English, the issue may appear secondary, as glottalization is not phonological in the standard variety. Glottalization is a characteristic of certain sociolects: creak in “drawl”; ‘glottaling’ of /t/, which is becoming increasingly common in familiar speech, used to be stigmatized as “working-class”/vulgar (Fabricius & Anne, 2002). Among the national languages of Europe, only Danish possesses phonological glottalization (stød) (Fischer-Jørgensen & Eli, 1989). There exist languages in which glottalization is controlled in greater phonological detail, for instance languages of the Mon-Khmer family, but these languages are relatively less well studied, and given the present state of documentation, the study of the fine phonetic detail of these phenomena in discourse is seldom perceived as a priority by linguists (DiCanio & Christian, 2009).

Hanoi Vietnamese has a key role to play here: it has extremely rich glottalization phenomena, and as the official standard of a country with about 90 million inhabitants, it receives increasing attention from speech technology specialists. A salient aspect of the Hanoi Vietnamese tone system is the use of phonation-type characteristics (Nguyen et al., 1998)(Brunelle, Nguyen, & Nguyen, 2010)(Kirby, 2010)(Brunelle, 2009a), absent from other dialects (Tran, 1969). Hanoi Vietnamese makes use of glottalization as part of the lexical specification of some of its tones. In particular, tones 6a and 3 are glottalized. Tone 3 (also referred to by its orthographic label, ngã, or the English descriptor ‘broken tone’) is a rising tone with strong glottalization in its first half. Tone 6a (orthographic nặng, ‘drop tone’) starts on a middle pitch and usually falls dramatically because of strong glottalization in its second half. It has been reported that glottal constriction for tone 6a is consistently present both in a ‘neutral’ context and in an ‘emphatic/impatient’ context (Michaud & Vu, 2004).

Glottalization in Vietnamese is not only a distinctive characteristic of tone: fine details in its phonetic realization can convey intonational information. Vietnamese has salient intonational phenomena (Tran & Castelli, 2008). The surface realization of tones depends greatly on intonation: phrasing, prominence, and the expression of attitudes and emotions. Therefore, it appeared worthwhile to investigate how speaker attitude affects the realization of glottalization, a phonetic dimension which is cross-linguistically known to convey “paralinguistic” information (Fónagy, 1983)(Gobl & Ní Chasaide, 2003). Specifically, the research issue is: how


1.3 The scope of the thesis

In view of the context set out above, the goal of the present study is to investigate the phonetic characteristics of glottalized tones in Vietnamese expressive speech, focusing on sentence-final particles. Due to the limitations of the present study, applications in speech processing will not be attempted. The aim is to provide a sufficiently detailed analysis of production data to pave the way for fresh work on the synthesis and recognition of attitudes in Vietnamese in the future.

More precisely, we concentrate on tone 3 and tone 6a with three attitudes: Declaration, Surprise and Irritation, since these are the most clearly perceived (Mac, 2009). The objective is to determine how these attitudes change the realization of glottalization on these two tones and the use of their special voice qualities. Even so, the speech corpus is not limited to these objects, so that it can also serve further research.

1.4 Conclusion

This chapter has presented an overview of Vietnamese phonetics and phonology as well as the phonetic characteristics of its lexical tone system. The open issues and the author’s interest in glottalized tones and expressive speech were then set out as the main point of the thesis. The next chapter proposes an approach that uses expressive morphemes called sentence-final particles as the objects through which to study glottalization in the interaction between lexical tone function and attitudinal function, and presents the construction of our corpus for this research.


Chapter 2 BUILDING VIETNAMESE ATTITUDINAL SPEECH CORPUS FOR SENTENCE-FINAL PARTICLES

As announced in the previous chapter, this chapter focuses on the construction of the speech corpus used to investigate the interplay between glottalized tones and attitudinal expression in Vietnamese. Several SFPs that carry both lexical tones and attitudinal information were used to construct target sentences that concentrate on basic speaker attitudes and glottalized tones.

A corpus designed for the study of social attitudes in Vietnamese already exists (Mac et al., 2009), but it does not contain SFPs. We therefore decided to record new data. Speech data acquisition is an underestimated challenge (Niebuhr and Michaud), especially when attempting to capture such elusive aspects of speech as attitudinal information. Special attention was therefore paid to the elaboration of materials and recording procedures.

The research was divided into two phases, and a different corpus was built for each. The first phase was a pilot study with a small corpus and four speakers, intended to explore initial hypotheses on SFPs, glottalized tones and speaker attitudes. The second phase, with a larger corpus recorded with 20 speakers, expanded on the results of the pilot study. Within the scope of the thesis, we aimed to substantiate the qualitative observations by concentrating on tone 3, tone 6a, the three studied attitudes and the male speakers; the rest of the corpus is reserved for further research. This chapter presents both corpora.

2.1 Method of using expressive morphemes carrying lexical tones - Sentence-final particles

Languages differ in the means that they offer for the expression of attitudes and emotions. In English, intonation is known to fulfill a considerable range of functions, including subtle nuances related to attitudes and emotions. Japanese and Cantonese are famous examples of languages that possess morphemes which have been described as performing functions that intonation performs in a language such as English (Chan & Marjorie, 1999). For instance, the Cantonese particle /µEſ/ is suffixed to a declarative sentence to convert it into a question of disbelief or surprise (Wu 2008, p. 24) or a “query to the truth of something” (Kwok 1984, p. 88).

The particles called sentence-final particles (hereafter SFPs) constitute a marginal class of expressive words indicating speech act types, evidential/epistemic nuances, and affective/emotional colouring. There are about ten SFPs in Mandarin, thirty in Cantonese (Kwok & Helen, 1984), and about the same number in Vietnamese (Tran, 2010); SFPs are ubiquitous in casual, conversational speech. SFPs “often carry much of the meaning and function that intonation does in non-tone languages” (Chan & Marjorie, 1998). The relationship is not simply one of functional equivalence between intonation and SFPs, however, since SFPs also carry intonational information: sentence-level intonational phenomena are known to cluster on SFPs. One and the same SFP can take on different nuances (creating different sense-effects) depending on the intonational realization of the SFP itself (the ‘tune’ that it carries) and of the sentence as a whole.

In Vietnamese, where they clearly have a tone of their own, SFPs provide an exemplary illustration of the superposition of tone and intonation. An important proportion of sentence-level intonation, conveying sentence mode and attitudes, is concentrated at the end of the utterance, on the SFP(s) (Do, Tran, & Georges, 1998). This superposition affects F0 (Nguyen & Tran, 2012), but also phonation types. The purpose of the present study is to investigate how speaker attitude affects the realization of glottalization for the two glottalized tones 6a and 3 (orthographic nặng and ngã) carried by SFPs. A pilot study (Nguyen, Michaud, Tran, & Mac, 2013) suggests that glottalization is phased earlier for surprise than for declaration, and that irritation also tends to be reflected in earlier glottalization, but with an


proper_name to_go/up workplace/company

This sentence was then combined with the SFPs ạ [IPA: /a6a/], conveying politeness, and đã [IPA: /ɗa3/], conveying tense-aspect-modality information. This yields (2) Lam lên công ty ạ and (3) Lam lên công ty đã. Finally, sentences (1-3) were placed inside dialogues, which were precisely contextualized. The attitudes under study are (i) politeness, associated lexically with the SFP ạ, and (ii) declaration, irritation, and surprise, elicited by context. The general context is as follows: Lam, Minh and An are three friends who have just moved into a shared flat; today is Saturday, a day when they go neither to class nor to their workplace; but Lam is suddenly requested to go to work for urgent business.

However, it turned out that the SFP ạ [IPA: /a6a/] sometimes tended to coalesce with the preceding syllable, ty ([IPA: /ti1/]), in carrier sentence (1). In hyper-articulated speech, the SFP /a6a/ begins with a glottal onset (empty-onset filler), which sets it off from what precedes. However, the onset of this syllable is one of the parameters that vary strongly depending on context, including cases where there is no detectable initial glottalization in the acoustic signal, resulting in segmentation difficulties. This detracts from the precision of measurements.

As a consequence, a slightly different set of materials was devised for the full-scale study. The details are set out below.

In the target sentence, a given name was chosen as grammatical subject. Among the wealth of Vietnamese given names, Ba, meaning ‘three’, i.e. ‘third child’, was chosen for two main reasons: first, its vowel /a/ allows phonetic comparison of the vowel /a/ in Ba with that in ạ and đã; second, its tone is phonetically simple: tone 1, ngang, a non-low tone that is relatively level.

Table 2-1 presents the speech materials. Labels for the intended attitudes follow the terminology proposed by (Mac, Aubergé, Rilliard, & Castelli, 2009), which distinguishes 16 attitudes and treats sentence mode (declarative, interrogative or imperative) as part of speaker attitude.

Table 2-1 Intended attitudes


POL = Politeness, AUT = Authority, SAR = Sarcastic Irony. Slots in grey indicate combinations judged implausible. Politeness (POL) is conveyed semantically by the SFP /a6a/.

The data are not fully symmetrical because of the semantics of the SFPs. The attitudes of sarcastic irony and authority are antagonistic to the respect (acknowledgment of the addressee’s seniority) conveyed by the SFP /a6a/; likewise, surprise and interrogation are antagonistic to the assertiveness conveyed by the SFP /ɗa3/, hence the four empty slots in Table 2-1.

In addition, two other SFPs were used in order to demonstrate that the final glottalization of /a6a/ and the medial glottalization of /ɗa3/ are due to their lexical tone, and not to intonational factors; this point was checked using SFPs with tones that do not involve glottal constriction. In order to cover the same range of attitudes as for the SFPs /a6a/ and /ɗa3/, two different SFPs had to be used: hả, carrying tone 4, is compatible with the expression of interrogation and surprise; and mà, carrying tone 2, was recorded with the other four attitudes.
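The pairing of SFPs and attitudes described above can be sketched as a small design matrix. This is a hedged illustration, not the thesis's own tooling: the set of six attitude labels is inferred from the text (two attitudes for hả plus "the other four" for mà), and the ASCII keys `a6a`, `da3`, `ha4` and `ma2` are hypothetical identifiers standing for ạ, đã, hả and mà.

```python
from typing import Dict, List, Set

# Six elicited attitudes, as inferred from the text (an assumption of this sketch).
ATTITUDES: List[str] = [
    "declaration", "interrogation", "surprise",
    "irritation", "authority", "sarcastic irony",
]

# Glottalized SFPs and the attitudes judged implausible with them.
IMPLAUSIBLE: Dict[str, Set[str]] = {
    "a6a": {"authority", "sarcastic irony"},  # clash with respect/politeness
    "da3": {"interrogation", "surprise"},     # clash with assertiveness
}

# Control SFPs (non-glottalized tones) and the attitudes they were recorded with.
CONTROL: Dict[str, Set[str]] = {
    "ha4": {"interrogation", "surprise"},
    "ma2": set(ATTITUDES) - {"interrogation", "surprise"},
}


def recorded_attitudes(sfp: str) -> List[str]:
    """Attitudes combined with a given SFP, listed in the order of ATTITUDES."""
    if sfp in CONTROL:
        allowed = CONTROL[sfp]
    else:
        allowed = set(ATTITUDES) - IMPLAUSIBLE[sfp]
    return [a for a in ATTITUDES if a in allowed]


def empty_slots() -> int:
    """Number of implausible (empty) cells for the two glottalized SFPs."""
    return sum(len(excluded) for excluded in IMPLAUSIBLE.values())


if __name__ == "__main__":
    for sfp in ("a6a", "da3", "ha4", "ma2"):
        print(sfp, recorded_attitudes(sfp))
    print("empty slots:", empty_slots())
```

Under these assumptions, each control SFP covers exactly the attitudes of the glottalized SFP it is meant to be compared with: hả and mà together span the same six attitudes.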

Four specific target sentences and their contexts are as follows.

SFP ạ with the interrogative attitude: Ba đi học ạ? (Does Ba go to school?) was uttered in a context in which one of Ba’s classmates comes and asks one of Ba’s older relatives for some extra information. In that case, the classmate should show respect and politeness. Context number 3, with SFP ạ and surprise, is a situation in which another of Ba’s classmates comes to meet Ba because he thought that Ba was still at home. The answer from Ba’s brother or sister, asserting that Ba has gone to school, comes as a big surprise to him.

Contexts number 8 and 10, with irritation and sarcastic irony respectively, provide another illustration. Context number 8 is a situation in which Ba is held back by some bad guys while he needs to go to school right away; an assertion with irritation is then the natural choice to show his strong refusal. In context number 10, Ba tells his roommate that he needs to go to school immediately, perhaps for some English classes, so the discussion between them has to be temporarily stopped. The thought that Ba could not be such a hard-working student as to go to school even at the weekend then leads the roommate to tease Ba by repeating his utterance with a tone of sarcastic irony.

The above are some instances illustrating our context-based method of data collection; 17 contexts were created to fit the selected attitudes and SFPs (see Appendix A).

2.3 The process of building the sample corpus

2.3.1 Elicitation method and speakers

In the pilot study, two different approaches to data collection were used. The first aimed at maximal ecological validity, eliciting the intended attitude through contextualization, from two speakers who were unaware of the purpose of the study. (Their speaker codes, assigned as part of a larger database, are M4 and M5, respectively.) The second aimed at maximal clarity in contrasting different attitudes: two speech scientists (M6 and M7) who were aware of the purpose of the study deliberately expressed the intended attitude as identified by the labels in Table 2-1. M4 and M5 are aged 24; they have a university education in software engineering. M6 and M7 are aged 26 and 31, respectively. They were born in Hanoi and are permanent residents there, apart from a total of 2 years in France for M7. All four can speak some English, and M7 is also fluent in French.

For the full-scale study, it was possible to recruit a sizeable group of speakers (10 female and 10 male) from the Hanoi Academy of Theatre and Cinema, where well-known Vietnamese actors are trained, which ensured a consistent age group. In addition, they are all from the Department of Spoken Theatre, so their ability to express different attitudes orally is a major asset. No group of speakers is ‘ideal’, and this choice raises concerns of naturalness: there is a risk that the speakers are reproducing stereotyped patterns for the expression of attitudes (patterns designed for the stage, which do not correspond to patterns found in ordinary speech). Great care was taken to verify perceptually that the intended attitudes were recognized by persons outside the narrow circle of the performing arts, using science and technology students as subjects for the perception tests. The information concerning the speakers is summarized below.

Table 2-2 List of speakers

Speaker code    Gender    Age    Dialect background    Foreign languages

2.3.2 Recording conditions

The recordings were conducted in the MICA Institute’s sound-treated booth. The participants received information about electroglottography (EGG) and its complete harmlessness (Fabre, 1957; Baken, 1992). They were then given time to familiarize themselves with the scripts of the dialogues. Questions were answered through discussion of the context. After this, the speakers read the dialogues three times, then swapped roles and read them another three times. They were instructed to read ‘like actors’, an indirect way to elicit a vivid, expressive dialogue.


Figure 2-1 Speaker F7 (left) and M10 (right) in the recording booth

2.3.3 Post-processing and annotation

The average recording time for each speaker was between 25 and 35 minutes; in total, 10 hours of speech from 20 speakers were collected. The electroglottographic signal from an EG2-PC (for one of the speakers) and the audio (from one microphone for each speaker) were recorded as three synchronized WAV files (44,100 Hz, 24-bit). After recording, among the repeated samples of each target sentence for each speaker, the 6 best samples, in which the speaker produced the most natural voice, were extracted and annotated in SoundForge (sentence level) and in Praat (syllable level), as in Figure 2-2.
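As a quick sanity check of the figures above, the following sketch recomputes the total duration from the per-speaker range and estimates the raw size of one WAV file. The helper names are hypothetical, and the size estimate assumes uncompressed single-channel PCM and ignores the file header; neither assumption is stated in the recording description.

```python
SAMPLE_RATE_HZ = 44_100      # sampling rate given for the recordings
BYTES_PER_SAMPLE = 24 // 8   # 24-bit samples occupy 3 bytes each


def total_hours(n_speakers, minutes_per_speaker):
    """Total speech duration in hours for a group of speakers."""
    return n_speakers * minutes_per_speaker / 60


def pcm_bytes(seconds, channels=1):
    """Raw PCM payload size in bytes (assumption: uncompressed, header ignored)."""
    return seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels


# Range of total duration implied by 25 to 35 minutes for each of 20 speakers.
print(total_hours(20, 25), total_hours(20, 35))
# Payload of one minute of one mono file, in bytes.
print(pcm_bytes(60))
```

With 30 minutes per speaker, the midpoint of the stated range, the total is exactly the 10 hours reported.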


Figure 2-2 Sentence- and syllable-level annotation of the corpus with SoundForge (above) and Praat (below)

The recordings made for the pilot study are available online as part of the MICA Institute’s AuCo project; long-term archiving and online availability are guaranteed (http://lacito.vjf.cnrs.fr/archivage/languages/Vietnamese_en.htm). We plan to make


Furthermore, we did not limit ourselves to building a small corpus covering only the tones ngã and nặng with 3 attitudes (Surprise, Declaration and Irritation), as required by our immediate objectives. Instead, to take advantage of the professional speakers, an additional sample corpus covering a wider range of tones and attitudes was built as well. In this way, our corpus can be used for further research on other aspects of Vietnamese expressive speech processing, including cross-gender studies, since the equal numbers of male and female speakers make it suitable for such research.

Within the scope of the thesis, a sufficient part of the corpus was exploited to introduce a new approach to this issue. The next chapter presents the observations and illustrative figures resulting from thorough analyses; in particular, the discussion at the end of the chapter provides an insight into how a model can be generated from these analysis results.


GLOTTALIZED TONES BY VARIOUS ATTITUDES

As mentioned in Chapter 2, two separate corpora were collected, with two separate groups of speakers. The first was recorded with 4 male speakers, using contexts for 4 attitudes only; the second was recorded with 10 male and 10 female speakers, and involves a wider range of attitudes. The first corpus was used for a pilot study, to conduct a preliminary investigation and raise hypotheses about the interplay of glottalized tones and attitudes in Vietnamese. This study played a very important role in the building of the second corpus, which is intended for a full-scale study. This chapter presents both studies in full.

3.1 Analysis Method

To analyze the EGG signal, which was recorded simultaneously with the acoustic signals, the method used is based on the derivative of the EGG signal (Henrich, d’Alessandro, Castellengo, & Doval, 2004), which allows for the measurement of cycle length (and hence F0), glottal open quotient (Oq) (Vu-Ngoc, d’Alessandro, & Michaud, 2005), and a parameter called DECPA: Derivative-Electroglottographic Closure Peak Amplitude (Michaud, 2004b). Specifically, the EGG signal monitors the changes in vocal fold contact area. It rises sharply when the glottis closes, reaches a maximum, then slowly decreases until the point where the vocal folds separate along their upper rim, at which point the EGG signal decreases most rapidly. The derivative of the EGG signal (DEGG) typically has a positive peak at glottis closure and a negative peak at opening. Figure 3-1 and Figure 3-2 illustrate a closing and an opening instant in vocal fold contact area, synchronized with the EGG and DEGG signals.
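The principle of these measurements can be sketched on a synthetic signal. The code below is only an illustration under stated assumptions, not the analysis tool used in this study: the function names, the toy tanh-shaped EGG waveform, the central-difference derivative and the half-maximum peak threshold are all hypothetical choices. It takes the positive DEGG peaks as closing instants and the negative peaks as opening instants, then derives F0 from the distance between successive closures, Oq as the duration from an opening to the next closure divided by the cycle length, and DECPA as the DEGG amplitude at the closure peak.

```python
import numpy as np

FS = 44_100  # sampling rate of the recordings, in Hz


def synth_egg(period, open_quotient, n_cycles):
    """Toy EGG-like signal: contact rises fast at closure, falls more gently at opening."""
    t = np.arange(period * n_cycles, dtype=float)
    closed_len = period - int(round(open_quotient * period))
    egg = np.zeros_like(t)
    for k in range(n_cycles):
        closing = 20 + k * period        # closing instant of cycle k (samples)
        opening = closing + closed_len   # opening instant of cycle k (samples)
        egg += 0.5 * (np.tanh((t - closing) / 2.0) - np.tanh((t - opening) / 4.0))
    return egg


def local_peaks(x, threshold):
    """Indices of strict local maxima of x that exceed the threshold."""
    mid = x[1:-1]
    is_peak = (mid > x[:-2]) & (mid > x[2:]) & (mid > threshold)
    return np.flatnonzero(is_peak) + 1


def analyse_cycles(egg, fs):
    """Return one row (F0 in Hz, Oq, DECPA) per complete glottal cycle."""
    degg = np.gradient(egg)  # derivative of the EGG signal, per sample
    closures = local_peaks(degg, 0.5 * degg.max())      # positive DEGG peaks
    openings = local_peaks(-degg, 0.5 * (-degg).max())  # negative DEGG peaks
    rows = []
    for start, end in zip(closures[:-1], closures[1:]):
        inside = openings[(openings > start) & (openings < end)]
        if inside.size != 1:
            continue  # skip cycles without exactly one clear opening peak
        period = end - start
        rows.append((fs / period, (end - inside[0]) / period, degg[start]))
    return np.array(rows)


egg = synth_egg(period=210, open_quotient=0.5, n_cycles=10)
for f0, oq, decpa in analyse_cycles(egg, FS)[:3]:
    print(f"F0 = {f0:.1f} Hz, Oq = {oq:.2f}, DECPA = {decpa:.3f}")
```

Cycles without exactly one opening peak are skipped rather than guessed at; this is a deliberate simplification, since real DEGG signals can show weak or double peaks, which this sketch does not attempt to resolve.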


Figure 3-1 Visualization of closing instant synchronized with EGG (above) and DEGG (below) signals (Henrich, 2001)
