MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF TECHNOLOGY
Thesis for the degree of
MASTER OF SCIENCE
Modeling the prosody
of Vietnamese language for speech
HANOI UNIVERSITY OF TECHNOLOGY
Faculty of Information Technology
International research center of Multimedia Information, Communication and Application
[...] supervisors, Dr. Eric Castelli and Prof. Pham Thi Ngoc Yén. Thank you very much for orienting and guiding my research in the speech processing domain. Thank you for all your useful advice, your honest criticism and your patience during my master's research.
Special thanks also go to Mrs. Geneviève Caelen-Haumont, PhD students Trần Đỗ Đạt and Vũ Minh Quang, and all members of MICA’s speech group. I could not have done this thesis without your support. Thank you all for your suggestions and your sincere remarks on the whole of my research.
I would like to thank Ms. Đoàn Thị Ngọc Hiển, who guided me in recording the corpus. I would also like to thank the many MICA members who spent much of their time on recording and testing for my research.
I am grateful to Prof. Nguyễn Trọng Giảng and MICA’s directorate for providing me with the best working conditions during my time at the International Research Center MICA.
Finally, I owe a great deal to my parents and my sister for their continued support. I also give very special thanks to my girlfriend for her constant encouragement, which gave me strength and motivation in my work and in my life.
Mạc Đăng Khoa
Master thesis
Abstract
A Text-To-Speech (TTS) system is a computer system that is able to produce speech from text. In a TTS system, the naturalness of the produced speech depends greatly on the variation of pitch, duration and energy during speaking. We call this the “prosody controlling ability”. A TTS system with good prosody controlling ability can simulate human speech prosody appropriate to the speaking context.
In tonal languages such as Vietnamese, the prosody of an utterance results from the combination of two components: “micro-prosody”, corresponding to the tone of each syllable in a sentence, and “macro-prosody”, corresponding to the whole sentence.
The main goal of this thesis is to model the characteristics of Vietnamese prosody for speech synthesis. It focuses on the influence of macro-prosody on micro-prosody in three types of sentence: assertive, interrogative and imperative.
The first task is to set up a “prosody corpus” and extract all possible prosody parameters. Based on the extracted data, we defined seventy-two simple prosody patterns for Vietnamese syllables in three types of sentence. After that, these patterns were applied to synthesize some simple sentences. Finally, some perception experiments were carried out to evaluate these patterns. The results show [...] sentence types and the position of the syllable in a sentence. In the future, we expect to continue this research with more factors of Vietnamese prosody, improve our patterns and apply them to a Vietnamese TTS system.
List of Figures
Figure 1-1: Category of methods for predicting syllable duration [6]
Figure 2-1: Example of the contours of six tones, as described in [21]
Figure 2-2: The shape of Tone 1 with female and male voice [18]
Figure 2-3: The shape of Tone 2 with female and male voice [18]
Figure 2-4: The shape of Tone 3 with female and male voice [18]
Figure 2-5: The shape of Tone 4 with female and male voice [18]
Figure 2-6: The shape of Tone 5 with female and male voice [18]
Figure 2-7: The shape of Tone 5b with female and male voice [18]
Figure 2-8: The shape of Tone 6 with female and male voice [18]
Figure 2-9: The shape of Tone 6b with female and male voice [18]
Figure 2-10: Sentence classification by structure [20]
Figure 2-11: The sentences “Lan thích ăn cơm không” in [...]
Figure 2-12: The sentences “Bảo cố gắng tập đi” in [...]
Figure 2-13: The sentences “Tân bỏ đi chứ” in [...]
Figure 2-14: The differences of F0 contour between Assertive and Interrogative [...]
Figure 3-1: A general function diagram of TTS system [13]
Figure 3-2: Fujisaki model
Figure 3-3: Fujisaki model for tonal language [19]
Figure 3-4: Function diagram of proposal TTS system
Figure 3-5: Prosody generation module
Figure 5-1: An example of synthesized non-sense phrase
Figure 5-3: An example of synthesized multi-type sentences
Figure 5-4: Interface for Perception test 2
Figure 5-5: Correct recognition rate with 8 tones of last syllable
Figure 5-6: Correct recognition rate (%) with other types of sentences
Figure 5-7: Result comparison of three experiments
List of Tables
Table 1.2: Links between levels of representation of prosodic phenomena [13]
Table 1.3: Intonation model classification
Table 2.1: Vietnamese vowels
Table 2.2: Vietnamese consonants
Table 2.3: Arrangement of Vietnamese consonants
Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of each phonetic unit [14]
Table 3.1: Comparison between direct pattern and model pattern
Table 4.1: Prosody corpus structure
Table 4.2: Prosody corpus text information
Table 4.3: Recording information of Prosody corpus
Table 5.1: Confusion matrix (in %) for 8 tones with male voice
Table 5.2: Confusion matrix (in %) for 8 tones with female voice
Table 5.3: Confusion matrix (%) of sentence types with male voice
Table 5.4: Confusion matrix (%) of sentence types with female voice
Table 5.5: Test data for Experiment 2
Table 5.6: Confusion matrix (in %) of sentence types (with male voice)
Table 5.7: Confusion matrix (in %) of sentence types (with female voice)
Table 5.8: Confusion matrix (in %) of sentence types (average of Male and Female)
Table 5.9: Correct recognition rate (%) with other types of sentences
1.1.1 The concept of prosody
1.1.2 Major components of prosody
1.1.3 The functions of prosody
1.1.4 Levels of representation of prosodic phenomena
1.2 Prosody modeling
1.2.1 Intonation model
1.2.2 Duration modeling
1.2.3 This thesis work approach
2 VIETNAMESE LANGUAGE AND PROSODY
3.2 Prosody generation
3.2.1 Overview of prosody generation
3.2.2 From text to prosody
3.3 Other researches and our proposal
4.1 Prosody corpus
4.2.2 Extracting prosody parameters of key-syllable
4.3 Proposal of the patterns for Vietnamese prosody
4.3.1 Methodology
4.3.2 Prosody patterns
4.3.3 Some visual remarks on extracted patterns
5.1 Experiment 1: Tone and non-sense phrase
5.1.1 Objectives
5.1.2 Method and implementation
5.1.3 Results and discussion
5.2 Experiment 2: Multi-type sentences
5.2.1 Objectives
5.2.2 Method and implementation
5.2.3 Results and discussion
5.3 Comparison and conclusion
A: Text for prosody corpus
B: Datasheet of prosody patterns
Chapter 0: Introduction
Introduction
Speech is the primary means of communication between people. Speech synthesis, the automatic generation of speech waveforms, has been under development for several decades. Recent progress in speech synthesis has produced synthesizers with very high intelligibility, but sound quality and naturalness remain a major problem. Most recent research attempts to improve the naturalness of synthesized sound so that it approaches human speech.
In Vietnam, there are currently some Vietnamese synthesis systems such as VnVoice (developed by the Institute of Information Technology) and HoaSung (developed by the International Research Center MICA). These research efforts have obtained some encouraging results. However, before these systems can be released to the market, the quality of the produced speech has to be improved, especially the naturalness of speech prosody.
[...] part of MICA’s project: WN-Synthesis.
Building on the research of PhD student Trần Đỗ Đạt at MICA, we have already developed a speech synthesis system using a sound sample concatenation technique. The first version can now produce sound from a detailed text description, which consists of:
• The sequence of phonemes composing the utterance: it can be obtained automatically from the raw text using a “phonetization” module, whose development is currently underway.
• All information related to voice modulations: mostly the pitch, energy and duration variations that constitute the intonation or prosody of the uttered statement. We call it the “prosody description”.
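The two parts of this detailed text description can be pictured as a small data structure. The following is only a sketch: the class names, field names and units are hypothetical and are not taken from the MICA system, and the numeric values are purely illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SyllableProsody:
    """Hypothetical prosody description for one syllable."""
    f0_hz: List[float]   # sampled pitch contour over the syllable
    duration_ms: float   # syllable duration
    energy_db: float     # mean energy (intensity)


@dataclass
class SyllableDescription:
    """One unit of the detailed text description (assumed layout)."""
    phonemes: List[str]  # would come from the "phonetization" module
    prosody: SyllableProsody


def total_duration_ms(utterance: List[SyllableDescription]) -> float:
    """Sum the syllable durations to get the utterance length."""
    return sum(s.prosody.duration_ms for s in utterance)


# Illustrative values only, not measured data.
utterance = [
    SyllableDescription(["t", "a"], SyllableProsody([210.0, 205.0, 200.0], 180.0, 65.0)),
    SyllableDescription(["d", "i"], SyllableProsody([200.0, 190.0, 170.0], 220.0, 62.0)),
]
print(total_duration_ms(utterance))
```

Keeping the phoneme sequence and the prosody description in separate fields mirrors the split described above: the first can be derived from raw text, while the second has to be generated by a prosody model.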
For tonal languages such as Vietnamese, the prosody of speech is composed of two components, which we call “micro-prosody” and “macro-prosody”:
• Micro-prosody is the variation of pitch, duration and intensity of an individual word or syllable. For a tonal language, micro-prosody is very important for distinguishing the syllable’s tone. Thus, the meaning of the synthesized sound depends greatly on the quality of micro-prosody.
• Macro-prosody is the application of prosody to a whole phrase or sentence. It depends on the type of sentence, the speaker's intentions, emotions, etc. Therefore, the “naturalness” of synthesized speech depends on the ability to control macro-prosody during the speech synthesis process.
Objectives and Tasks
This thesis is part of MICA's speech synthesis research, and its main goal is to extract characteristics of Vietnamese prosody in order to generate the “prosody description” for speech synthesis.
In this thesis, we focus only on the differences of Vietnamese tones in different positions in the sentence and in different types of sentences. In other words, these are the influences of macro-prosody on micro-prosody.
The first task is setting up a corpus for researching Vietnamese prosody. With this corpus, we extract and analyze the parameters of fundamental frequency, duration and intensity of syllables in eight Vietnamese tones, in three positions and in three types of sentences.
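The analysis grid described here (eight tones, three positions, three sentence types) can be enumerated directly, which gives the seventy-two cases mentioned in the abstract. This is a minimal sketch: the tone labels follow the List of Figures, while the position names are an assumption, since the text only says "three positions".

```python
from itertools import product

TONES = ["1", "2", "3", "4", "5", "5b", "6", "6b"]   # eight tone labels
POSITIONS = ["initial", "medial", "final"]            # assumed names
SENTENCE_TYPES = ["assertive", "interrogative", "imperative"]

# One key per (tone, position, sentence type) combination.
pattern_keys = list(product(TONES, POSITIONS, SENTENCE_TYPES))
print(len(pattern_keys))
```

Each key could then index one prosody pattern (F0, duration and intensity values) extracted from the corpus.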
After that, using these prosody parameters, we defined simple prosody patterns for Vietnamese tones, corresponding to the cases of a syllable in three types of sentence: assertive, interrogative and imperative. By applying these patterns to synthesize some simple sentences and carrying out some perception experiments, we can examine the appropriateness of these prosody patterns.
Thesis outline
This thesis is structured as follows:
• Chapter 1 starts with Section 1.1, which gives some background on prosody, along with some definitions and terms used in this thesis. Section 1.2 briefly presents prosody modeling and some prosodic models.
• Chapter 2 gives an overview of the Vietnamese language and Vietnamese prosody.
• Chapter 3 starts with an introduction to Text-to-Speech systems, the general structure of a TTS system and prosody generation. In the last section of this chapter, we present some related work and propose a simple structure for the prosody generation module of a TTS system.
• Chapter 4: Sections 4.1 and 4.2 describe our work on setting up and analyzing the Vietnamese prosody corpus. In Section 4.3, we propose a set of prosody patterns for Vietnamese syllables.
• In Chapter 5, a series of perception experiments is presented for evaluating our proposed patterns.
• Chapter 6 concludes with the conclusions from the work presented in the thesis and suggestions for further work.
Chapter 1 Prosody and Prosodic model
Prosody and Prosodic model
In this chapter, we give an overview of prosody and explain some terms we use in this thesis. The concept of prosody modeling and some prosodic models are then briefly presented.
1.1 Overview of prosody
1.1.1 The concept of prosody
There is no exact definition of the term “prosody”. We can use the term “prosody” broadly, meaning “a time series of speech-related information that is not predictable from a reasonable window (i.e. word-sized or sentence-sized) applied to the phoneme sequence” [1].
Viewed in the large, prosody is a parallel channel for communication, carrying some information that cannot be simply deduced from the lexical channel. All aspects of prosody are transmitted by muscle motions, and in most of them, the recipient can perceive, fairly directly, the motions of the speaker.
Clearly, with that broad definition of prosody, hand gestures and eyebrow and face motions can be considered prosody, because they carry information that modifies and can even reverse the meaning of the lexical channel. However, in the domain of speech processing, we concentrate on the speech aspect of prosody. Thus, prosody could include “Pitch”, “Duration” and “Stress”. In terms of the speech signal, prosody is represented by three components: “Fundamental frequency (F0)”, “Duration” and “Intensity”.
“Prosody” and “Intonation”
The term prosody refers to certain properties of the speech signal such as audible changes in pitch, loudness, and syllable length. For some authors, the set of prosodic features also includes other aspects related to speech timing such as rhythm and speech rate [13].
Some use the term intonation as a synonym for prosody. Others restrict it to the tonal (melodic) aspects of prosody. In this thesis, intonation refers to pitch variation in speech production and is part of prosody [13]. In other words, we have:
Prosody = Intonation + Duration
1.1.2 Major components of prosody
As discussed above, prosody consists of:
• Pitch (fundamental frequency): among prosodic events, the most overt are changes in pitch, which together constitute the pitch contour of the utterance (the F0 contour of the speech signal). Some analyses of sentence-level pitch contours show that the pitch contour of longer utterances can be broken down into a sequence of elementary contours, which can further be divided into syllabic contours [13].
• Duration: duration in prosody concerns the length of the sentence, phrase, word, syllable, voiced part of the syllable, syllabic nucleus, and so on. The duration of syllables and speech sounds depends on several (independent or interdependent) factors such as speech rate, rhythm, phonetic nature, etc. In most cases, the absolute duration of an event is easily measured. However, it is sometimes not obvious how to define the boundary of an event.
• Stress (intensity): stress is a prosodic property that has been described since the very first work on prosody in phonetics. It was said to be related to loudness and phonological force. Both these characterizations refer to the perceptual form of prosody: the syllable carrying stress is prominent with respect to the surrounding syllables, either due to its loudness or to its dynamic properties.
1.1.3 The functions of prosody
Prosody, as expressed in pitch, gives clues to many channels of linguistic and para-linguistic information. Linguistic functions such as stress and tone tend to be expressed as local excursions of pitch movement. Intonation types and para-linguistic functions may affect the global pitch setting, in addition to characteristic local pitch excursions near the edge of the sentence (i.e. boundary tones) [1].
Prosody used to convey lexical meaning: Stress, accentual and tone languages
• Stress language: English is an example of a stress language. Stress location is part of the lexical entry of each English word. For example, "apple" and "orange" both have stress on the first syllable, while “banana” has stress on the second syllable. When an English word is spoken in isolation with declarative intonation, F0 typically peaks on the stressed syllable.
• Accentual language: Japanese is an example of an accentual language. A word is lexically marked as accented (on a particular syllable) or unaccented. A simplified description is that pitch rises near the beginning of an accentual phrase and falls on the accented syllable. For detailed analysis, see Beckman and Pierrehumbert (1988).
• Tone language: Mandarin and Vietnamese are examples of lexical tone languages. Each syllable is lexically marked with one of the lexical tones [...]. Tones have distinctive pitch contours. Altering the pitch contour may have the consequence of changing the lexical meaning of a word, and perhaps the meaning of a sentence. For example, in Vietnamese the syllables “ta” (we), “tà” (flap of a dress), “tã” (nappy), “tả” (to describe), “tá” (dozen) and “tạ” (quintal) have different meanings.
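In written Vietnamese, the lexical tone of a syllable is visible as a diacritic, so the six "ta" syllables above can be told apart programmatically. The sketch below uses Unicode decomposition from the Python standard library; the tone names (ngang, huyền, ngã, hỏi, sắc, nặng) are the conventional Vietnamese names and are an addition here, not labels used in the thesis.

```python
import unicodedata

# Combining marks that carry tone in Vietnamese orthography.
TONE_MARKS = {
    "\u0300": "huyền",  # grave accent
    "\u0301": "sắc",    # acute accent
    "\u0303": "ngã",    # tilde
    "\u0309": "hỏi",    # hook above
    "\u0323": "nặng",   # dot below
}


def tone_of(syllable: str) -> str:
    """Return the tone name of a written syllable ("ngang" if unmarked)."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "ngang"


def strip_tone(syllable: str) -> str:
    """Remove only the tone mark, keeping vowel-quality marks (as in â, ô, ơ)."""
    decomposed = unicodedata.normalize("NFD", syllable)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


for word in ["ta", "tà", "tã", "tả", "tá", "tạ"]:
    print(word, strip_tone(word), tone_of(word))
```

All six words reduce to the same toneless base, which illustrates that only the tone distinguishes their meanings.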
Prosody used to convey non-lexical information: Intonation type (question vs. declarative sentences)
Languages may employ prosody in different ways to differentiate declarative sentences from questions. A general trend is that questions are associated with higher pitch somewhere in the sentence, most commonly near the end. This may be manifested as a final rising contour, or a higher/expanded pitch range near the end of the sentence. In English, declarative intonation is marked by a falling ending while yes-no question intonation is marked by a rising one, as shown on the last digit “one” in the English examples. Russian questions, on the other hand, use strong [...] tail. Chinese questions are manifested by an expanded pitch range near the end of the sentence; however, the speaker preserves the lexical tone shapes [1].
Prosody used to convey discourse functions: Focus, prominence, discourse segments, etc.
Topic initialization is typically associated with high pitch. Pitch is typically raised in the discourse-initial section and lowered in the discourse-final section. Also, new information in the discourse structure is typically accented while old information is de-accented [1].
Prosody used to convey emotion
Most experiments studying emotional speech study stylized emotion, as delivered by actors and actresses. In these acted-out emotions, a few categories of emotions can be reliably identified by listeners, and one can find consistent acoustic correlates of these categories. For example, excitement is expressed by high pitch and fast speed, while sadness is expressed by low pitch and slow speed. Hot anger is characterized by over-articulation, fast, downward pitch movement, and overall elevated pitch. Cold anger shares many attributes with hot anger, but the pitch range is set lower.
The study of emotion in natural speech is a lot more complicated. It is generally recognized that speakers show mixed feelings and ambiguous states of mind, and the emotions do not fall into clear-cut categories [1].
Table 1.1 summarizes the functions of prosody.
Table 1.1: Prosody functions
Linguistic (Lexicon) | Paralinguistic | Discourse function | Extra-linguistic
- Accent | - Assertive | - Prominence | - Sex of [...]
1.1.4 Levels of representation of prosodic phenomena
As for other properties of the speech signal, prosodic events can be studied at various levels of representation (see Table 1.2) [13]:
• First, the acoustic level: the acoustic manifestations of prosody (fundamental frequency, amplitude, and duration) can be measured directly, using specialized hardware or algorithms (such as pitch determination algorithms).
• Second, the perceptual level represents the prosodic events as heard by the listener. As for spectral properties of speech sounds, acoustic characteristics that can be measured are not always perceptible. The [...]
• Finally, the linguistic level represents the prosody of an utterance as a sequence of abstract units (signs, symbols), some of which have a communicative function in speech, while others may just fulfill syntactic requirements. The linguistic structure of prosody is not some hidden code that can simply be revealed using some standard procedure.
Table 1.2: Links between levels of representation of prosodic phenomena [13]
Acoustic | Perceptual | Linguistic
Fundamental frequency (F0) | Pitch | Tone, intonation, aspect of stress
Given the different nature of these representations, it is important to keep them apart. It can be helpful to have the terminology reflect the level of representation. For instance, measuring loudness does not equal measuring signal energy. It is obvious that the perception of loudness is not exclusively related to the amplitude at one point of the signal, but also depends on the duration of the speech fragment (the loudness of which we are measuring), and is relative to the loudness of other parts of the signal.
As one moves away from the acoustic level towards the perceptual and/or linguistic levels, the measurement of some given prosodic property will progressively involve segmentation (for example, into syllables), context (such as relative prominence), and structural information (the linguistic interpretation of a syllabic tone, for example, often depends on whether the related syllable is stressed or not, which requires a prior analysis of the segmental layer).
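The acoustic-level measurements named in this section (fundamental frequency, amplitude and duration) can be sketched on a synthetic frame. This is only an illustration under stated assumptions: the autocorrelation peak picker below is one simple pitch determination method, not the one used in the thesis, and the search range of 75 to 400 Hz and the test signal are arbitrary choices.

```python
import math


def rms_energy(frame):
    """Root-mean-square amplitude of a frame (an acoustic, not perceptual, measure)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_f0(frame, sample_rate, f_min=75.0, f_max=400.0):
    """Estimate F0 in Hz by picking the autocorrelation peak within a lag range."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic voiced frame: a 200 Hz sine, 50 ms long, sampled at 8 kHz.
SAMPLE_RATE = 8000
frame = [0.5 * math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(400)]

f0 = estimate_f0(frame, SAMPLE_RATE)
energy = rms_energy(frame)
duration_ms = 1000.0 * len(frame) / SAMPLE_RATE
print(f0, energy, duration_ms)
```

Note that the RMS value is a property of the signal only. As the text points out, it should not be confused with perceived loudness, which also depends on duration and context.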
1.2 Prosody modeling
Prosodic models serve two purposes. On one hand, they can be scientific hypotheses that explain how we communicate with each other, and what we communicate. On the other hand, they can be engineered software systems that are part of a dialog system or speech synthesizer. To a lesser extent (and this is mostly potential), a prosodic model can be the background for a system to recognize prosody in human speech.
In general, a prosodic model is a combination of two components: an intonation model and a duration model. In this section, we would like to give an overview of some methods for predicting intonation (F0 contours) and duration which have actually been applied in speech synthesis.
1.2.1 Intonation models
1.2.1.1 Intonation model classification
The primary goal of intonation research is to model natural F0 contours of speech, preferably in relation to a transcription and a description of the prosodic intent of the speaker. The starting point of intonation research is the time series of F0. But the interpretation of the F0 information diverges widely among intonation models. Table 1.3 presents a view of how one can classify the various intonation models.
Table 1.3: Intonation model classification
                    | Under-specified   | ->        | Fully specified
Single component    | INTSINT, ToBI, Xu | Tilt, IPO | Olive, Machine learning
Two components      | Gronnum           | Fujisaki  | -
Multiple components | -                 | -         | Van Santen
Under-specified or Fully specified
The shape of an accent may be fully specified (i.e. defined without gaps) or under-specified (defined by disconnected regions or isolated points). Along another dimension, F0 values at any given time may be treated as a single component or as the combination of multiple components.
The advantage of using an under-specified accent shape is that it allows sufficient distance between specified accent targets to allow a smooth F0 transition, typically [...]
On the other hand, a system with fully specified accents leaves little room to resolve conflicting targets. A simple concatenation of fully specified accents will result in a pitch curve with unnatural jumps at the concatenation joints. Many systems, such as Fujisaki (1983, 1988), use filters to smooth out abrupt changes in F0. Alternatively, van Santen (1997, 2000) requires each accent to begin and end at zero to ensure smooth connections between accents.
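The jump at a concatenation joint, and the effect of filtering it, can be shown with a toy contour. The sketch below uses a plain centered moving average as a stand-in smoothing filter; this is an assumption made for illustration and is not the filter of any of the cited systems. The F0 values are invented.

```python
def moving_average(contour, window=5):
    """Smooth an F0 contour with a centered moving average (edges use fewer samples)."""
    half = window // 2
    smoothed = []
    for i in range(len(contour)):
        lo = max(0, i - half)
        hi = min(len(contour), i + half + 1)
        smoothed.append(sum(contour[lo:hi]) / (hi - lo))
    return smoothed


def max_jump(contour):
    """Largest absolute F0 change between neighbouring samples."""
    return max(abs(b - a) for a, b in zip(contour, contour[1:]))


# Two fully specified accent shapes (illustrative values in Hz) joined end to end.
accent_a = [180.0, 190.0, 200.0, 195.0, 185.0]
accent_b = [230.0, 240.0, 235.0, 220.0, 210.0]
raw = accent_a + accent_b
smooth = moving_average(raw)
print(max_jump(raw), max_jump(smooth))
```

The smoothing removes the discontinuity but also flattens the accent peaks, which is one reason the choice between under-specified and fully specified shapes is a real trade-off.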
Single component or many components?
Many intonation models treat surface intonation contours as the superposition of a phrase component and an accent component. Gronnum (1992) and Fujisaki (1983, 1988) are representatives of this view.
A well-defined model that fully specifies accent shape and uses multiple components is van Santen's model (van Santen and Möbius, 1997, 2000; van Santen et al., 1998), where accents are represented by densely populated points, providing a mechanism to describe highly complex accent shapes in detail. We characterize van Santen's system as having multiple components because, in addition to the phrase component, each accent in the phrase also adds a phrase-[...] component that contributes to the surface F0 contour.
The advantage of multiple components is that it provides a mechanism to separate individual accents from long-term effects. However, if one allows multiple components, then one necessarily faces the problem that there is no unique solution in the decomposition of a single F0 time series into multiple components [1]. Any such decomposition depends on a model of the speech process, and is only as good as the underlying model.
In contrast, Liberman and Pierrehumbert (1984) explicitly reject the notion of a phrase curve and represent intonation contours as a single component. The advantage of representing F0 information as a single component is that the representation of accent heights will then be transparent, which lends itself to convenient automatic labeling [1].
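The non-uniqueness problem mentioned above can be made concrete with a toy superposition. In this sketch (illustrative values, simple addition in Hz as an assumption), two different phrase/accent decompositions produce exactly the same surface contour, so the surface alone cannot tell them apart.

```python
def superpose(phrase, accent):
    """Surface contour as the sum of a phrase component and an accent component."""
    return [p + a for p, a in zip(phrase, accent)]


# Decomposition A: declining phrase curve plus one accent bump (illustrative, in Hz).
phrase_a = [200.0, 195.0, 190.0, 185.0, 180.0]
accent_a = [0.0, 20.0, 30.0, 10.0, 0.0]

# Decomposition B: move a constant 5 Hz from the phrase curve into the accent curve.
phrase_b = [p - 5.0 for p in phrase_a]
accent_b = [a + 5.0 for a in accent_a]

surface_a = superpose(phrase_a, accent_a)
surface_b = superpose(phrase_b, accent_b)
print(surface_a == surface_b)
```

A model of the speech process (for example, requiring accents to start and end at zero) is what rules out one of the two decompositions.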
1.2.1.2 Some prosody models
The following gives an overview of the intonation models in Table 1.3.
• INTSINT (Hirst et al., 2000) is an under-specified intonation system that defines an accent by a single point. Fitting quadratic spline curves through these points generates the surface F0.
• ToBI: The most widely used under-specified accent shape is represented by the ToBI model (Beckman and Ayers, 1997; Silverman et al., 1992), which developed from earlier works such as Pierrehumbert (1980), Liberman and Pierrehumbert (1984), and Pierrehumbert and Beckman (1988). Each accent is represented by no more than two points, which specify abstractly the relative contrast of high (H) and low (L). One goal of the ToBI system is to specify a minimal set of categorical labels for intonation. These labels are usually interpreted as phonological distinctions between accent types.
• Tilt (Taylor, 2000; Taylor, 1998) allows more samples than ToBI near the peak of an accent and leaves the other regions unspecified, hence its status halfway to a fully specified system. Tilt considers all accent types to be continuous variations of a single class. Surface variations are accounted for by changes in the continuous parameters.
• IPO (de Pijper, 1983) prepares a piecewise-linear approximation to the pitch contour. They then associate the slope and height of these lines with various types of accents.
• Olive (1975) described a very early fully specified system, following work by Levitt and Rabiner (1970). His model stored the surface pitch vs. time contour as a function of the grammatical structure of the sentence. The contour was then approximated by polynomial splines attached to words, to allow for duration variations.
• Machine learning: Several works using machine-learning techniques generate densely sampled F0 values, including Chen et al. (1992) and Malfrère et al. (1998). We classify these works as fully specified systems even though in some cases the concept of accent may not be clear. Ross and Ostendorf (1999) described an interesting machine-learning system where a discrete learning system would predict vectors attached to phonemes and syllables, and these vectors would in turn drive a (learned) dynamical system to predict F0.
• Fujisaki: Fujisaki’s phonetic intonation model (Fujisaki and Kawai, 1982) was developed from the filter method first proposed by Öhman (1967). Fujisaki states that intonation contours are comprised of two types of components, the phrase and the accent. The production process is represented by a glottal oscillation mechanism which takes phrase and accent information as input and produces a continuous F0 contour as output. The input to the mechanism is in the form of impulses, used to produce phrase shapes, and step functions, used to produce accent shapes.
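The command-response idea behind Fujisaki's model can be sketched in a few lines. The formulation below is the one commonly given in the literature: impulses drive a phrase filter, steps drive an accent filter, and the responses are added in the log-F0 domain. The constants (alpha = 3, beta = 20, gamma = 0.9) are commonly quoted defaults and the command values are invented; none of them come from this thesis.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, base_hz, phrase_cmds, accent_cmds):
    """F0(t) in Hz. phrase_cmds: (onset, amplitude); accent_cmds: (onset, offset, amplitude)."""
    log_f0 = math.log(base_hz)
    for onset, amp in phrase_cmds:
        log_f0 += amp * phrase_response(t - onset)
    for onset, offset, amp in accent_cmds:
        log_f0 += amp * (accent_response(t - onset) - accent_response(t - offset))
    return math.exp(log_f0)


# Illustrative command values, not fitted to any recording.
phrase_cmds = [(0.0, 0.5)]
accent_cmds = [(0.3, 0.6, 0.4)]
contour = [fujisaki_f0(0.01 * i, 120.0, phrase_cmds, accent_cmds) for i in range(150)]
print(round(contour[0], 1), round(max(contour), 1))
```

Because the components are summed in the logarithmic domain, the same accent command raises F0 by the same ratio whatever the baseline frequency is.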
1.2.2 Duration modeling
We now give a general overview of modeling the duration component of prosody. Common methods to predict duration in speech synthesis differ in the following aspects [6]:
• Durational unit predicted. The temporal unit predicted by most current systems is either the phone (phoneme), often referred to as the “segment”, or the syllable. Since phone durations are eventually required for the acoustic synthesis, all syllable-based models include some kind of mechanism for calculating segment durations from the syllable duration. For example, in Barbosa and Bailly’s model, the basic unit is delimited by the onset of the nuclear vowel and the onset of the following vowel. These units are computed by a sequential network constrained by an internal clock (basically the speaking rate).
• Predictor factors. Every model uses a particular vector of input features, which are extracted on the linguistic and phonetic levels. Most commonly employed factors include:
  ✓ on the syllabic level: the degree of accentuation and the position in a higher-level unit, such as the foot or accent group
  ✓ on the segmental level: the properties of the phone to be synthesized and its neighboring phones
  ✓ on the phrase level: the location of a segment with respect to a [...]
• [...] As shown in Figure 1-1, the statistical approaches are subdivided into parametric and non-parametric regression models. Whereas the structure of a parametric regression model, in terms of how it processes the input factors, is determined a priori, non-parametric regression models are developed by unsupervised training and the model structure is determined automatically (multilayer perceptrons, CARTs). The main difference between rule-based and statistical models is that a rule system can be built on relatively little speech data. The formulation of the rules, however, requires a high amount of expert knowledge and considerable optimization.
Figure 1-1: Category of methods for predicting syllable duration [6]
• Pause prediction. Some current approaches incorporate the prediction of speech pauses as part of the model; others treat pauses strictly separately.
• Speech rate. Many current TTS systems produce different speech rates by linearly scaling the durations output by the duration model. As the speech rate affects not only the duration of individual segments but also the overall prosodic structure of an utterance, this kind of modification needs to take place at an earlier step of processing, when the phrasal structure of an utterance is determined.
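The linear scaling approach to speech rate mentioned in the last item amounts to dividing every predicted duration by one factor. The sketch below shows that baseline only; the function name and the duration values are hypothetical.

```python
def scale_durations(durations_ms, rate):
    """Linearly rescale segment durations; rate > 1 means faster speech."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return [d / rate for d in durations_ms]


# Illustrative segment durations (ms) as a duration model might output them.
predicted = [80.0, 120.0, 60.0, 150.0]
fast = scale_durations(predicted, 1.25)
print(fast, sum(fast))
```

Every segment shrinks by the same ratio, which is exactly the limitation noted above: the phrasing and pause structure of the utterance stay unchanged.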
1.2.3 This thesis work approach
Modeling the intonation and duration in prosody is a complex field, related to both linguistics and acoustics. There are many different methods to predict the intonation and duration of speech. However, there is currently no method that completely applies to Vietnamese.
In the scope of this thesis, we use the statistical approach to extract some basic patterns, just for modeling some basic cases in Vietnamese prosody.
The following is some information about our approach to modeling Vietnamese [...]:
• Syllable level factors: Tone of syllable
• Extra-linguistic factors: Male/Female voice
This approach will be described in more detail in Chapter 4.
Chapter 2: Vietnamese language and prosody
Vietnamese language and prosody
The understanding of phonetic and phonological characteristics of a language has an important role in the studies on speech processing in general and on prosody analysis in particular Thus, in this chapter, we give a review of Vietnamese language and Vietnamese prosody
2.1
2.1.1
Vietnamese language
Vietnamese characteristics
As we know, Vietnamese is an amorphous language and a tonal/musical language. It has the following characteristics [21]:
1. Vietnamese words are amorphous: they do not change form to show grammatical categories. In French, for instance, words have masculine and feminine forms (étudiant - étudiante, nouveau - nouvelle) as well as singular and plural forms; in English, words take affixes such as “in” and “un” (impolite, unreadable, irregular).
2. Vietnamese word structure uses very few morphemes. Vietnamese has at most twenty thousand syllables from which to create morphemes; thus Vietnamese does not have the features of inflectional languages.
3. The morpheme index of Vietnamese (number of morphemes M / number of words W) is about 1.06 [13]; this is the lowest index among the 5000 languages of the world [13]. A language whose morpheme index is less than 2 is an analytic language. The amorphous feature of our language is an essential characteristic, which influences the other characteristics of Vietnamese.
4. Vietnamese is a tonal/musical language. It has six tones, and each tone can contribute to forming the morpheme and the meaning of a word, e.g. ba, bà, bá, bả, bã, bạ; me, mè, mé, mẻ, mẽ, mẹ. The tones give Vietnamese a musical character and make sentences rhythmic and melodious.
5. A syllable (isolated word) of Vietnamese in its full structure has five parts: initial sound (consonant), medial sound (semi-vowel), nucleus sound (vowel or diphthong), final sound (consonant or semi-vowel) and tone.
6. In Vietnamese, syllable and morpheme boundaries coincide: one syllable is one morpheme. In French, partir (to leave) has two syllables, par-tir, and two morphemes, part-ir; vendeur (seller) has two syllables, ven-deur, and two morphemes, vend-eur. In English, “words” has one syllable and two morphemes. In Vietnamese, the sentence “Đẹp vô cùng tổ quốc ta ơi” (Tố Hữu) has seven morphemes, seven syllables, and five words (three monosyllabic words: đẹp, ta, ơi, and two compound words: vô cùng, tổ quốc). In conclusion, one Vietnamese unit is at once one syllable, one morpheme and one real word.
7. Almost all Vietnamese vocabulary is built from one or two morphemes, and is monosyllabic or bisyllabic, sometimes polysyllabic. 80% of words are bisyllabic.
8. The difference between the written and the spoken language in grammatical and phonetic rules is not large.
9. Throughout its formation and development, Vietnamese has received quite a large number of words from foreign languages. Han words are the most numerous, followed by French words, and part of them have been fully converted into Vietnamese. For example, the words đấu tranh, giai cấp, hoà bình, độc lập, tự do, hạnh phúc are Han words (Chinese words). Nhà ga (gare), xà phòng (savon), cà phê (café) are French words.
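The morpheme index M/W mentioned in the list above can be illustrated with a short Python sketch. The function names and the default threshold parameter are assumptions introduced for this illustration; the counts in the example are those of the Tố Hữu sentence cited above (seven morphemes, five words).

```python
def morpheme_index(num_morphemes, num_words):
    """Return the morpheme index M / W of a text sample."""
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return num_morphemes / num_words


def is_analytic(index, threshold=2.0):
    """Criterion quoted in the text: an index below 2 marks an analytic language."""
    return index < threshold


# Example sentence cited in the text: seven morphemes, five words.
sentence_index = morpheme_index(7, 5)
```

The index of a single sentence is of course only an illustration; the value of about 1.06 reported in [13] refers to the language as a whole.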
2.1.2 Vietnamese phoneme system
The Vietnamese phoneme system includes 14 vowels or vowel combinations and 22 consonants.
The Vietnamese vowels include 11 vowels and three diphthongs [21]. All vowels are voiced sounds.
Table 2.1: Vietnamese vowels
Letters | Examples
ia, iê, ya, yê | kia kìa, yêu kiều
ua, uô | tua rua, luôn luôn
ưa, ươ | lưa thưa, lượt thượt
Vietnamese has 22 consonants [21], as listed in Table 2.2.
Transcription Reading Letter Example
Based on these features, Vietnamese consonants can be arranged as in Table 2.3.
Table 2.3: Arrangement of Vietnamese consonants (by articulation position and articulation method)
Besides the initial sound (called the INITIAL part), the rest of the syllable is called the FINAL part. A tone is a fundamental frequency variation spreading over the whole syllable. A tone has the same function as a phoneme: it is always assigned to a syllable and its influence covers the entire syllable. There are a few constraints: if a syllable ends with an unvoiced consonant /p, t, k/, only the “sắc” and “nặng” tones are possible; otherwise, in all varieties of Vietnamese, the whole tonal paradigm can occur.
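The constraint between the final consonant and the tone can be sketched as follows. This is a minimal illustration under simplifying assumptions: the Syllable class, its field names and the phonemic representation of the coda are hypothetical, orthographic variants of the final stops are not handled, and the tone numbers follow Table 2.5.

```python
from dataclasses import dataclass

# Tone numbers as in Table 2.5.
TONE_NAMES = {1: "ngang", 2: "huyền", 3: "ngã", 4: "hỏi", 5: "sắc", 6: "nặng"}

# Unvoiced final stops, written as phonemes (simplifying assumption).
UNVOICED_STOP_CODAS = {"p", "t", "k"}


@dataclass
class Syllable:
    initial: str  # INITIAL part, "" if absent
    final: str    # FINAL part without the tone, e.g. "at"
    coda: str     # last consonant of the FINAL part, "" if none
    tone: int     # 1..6


def allowed_tones(coda):
    """Tones that may occur with a given final consonant."""
    if coda in UNVOICED_STOP_CODAS:
        return {5, 6}  # only "sắc" and "nặng"
    return set(TONE_NAMES)


def is_valid(syllable):
    """Check the tone of a syllable against the coda constraint."""
    return syllable.tone in allowed_tones(syllable.coda)
```

A check of this kind could be used, for example, to filter impossible syllable and tone combinations when building a corpus; this use is only a suggestion.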
Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of each
As a tonal language, Vietnamese has a prosody composed of two components, which we call “micro-prosody” and “macro-prosody”.
• Micro-prosody is the variation of pitch, duration and intensity of an individual word or syllable. For a tonal language, micro-prosody is very important for distinguishing the tone of a syllable. Thus, the lexical meaning of the synthesized sound depends heavily on the quality of the micro-prosody.
• Macro-prosody is the application of prosody to a whole phrase or sentence. It depends on the type of sentence and on the speaker's intentions or emotions. Therefore, the “naturalness” of synthesized sentences depends heavily on the ability to control macro-prosody during the speech synthesis process.
2.2.1 Micro-prosody and tones system in Vietnamese
In Vietnamese, micro-prosody depends heavily on the tone of the syllable. Each tone can contribute to forming the morpheme and the meaning of a word; it is also a distinctive signal. A tone has the same function as a phoneme: it is always assigned to a syllable and its influence covers the entire syllable. The tones give Vietnamese a musical character and make sentences rhythmic and melodious.
There are six tones in Vietnamese; they are shown in Table 2.5.
Table 2.5: The six Vietnamese tones
Tone 1 | Tone 2 | Tone 3 | Tone 4 | Tone 5 | Tone 6
ngang | huyền | ngã | hỏi | sắc | nặng
Figure 2-1: Example of the contours of six tones, as described in [21]
• Tone 1 - Level tone (“ngang”): this is a high tone. At the beginning of the syllable, it is the highest tone. The steady state of the level contour is observed consistently. The figure below shows the shape of tone 1 for male and female voices (the two lines represent the maximum and minimum F0 values).
Figure 2-2: The shape of Tone 1 with female and male voice [18]
• Tone 2 - Falling tone (“huyền”): the onset of the falling tone is lower than that of tone 1, tone 5 and tone 3. The low F0 at the onset gradually falls toward the end.
Figure 2-3: The shape of Tone 2 with female and male voice [18]
• Tone 3 - Broken tone (“ngã”): the onset is as high as that of the level tone and higher than that of the falling tone. The contour of this tone is characterized by an abrupt dip caused by glottalization. In most cases, the bottom of the dip occurs between the mid-point and the point two thirds from the onset. A creaky voice is heard during this dip.
Figure 2-4: The shape of Tone 3 with female and male voice [18]
• Tone 4 - Curve tone (“hỏi”): the onset is the lowest among the six tones. The low onset falls further, gradually, until the point two thirds from the onset. From this point, the extremely low F0 starts to rise toward the end.
Figure 2-5: The shape of Tone 4 with female and male voice [18]
• Tone 5 - Rising tone (“sắc”): the onset is also high. Starting from the high onset, the F0 rises gradually for the first two thirds of the duration. After this point, the rise becomes more rapid.
With tone 5 ending with stop consonants (t, p, c, k), the onset is higher than that of tone 5a and the F0 rises rapidly with a short duration. We call this tone 5b.
Figure 2-7: The shape of Tone 5b with female and male voice [18]
• Tone 6 - Drop tone (“nặng”): the onset is usually higher than that of the falling or curve tone but considerably lower than that of tone 1, tone 5 and tone 3. This tone is characterized by glottalization at the end and by a considerably shorter duration than the other tones: its duration is approximately two thirds of that of the other tones. The main body of this tone is almost level or slightly falling.
Figure 2-8: The shape of Tone 6 with female and male voice [18]
Tone 6b (tone 6 ending with stop consonants): the onset is nearly equal to that of tone 2. The F0 falls toward the end with a short duration.
Figure 2-9: The shape of Tone 6b with female and male voice [18]
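To make the verbal descriptions of the contours more concrete, a tone shape can be stylized as a piecewise-linear curve over normalized syllable time. The sketch below does this for tone 4, which falls until two thirds of the syllable and then rises. The breakpoint values are hypothetical relative units chosen only to reproduce the described shape; they are not measured F0 data, and the names piecewise_linear and TONE4_SKETCH are assumptions of this illustration.

```python
def piecewise_linear(t, points):
    """Linearly interpolate a value at time t from sorted (time, value) breakpoints."""
    if not points[0][0] <= t <= points[-1][0]:
        raise ValueError("t lies outside the breakpoints")
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("breakpoints must be sorted by time")


# Hypothetical breakpoints (normalized time, relative F0), not measured data:
# low onset, lowest point at two thirds of the syllable, rise toward the end.
TONE4_SKETCH = [(0.0, 0.40), (2 / 3, 0.20), (1.0, 0.35)]

# Sample the stylized contour at a few points of the syllable.
samples = [piecewise_linear(t, TONE4_SKETCH) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

The same function could stylize the other tones by changing the breakpoints, for instance a slow rise followed by a faster rise for tone 5.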
These descriptions hold only for the Northern dialect, in particular the Hanoi dialect, which is the standard dialect of Vietnamese. They differ for the other dialects in the South and the Center of Vietnam. In these regions, there are only 5 tones instead of 6 as in the Hanoi dialect, because tone 3 and tone 4 are pronounced identically.
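The reduction from six to five tones can be written as a simple mapping. In this sketch the function name merge_tone is hypothetical, and the choice of tone 4 as the label of the merged category is an arbitrary assumption; the text only states that tones 3 and 4 are pronounced identically.

```python
# Tone numbers of the Hanoi dialect, as in Table 2.5.
HANOI_TONES = (1, 2, 3, 4, 5, 6)


def merge_tone(tone, merged_label=4):
    """Map a Hanoi tone number to a five-tone system in which tones 3 and 4 coincide."""
    if tone not in HANOI_TONES:
        raise ValueError("tone must be between 1 and 6")
    return merged_label if tone in (3, 4) else tone


# The set of distinct tone categories after merging.
merged_inventory = sorted({merge_tone(t) for t in HANOI_TONES})
```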
In continuous speech, tones seldom reach their target values. They are generally affected by context: stressed vs. unstressed syllables, the influence of neighbouring tones, tempo, and the effect of some phenomena in Vietnamese prosody, which we will discuss later.
2.2.2 Macro-prosody and sentence types in Vietnamese
As discussed above, macro-prosody depends on the type of sentence and on the speaker's intentions or emotions. Sentences can be classified into the following types:
✓ An assertive sentence or declaration: the most common type, commonly makes a statement. Ex: Tôi sẽ về nhà. (I am going home.)
✓ An interrogative sentence or question: commonly used to request information. Ex: Khi nào anh sẽ làm việc? (When are you going to work?)
✓ An imperative sentence or command: ordinarily used to make a demand or request. Ex: Mở cửa ra! (Open the door!)
✓ An exclamatory sentence or exclamation: generally a more emphatic form of statement. Ex: Ngày hôm nay tuyệt quá! (What a wonderful day this is!)
• Classification by structure: Sentences can also be classified based on their structure (by the number and types of finite clauses), as in the diagram below.
Figure 2-10: Sentence classification by structure [20]
Within the scope of this thesis, we study only the macro-prosody of assertive, interrogative and imperative sentences with a simple structure.
Nguyễn and Boulakia [8] give some prosodic characteristics of three types of sentence (assertive, interrogative and imperative), as follows:
• Duration (Tempo): Interrogative sentences (Q) are shorter than assertive sentences (S), and this difference is significant. Imperatives (I) are even shorter, but the differences with Q and S are not significant.
• Intensity: The difference is significant for the S/I pair (assertive vs. imperative), but not for the S/Q and Q/I pairs.
• Fundamental frequency: The mean F0 value of interrogative and imperative sentences is higher than that of statements, while there is no difference between interrogative and imperative sentences. There is an obvious difference in the last syllable. The phonologically "level" (high) tone falls in statements and is much higher and rising in questions, while its mean value and movement lie halfway between the two for imperatives. The rising tones rise even more in interrogative and imperative sentences than in statements. This means that the intonation influences the tone of the final syllable of the sentence.
Figure 2-11: The sentence “Lan thích ăn cơm không” in Assertive (S) and Interrogative (Q) mode [8]
Figure 2-12: The sentence “Bảo cố gắng tập đi” in Assertive (S) and Imperative (I) mode [8]
Figure 2-13: The sentence “Tân bỏ đi chứ” in Interrogative (Q) and Imperative (I) mode [8]
Vu M. Q. et al. [16] found that the main differences in intonation lie at the end of the sentence (the zone located after the vertical bar in Figure 2-14): the contour of the last syllable, or of its second half, tends to rise for interrogative sentences.
2.2.3 Some special phenomena in Vietnamese prosody
When studying Vietnamese prosody, we have to take into account some special phenomena in Vietnamese. Two of them are “glottalization” and “coarticulation”.
Glottalization
Glottalization is the complete or partial closure of the glottis during the articulation of another sound. Glottalization of vowels and voiced consonants is most often realized as creaky voice (partial closure). Glottalization of voiceless consonants usually involves complete closure of the glottis; another way to describe this phenomenon is to say that a glottal stop is made simultaneously with another consonant [24].
Based on the glottalization feature, the six Vietnamese tones can be classified into two groups: tone 3 (“ngã”) and tone 6 (“nặng”) are glottalized, whereas the other tones are non-glottalized. Tone 3 is accompanied by the rasping voice quality occasioned by tense glottal stricture. In careful speech such syllables are sometimes interrupted completely by a glottal stop (or a rapid series of glottal stops). Its trajectory therefore sometimes shows a characteristic break in the voicing at about half of the total duration of the syllable. Syllables with tone 6 have the same rasping voice quality as tone 3, drop very sharply and are almost immediately cut off by a strong glottal stop. Hence, Vietnamese tones are characterized not only by distinct F0 trajectories, but also by articulatory distinctions and the presence/absence of glottalization.
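The two groups described above can be expressed as a small lookup. The names GLOTTALIZED_TONES, is_glottalized and split_by_glottalization are hypothetical and serve only this illustration; the tone numbers follow Table 2.5.

```python
# Tones described in the text as glottalized: tone 3 ("ngã") and tone 6 ("nặng").
GLOTTALIZED_TONES = frozenset({3, 6})


def is_glottalized(tone):
    """Return True if the tone belongs to the glottalized group."""
    if tone not in range(1, 7):
        raise ValueError("tone must be between 1 and 6")
    return tone in GLOTTALIZED_TONES


def split_by_glottalization(tones=range(1, 7)):
    """Split tone numbers into (glottalized, non-glottalized) lists."""
    glottalized = [t for t in tones if is_glottalized(t)]
    plain = [t for t in tones if not is_glottalized(t)]
    return glottalized, plain
```

Such a feature could serve as one additional binary factor next to the tone identity in a statistical model; this is a suggestion of the sketch, not a statement about the model used in this thesis.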
Coarticulation
Coarticulation is the phenomenon in speech production in which successive sounds overlap rather than being produced as entirely separate sounds. This phenomenon can be explained as follows [23]:
• It takes only about a fifth of a second to produce a syllable,