
Thesis: Modeling the prosody of Vietnamese language for speech synthesis


DOCUMENT INFORMATION

Basic information

Title: Modeling the prosody of Vietnamese language for speech synthesis
Author: Mac Dang Khoa
Supervisor: Prof. Pham Thi Ngoc Yen
School: Hanoi University of Technology
Major: Information Processing and Communication
Type: Master thesis
Year of publication: 2007
City: Hanoi
Number of pages: 104


MINISTRY OF EDUCATION AND TRAINING

HANOI UNIVERSITY OF TECHNOLOGY

Thesis for the degree of

MASTER OF SCIENCE

Modeling the prosody of Vietnamese language for speech synthesis


HANOI UNIVERSITY OF TECHNOLOGY

Faculty of Information Technology

International research center of Multimedia Information, Communication and Application


[…] supervisors: Dr. Eric Castelli and Prof. Pham Thi Ngoc Yen. Thank you very much for orienting and guiding my research in the speech processing domain. Thank you for all your useful advice, your honest criticism and your patience during my master's research.

Special thanks also go to Mrs. Geneviève Caelen-Haumont, PhD students Trần Đỗ Đạt and Vũ Minh Quang, and all members of MICA’s speech group. I could not have completed this thesis without your support. Thank you all for your suggestions and your sincere remarks on the whole of my research.

I would like to thank Ms. Đoàn Thị Ngọc Hiển, who guided me in recording the corpus. I would also like to thank the many MICA members who spent much of their time recording and testing for my research.

I am grateful to Prof. Nguyễn Trọng Giảng and MICA’s directorate for providing me with the best possible working conditions during my time at the International Research Center MICA.

Finally, I owe a great deal to my parents and my sister for their continued support. I also give very special thanks to my girlfriend for her constant encouragement, which gave me strength and motivation in my work and in my life.

Mạc Đăng Khoa


Master thesis

Abstract

A Text-To-Speech (TTS) system is a computer system that is able to produce speech from text. In a TTS system, the naturalness of the produced speech depends greatly on the variation of pitch, duration and energy during speaking. We call this the “prosody controlling ability”. A TTS system with good prosody controlling ability can simulate human speech prosody appropriate to the context of speaking.

In tonal languages such as Vietnamese, the prosody of an utterance results from the combination of two components: “micro-prosody”, corresponding to the tone of each syllable in a sentence, and “macro-prosody”, corresponding to the whole sentence.

The main goal of this thesis is to model the characteristics of Vietnamese prosody for speech synthesis. It focuses on the influences of macro-prosody on micro-prosody in three types of sentence: assertive, interrogative and imperative.

The first task is to set up a “prosody corpus” and extract all possible prosody parameters. Based on the extracted data, we defined seventy-two simple prosody patterns for Vietnamese syllables in three types of sentence. After that, these patterns were applied to synthesize some simple sentences. Finally, some perception experiments were carried out to evaluate these patterns. The results showed […] sentence types and the position of the syllable in a sentence. In the future, we expect to continue this research with more factors of Vietnamese prosody, improve our patterns and apply them to a Vietnamese TTS system.


List of Figures

Figure 1-1: Category of methods for predicting syllable duration [6]
Figure 2-1: Example of the contours of six tones, as described in [21]
Figure 2-2: The shape of Tone 1 with female and male voice [18]
Figure 2-3: The shape of Tone 2 with female and male voice [18]
Figure 2-4: The shape of Tone 3 with female and male voice [18]
Figure 2-5: The shape of Tone 4 with female and male voice [18]
Figure 2-6: The shape of Tone 5 with female and male voice [18]
Figure 2-7: The shape of Tone 5b with female and male voice [18]
Figure 2-8: The shape of Tone 6 with female and male voice [18]
Figure 2-9: The shape of Tone 6b with female and male voice [18]
Figure 2-10: Sentence classification by structure [20]
Figure 2-11: The sentences “Lan thích ăn cơm không” in […]
Figure 2-12: The sentences “Bảo cố gắng tập đi” in […]
Figure 2-13: The sentences “Tân bỏ đi chứ” in […]
Figure 2-14: The differences of F0 contour between Assertive and Interrogative […]
Figure 3-1: A general function diagram of TTS system [13]
Figure 3-2: Fujisaki model
Figure 3-3: Fujisaki model for tonal language [19]
Figure 3-4: Function diagram of proposal TTS system
Figure 3-5: Prosody generation module
Figure 5-1: Example of synthesized non-sense phrase
Figure 5-3: An example of synthesized multi-type sentences
Figure 5-4: Interface for Perception test 2
Figure 5-5: Correct recognition rate with 8 tones of last syllable
Figure 5-6: Correct recognition rate (%) with other types of sentences
Figure 5-7: Result comparison of three experiments

List of Tables

Table 1.2: Links between levels of representation of prosodic phenomena [13]
Table 1.3: Intonation model classification
Table 2.1: Vietnamese vowels
Table 2.2: Vietnamese consonants
Table 2.3: Arrangement of Vietnamese consonants
Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of each phonetic unit [14]
Table 3.1: Comparison between direct pattern and model pattern
Table 4.1: Prosody corpus structure
Table 4.2: Prosody corpus text information
Table 4.3: Recording information of Prosody corpus
Table 5.1: Confusion matrix (in %) for 8 tones with male voice
Table 5.2: Confusion matrix (in %) for 8 tones with female voice
Table 5.3: Confusion matrix (%) of sentence types with male voice
Table 5.4: Confusion matrix (%) of sentence types with female voice
Table 5.5: Test data for Experiment 2
Table 5.6: Confusion matrix (in %) of sentence types (with male voice)
Table 5.7: Confusion matrix (in %) of sentence types (with female voice)
Table 5.8: Confusion matrix (in %) of sentence types (average of Male and Female)
Table 5.9: Correct recognition rate (%) with other types of sentences

1.1.1 The concept of prosody
1.1.2 Major components of prosody
1.1.3 The functions of prosody
1.1.4 Levels of representation of prosodic phenomena
1.2 Prosody modeling
1.2.1 Intonation model
1.2.2 Duration modeling
1.2.3 This thesis work approach
2 VIETNAMESE LANGUAGE AND PROSODY
3.2 Prosody generation
3.2.1 Overview of prosody generation
3.2.2 From text to prosody
3.3 Other researches and our proposal
4.1 Prosody corpus
4.2.2 Extracting prosody parameters of key-syllable
4.3 Proposal the patterns for Vietnamese prosody
4.3.1 Methodology
4.3.2 Prosody patterns
4.3.3 Some visual remarks on extracted patterns
5.1 Experiment 1: Tone and non-sense phrase
5.1.1 Objectives
5.1.2 Method and implementation
5.1.3 Results and discussion
5.2 Experiment 2: Multi-type sentences
5.2.1 Objectives
5.2.2 Method and implementation
5.2.3 Results and discussion
5.3 Comparison and conclusion
A: Text for prosody corpus
B: Datasheet of prosody patterns

Chapter 0: Introduction

Introduction

Speech is the primary means of communication between people. Speech synthesis, the automatic generation of speech waveforms, has been under development for several decades. Recent progress in speech synthesis has produced synthesizers with very high intelligibility, but sound quality and naturalness remain a major problem. Most recent research attempts to improve the naturalness of synthesized sound so that it approaches human speech.

In Vietnam, there are currently some Vietnamese synthesis systems such as VnVoice (developed by the Institute of Information Technology) and HoaSung (developed by the International Research Center MICA). This research has obtained some encouraging results. However, before these systems can be released to the market, the quality of the produced speech must be improved, especially the naturalness of the speech prosody.

[…] part of MICA’s project: WN-Synthesis

Together with the research of PhD student Trần Đỗ Đạt at MICA, we have already developed a speech synthesis system using a sound-sample concatenation technique. The first version can now produce sound from a detailed text description, which consists of:


* The sequence of phonemes composing the utterance: this can be obtained automatically from the raw text using a “phonetization” module, whose development is currently underway.

* All information related to voice modulations: mostly the pitch, energy and duration variations that constitute the intonation or prosody of the uttered statement. We call it the “prosody description”.
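As an illustration of the two inputs listed above, the following minimal sketch shows one way a phoneme sequence and its "prosody description" could be held together in memory. The class name, field names, units and sample values are all assumptions made for this example; the thesis does not specify the actual input format of the MICA synthesizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeProsody:
    """One phoneme plus its voice-modulation values (hypothetical layout)."""
    phoneme: str         # phoneme symbol, e.g. output of a phonetization step
    duration_ms: float   # duration of the phoneme in milliseconds
    f0_hz: List[float]   # pitch targets across the phoneme (empty if unvoiced)
    energy_db: float     # relative energy in decibels


def total_duration_ms(description: List[PhonemeProsody]) -> float:
    """Sum the phoneme durations to get the utterance length."""
    return sum(p.duration_ms for p in description)


# Illustrative values only: an unvoiced consonant followed by a voiced vowel.
description = [
    PhonemeProsody("t", 60.0, [], -12.0),
    PhonemeProsody("a", 180.0, [210.0, 205.0, 200.0], -3.0),
]
print(total_duration_ms(description))
```

Keeping pitch, duration and energy next to each phoneme makes it straightforward to swap in a different prosody pattern without touching the phoneme sequence itself.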

For tonal languages such as Vietnamese, the prosody of speech is composed of two components, which we call “micro-prosody” and “macro-prosody”:

* Micro-prosody is the variation of pitch, duration and intensity of an individual word or syllable. In a tonal language, micro-prosody is very important for distinguishing the syllable’s tone. Thus, the meaning of the synthesized sound depends greatly on the quality of the micro-prosody.

* Macro-prosody is the application of prosody to a whole phrase or sentence. It depends on the type of sentence, the speaker's intentions, emotions, etc. Therefore, the “naturalness” of synthesized speech depends on the ability to control macro-prosody during the speech synthesis process.

Objectives and Tasks

This thesis is part of MICA's speech synthesis research, and its main goal is to extract characteristics of Vietnamese prosody in order to generate the “prosody description” for speech synthesis.

In this thesis, we focus only on the differences between Vietnamese tones in different positions in the sentence and in different types of sentences. In other words, these are the influences of macro-prosody on micro-prosody.

The first task is to set up a corpus for researching Vietnamese prosody. With this corpus, we extract and analyze the parameters of fundamental frequency, duration and intensity of syllables in eight Vietnamese tones, in three positions and in three types of sentence.
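The analysis grid described above (eight tones, three positions, three sentence types) can be enumerated as in the following sketch. The tone labels follow the figure list (Tones 1 to 6 plus 5b and 6b); the position names are hypothetical placeholders, since this part of the thesis does not name the three positions. The resulting count is consistent with the seventy-two patterns mentioned in the abstract, although the text does not spell out how that number is derived.

```python
from itertools import product

# Eight tone labels as they appear in the list of figures.
TONES = ["1", "2", "3", "4", "5", "5b", "6", "6b"]
# Hypothetical position names; the thesis only says "three positions".
POSITIONS = ["initial", "middle", "final"]
# Sentence types named in the thesis.
SENTENCE_TYPES = ["assertive", "interrogative", "imperative"]

# Every (tone, position, sentence type) combination to be analyzed.
conditions = list(product(TONES, POSITIONS, SENTENCE_TYPES))
print(len(conditions))  # 8 * 3 * 3 = 72
```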


After that, using these prosody parameters, we defined simple prosody patterns for Vietnamese tones, corresponding to the cases of a syllable in three types of sentence: assertive, interrogative and imperative. By applying these patterns to synthesize some simple sentences and carrying out some perception experiments, we can examine the appropriateness of these prosody patterns.

Thesis outline

This thesis is structured as follows:

* Chapter 1 starts with Section 1.1, which gives some background on prosody along with the definitions and terms used in this thesis. Section 1.2 briefly presents prosody modeling and some prosodic models.

* Chapter 2 gives an overview of the Vietnamese language and Vietnamese prosody.

* Chapter 3 starts with an introduction to Text-to-Speech systems, the general structure of a TTS system and prosody generation. In the last section of this chapter, we present some related work and propose a simple structure for the prosody generation module of a TTS system.

* Chapter 4: Sections 4.1 and 4.2 describe our work of setting up and analyzing the Vietnamese prosody corpus. In Section 4.3, we propose a set of prosody patterns for Vietnamese syllables.

* Chapter 5 presents a series of perception experiments for evaluating our proposed patterns.

* Chapter 6 closes with the conclusions from the work presented in the thesis and suggestions for further work.

Chapter 1 Prosody and Prosodic model

Prosody and Prosodic model

In this chapter, we give an overview of prosody and explain some terms we use in this thesis. The concept of prosody modeling and some prosodic models are then briefly presented.

1.1 Overview of prosody

1.1.1 The concept of prosody

There is no exact definition of the term “prosody”. We can use the term “prosody” broadly, meaning “a time series of speech-related information that is not predictable from a reasonable window (i.e. word-sized or sentence-sized) applied to the phoneme sequence” [1].

Viewed in the large, prosody is a parallel channel for communication, carrying some information that cannot be simply deduced from the lexical channel. All aspects of prosody are transmitted by muscle motions, and in most of them the recipient can perceive, fairly directly, the motions of the speaker.

Clearly, with that broad definition of prosody, hand gestures and eyebrow and face motions can be considered prosody, because they carry information that modifies and can even reverse the meaning of the lexical channel. However, in the domain of speech processing, we concentrate on the speech aspect of prosody. Thus, prosody could include “Pitch”, “Duration” and “Stress”. In the speech signal, prosody is represented by three components: “Fundamental frequency (F0)”, “Duration” and “Intensity”.

“Prosody” and “Intonation”

The term prosody refers to certain properties of the speech signal such as audible changes in pitch, loudness, and syllable length. For some authors, the set of prosodic features also includes other aspects related to speech timing such as rhythm and speech rate [13].

Some use the term intonation as a synonym for prosody. Others restrict it to the tonal (melodic) aspects of prosody. In this thesis, intonation refers to pitch variation in speech production and is part of prosody [13]. In other words, we have:

Prosody = Intonation + Duration

1.1.2 Major components of prosody

As discussed above, prosody consists of:

* Pitch (fundamental frequency): Among prosodic events, the most overt are changes in pitch, which together constitute the pitch contour of the utterance (the F0 contour of the speech signal). Analyses of sentence-level pitch contours show that the pitch contour of longer utterances can be broken down into a sequence of elementary contours, which can further be divided into syllabic contours [13].

* Duration: Duration in prosody concerns the length of a sentence, phrase, word, syllable, voiced part of a syllable, syllabic nucleus, and so on. The duration of syllables and speech sounds depends on several (dependent or interdependent) factors such as speech rate, rhythm, phonetic nature, etc. In most cases, the absolute duration of an event is easily measured. Sometimes, however, it is not obvious how to define the boundary of an event.

* Stress (intensity): Stress is a prosodic property that has been described since the very first work on prosody in phonetics. It was said to be related to loudness and phonological force. Both these characterizations refer to the perceptual form of prosody: the syllable carrying stress is prominent with respect to the surrounding syllables, either due to its loudness or to its dynamic properties.

1.1.3 The functions of prosody

Prosody, as expressed in pitch, gives clues to many channels of linguistic and para-linguistic information. Linguistic functions such as stress and tone tend to be expressed as local excursions of pitch movement. Intonation types and para-linguistic functions may affect the global pitch setting, in addition to characteristic local pitch excursions near the edge of the sentence (i.e. boundary tones) [1].

Prosody used to convey lexical meaning: stress, accentual and tone languages

* Stress language: English is an example of a stress language. Stress location is part of the lexical entry of each English word. For example, “apple” and “orange” both have stress on the first syllable, while “banana” has stress on the second syllable. When an English word is spoken in isolation with declarative intonation, f0 typically peaks on the stressed syllable.

* Accentual language: Japanese is an example of an accentual language. A word is lexically marked as accented (on a particular syllable) or unaccented. A simplified description is that pitch rises near the beginning of an accentual phrase and falls on the accented syllable. For a detailed analysis, see Beckman and Pierrehumbert (1988).

* Tone language: Mandarin and Vietnamese are examples of lexical tone languages. Each syllable is lexically marked with one of the lexical tones (…). Tones have distinctive pitch contours. Altering the pitch contour may have the consequence of changing the lexical meaning of a word, and perhaps the meaning of a sentence. For example, in Vietnamese, the meanings of the syllables “ta” (we), “tà” (flap of a dress), “tã” (nappy), “tả” (to describe), “tá” (dozen) and “tạ” (quintal) are all different.
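The Vietnamese example above can be made concrete with a small sketch: the six words differ only in their tone mark, so removing the marks collapses them into a single string. The glosses are taken from the example; the helper name and the use of Unicode decomposition are illustrative choices for this sketch and are not part of the thesis.

```python
import unicodedata

# The six syllables from the example, with their glosses.
TONE_WORDS = {
    "ta": "we",
    "tà": "flap of a dress",
    "tã": "nappy",
    "tả": "to describe",
    "tá": "dozen",
    "tạ": "quintal",
}


def strip_marks(syllable: str) -> str:
    """Remove all combining marks after canonical decomposition.

    Note: for other Vietnamese syllables this also strips vowel-quality
    diacritics (e.g. the circumflex in "â"), so it is only a rough tool.
    """
    decomposed = unicodedata.normalize("NFD", syllable)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


stripped = {strip_marks(word) for word in TONE_WORDS}
print(len(TONE_WORDS), "words collapse to", stripped)
```

This is why a synthesizer that gets the tone contour wrong can change the word that listeners hear, not just how natural it sounds.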

Prosody used to convey non-lexical information: intonation type (question vs. declarative sentences)

Languages may employ prosody in different ways to differentiate declarative sentences from questions. A general trend is that questions are associated with higher pitch somewhere in the sentence, most commonly near the end. This may be manifested as a final rising contour, or as a higher or expanded pitch range near the end of the sentence. In English, declarative intonation is marked by a falling ending while yes-no question intonation is marked by a rising one, as shown on the last digit “one” in the English examples. Russian questions, on the other hand, use strong […] tail. Chinese questions are manifested by an expanded pitch range near the end of the sentence; however, the speaker preserves the lexical tone shapes [1].

Prosody used to convey discourse functions: focus, prominence, discourse segments, etc.

Topic initialization is typically associated with high pitch. Pitch is typically raised in the discourse-initial section and lowered in the discourse-final section. Also, new information in the discourse structure is typically accented while old information is de-accented [1].

Prosody used to convey emotion

Most experiments on emotional speech study stylized emotion, as delivered by actors and actresses. In these acted-out emotions, a few categories of emotion can be reliably identified by listeners, and one can find consistent acoustic correlates of these categories. For example, excitement is expressed by high pitch and fast speed, while sadness is expressed by low pitch and slow speed. Hot anger is characterized by over-articulation, fast, downward pitch movement, and overall elevated pitch. Cold anger shares many attributes with hot anger, but the pitch range is set lower.

The study of emotion in natural speech is a lot more complicated. It is generally recognized that speakers show mixed feelings and ambiguous states of mind, and that emotions do not fall into clear-cut categories [1].

Table 1.1 summarizes the functions of prosody.

Table 1.1: Prosody functions

Linguistic (Lexicon) | Paralinguistic | Discourse function | Extra-linguistic
- Accent | - Assertive | - Prominence | - Sex of […]

1.1.4 Levels of representation of prosodic phenomena

As for other properties of the speech signal, prosodic events can be studied at various levels of representation (see Table 1.2) [13]

* First, the acoustic level: the acoustic manifestations of prosody (fundamental frequency, amplitude, and duration) can be measured directly, using specialized hardware or algorithms (such as pitch determination algorithms).

* Second, the perceptual level represents the prosodic events as heard by the listener. As for spectral properties of speech sounds, acoustic characteristics that can be measured are not always perceptible. The […]

* Finally, the linguistic level represents the prosody of an utterance as a sequence of abstract units (signs, symbols), some of which have a communicative function in speech, while others may just fulfill syntactic requirements. The linguistic structure of prosody is not some hidden code that can simply be revealed using some standard procedure.

Table 1.2: Links between levels of representation of prosodic phenomena [13]

Acoustic | Perceptual | Linguistic
Fundamental frequency (F0) | Pitch | Tone, intonation, aspect of stress
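To make the acoustic level more tangible, here is a minimal sketch of measuring the three acoustic quantities on a single frame. It uses a plain autocorrelation peak search as a stand-in for the "pitch determination algorithms" mentioned above; this is one common textbook approach and not necessarily the method used in the thesis. The function names, the search range of 50 to 400 Hz and the synthetic test signal are assumptions for this example.

```python
import math


def estimate_f0(samples, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation peak."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag (favors shorter lags).
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def rms(samples):
    """Root-mean-square amplitude, a simple intensity measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# Synthetic frame: a 200 Hz sine of amplitude 0.5, 400 samples at 8 kHz.
SAMPLE_RATE = 8000
frame = [0.5 * math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE)
         for n in range(400)]

f0 = estimate_f0(frame, SAMPLE_RATE)
duration_s = len(frame) / SAMPLE_RATE
amplitude = rms(frame)
print(f0, duration_s, amplitude)
```

Real speech needs voicing detection, windowing and octave-error handling on top of this, which is part of why measured values and perceived pitch do not always coincide.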

Given the different nature of these representations, it is important to keep them apart. It can be helpful to have the terminology reflect the level of representation. For instance, measuring loudness does not equal measuring signal energy. The perception of loudness is not exclusively related to the amplitude at one point of the signal; it also depends on the duration of the speech fragment whose loudness we are measuring, and is relative to the loudness of other parts of the signal.

As one moves away from the acoustic level towards the perceptual and/or linguistic levels, the measurement of a given prosodic property will progressively involve segmentation (for example, into syllables), context (such as relative prominence), and structural information (the linguistic interpretation of a syllabic tone, for example, often depends on whether the related syllable is stressed or not, which requires a prior analysis of the segmental layer).

1.2 Prosody modeling

Prosodic models serve two purposes. On the one hand, they can be scientific hypotheses that explain how we communicate with each other, and what we communicate. On the other hand, they can be engineered software systems that are part of a dialog system or speech synthesizer. To a lesser extent, and this is mostly potential, a prosodic model can be the background for a system that recognizes prosody in human speech.

In general, a prosodic model combines two components: an intonation model and a duration model. In this section, we give an overview of some methods for predicting intonation (F0 contours) and duration that have actually been applied in speech synthesis.

1.2.1 Intonation models

1.2.1.1 Intonation model classification

The primary goal of intonation research is to model natural f0 contours of speech, preferably in relation to a transcription and a description of the prosodic intent of the speaker. The starting point of intonation research is the time series of F0, but the interpretation of the F0 information diverges widely among intonation models. Table 1.3 presents one view of how the various intonation models can be classified.

Table 1.3: Intonation model classification

| Under-specified | (in between) | Fully specified
Single component | INTSINT, ToBI, Xu | Tilt, IPO | Olive, Machine learning
Two components | Grønnum | Fujisaki | -
Multiple components | - | - | van Santen

Under-specified or fully specified

The shape of an accent may be fully specified (i.e. defined without gaps) or under-specified (defined by disconnected regions or isolated points). Along another dimension, f0 values at any given time may be treated as a single component or as the combination of multiple components.

The advantage of using an under-specified accent shape is that it leaves sufficient distance between specified accent targets to allow a smooth f0 transition, typically […]

On the other hand, a system with fully specified accents leaves little room to resolve conflicting targets. A simple concatenation of fully specified accents will result in a pitch curve with unnatural jumps at the concatenation joints. Many systems, such as Fujisaki (1983, 1988), use filters to smooth out abrupt changes in F0. Alternatively, van Santen (1997, 2000) requires each accent to begin and end at zero to ensure smooth connections between accents.
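The concatenation problem described above can be illustrated with a short sketch. Two fully specified, flat accent shapes are joined, producing a jump at the joint, and a simple centered moving average is then applied. The moving average is only a stand-in chosen for brevity; it is not the filter used in Fujisaki's model, and the F0 values are invented for the example.

```python
def moving_average(contour, window=5):
    """Centered moving average; the window is truncated at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(contour)):
        lo = max(0, i - half)
        hi = min(len(contour), i + half + 1)
        smoothed.append(sum(contour[lo:hi]) / (hi - lo))
    return smoothed


def max_jump(contour):
    """Largest absolute frame-to-frame change in the contour."""
    return max(abs(b - a) for a, b in zip(contour, contour[1:]))


# Two hypothetical fully specified accents (F0 in Hz), one frame per value.
accent_a = [120.0] * 10
accent_b = [180.0] * 10

raw = accent_a + accent_b      # naive concatenation: abrupt step at the joint
smooth = moving_average(raw)   # the step is spread over several frames

print(max_jump(raw), max_jump(smooth))
```

Smoothing removes the discontinuity but also blurs the accent shapes near the joint, which is the trade-off that under-specified targets try to avoid.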

Single component or many components?

Many intonation models treat surface intonation contours as the superposition of a phrase component and an accent component. Grønnum (1992) and Fujisaki (1983, 1988) are representatives of this view.

A well-defined model that fully specifies accent shape and uses multiple components is van Santen's model (van Santen and Möbius, 1997, 2000; van Santen et al., 1998), where accents are represented by densely populated points, providing a mechanism to describe highly complex accent shapes in detail. We characterize van Santen's system as having multiple components because, in addition to the phrase component, each accent in the phrase also adds a […] component that contributes to the surface f0 contour.

The advantage of multiple components is that they provide a mechanism to separate individual accents from long-term effects. However, if one allows multiple components, then one necessarily faces the problem that there is no unique solution to the decomposition of a single f0 time series into multiple components [1]. Any such decomposition depends on a model of the speech process, and is only as good as the underlying model.

In contrast, Liberman and Pierrehumbert (1984) explicitly reject the notion of a phrase curve and represent intonation contours as a single component. The advantage of representing f0 information as a single component is that the representation of accent heights is then transparent, which lends itself to convenient automatic labeling [1].
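The non-uniqueness problem mentioned above can be shown with a tiny numeric sketch: two different phrase/accent pairs add up to exactly the same surface contour, so the surface contour alone cannot tell them apart. The numbers are invented, and adding the components directly in Hz is a simplifying assumption for this example (some models superpose components in the log-frequency domain instead).

```python
def superpose(phrase, accent):
    """Surface contour as the point-wise sum of two components."""
    return [p + a for p, a in zip(phrase, accent)]


# Decomposition 1: a declining phrase curve plus an accent bump (Hz).
phrase_1 = [140.0, 135.0, 130.0, 125.0, 120.0]
accent_1 = [0.0, 20.0, 30.0, 20.0, 0.0]

# Decomposition 2: move part of the bump from the accent into the phrase.
offset = [0.0, 5.0, 10.0, 5.0, 0.0]
phrase_2 = [p + o for p, o in zip(phrase_1, offset)]
accent_2 = [a - o for a, o in zip(accent_1, offset)]

surface_1 = superpose(phrase_1, accent_1)
surface_2 = superpose(phrase_2, accent_2)
print(surface_1 == surface_2, phrase_1 == phrase_2)
```

Any constraint that picks one decomposition over the other (for example, requiring the phrase curve to decline monotonically) is a modeling assumption, which is the point made in the paragraph above.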

1.2.1.2 Some prosody models

The following gives an overview of the intonation models in Table 1.3.

* INTSINT (Hirst et al., 2000) is an under-specified intonation system that defines an accent by a single point. Fitting quadratic spline curves through these points generates the surface f0.

* ToBI: The most widely used under-specified accent shape is represented by the ToBI model (Beckman and Ayers, 1997; Silverman et al., 1992), which developed from earlier works such as Pierrehumbert (1980), Liberman and Pierrehumbert (1984), and Pierrehumbert and Beckman (1988). Each accent is represented by no more than two points, which specify abstractly the relative contrast of high (H) and low (L). One goal of the ToBI system is to specify a minimal set of categorical labels for intonation. These labels are usually interpreted as phonological distinctions between accent types.

* Tilt (Taylor, 2000; Taylor, 1998) allows more samples than ToBI near the peak of an accent and leaves the other regions unspecified, hence its status halfway to a fully specified system. Tilt considers all accent types to be continuous variations of a single class. Surface variations are accounted for by changes in the continuous parameters.

* IPO (de Pijper, 1983) prepares a piecewise-linear approximation to the pitch contour. The slope and height of these lines are then associated with various types of accents.

* Olive (1975) described a very early fully specified system, following work by Levitt and Rabiner (1970). His model stored the surface pitch vs. time contour as a function of the grammatical structure of the sentence. The contour was then approximated by polynomial splines attached to words, to allow for duration variations.

* Machine learning: Several works using machine-learning techniques generate densely sampled f0 values, including Chen et al. (1992) and Malfrère et al. (1998). We classify these works as fully specified systems even though in some cases the concept of accent may not be clear. Ross and Ostendorf (1999) described an interesting machine-learning system where a discrete learning system would predict vectors attached to phonemes and syllables, and these vectors would in turn drive a (learned) dynamical system to predict f0.

* Fujisaki: Fujisaki’s phonetic intonation model (Fujisaki and Kawai, 1982) was developed from the filter method first proposed by Öhman (1967). Fujisaki states that intonation contours are comprised of two types of components, the phrase and the accent. The production process is represented by a glottal oscillation mechanism which takes phrase and accent information as input and produces a continuous F0 contour as output. The input to the mechanism is in the form of impulses, used to produce phrase shapes, and step functions […]
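As a rough illustration of the Fujisaki entry above, the sketch below generates an F0 contour from one phrase command (an impulse) and one accent command (a step that is switched on and later off). The response functions and the log-domain superposition follow the formulation commonly given in the literature on Fujisaki's model rather than anything stated in this chapter, and the parameter values (alpha, beta, gamma, base frequency, command timings and amplitudes) are assumptions chosen only to produce a plausible curve.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, base_hz, phrase_cmds, accent_cmds):
    """F0 at time t (s): components are summed in the log-F0 domain.

    phrase_cmds: list of (onset_s, magnitude)
    accent_cmds: list of (onset_s, offset_s, amplitude)
    """
    log_f0 = math.log(base_hz)
    for onset, magnitude in phrase_cmds:
        log_f0 += magnitude * phrase_response(t - onset)
    for onset, offset, amplitude in accent_cmds:
        log_f0 += amplitude * (accent_response(t - onset)
                               - accent_response(t - offset))
    return math.exp(log_f0)


# Hypothetical commands: one phrase impulse at 0 s, one accent from 0.3 to 0.6 s.
phrase_cmds = [(0.0, 0.5)]
accent_cmds = [(0.3, 0.6, 0.4)]
contour = [fujisaki_f0(i * 0.01, 100.0, phrase_cmds, accent_cmds)
           for i in range(150)]
print(round(contour[0], 1), round(max(contour), 1))
```

Because both response functions are smooth, the contour has no jumps even though the commands themselves are abrupt, which is how this model sidesteps the concatenation problem discussed earlier.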

1.2.2 Duration modeling

We now give a general overview of modeling the duration component of prosody. Common methods to predict duration in speech synthesis differ in the following aspects [6]:

* Durational unit predicted: The temporal units predicted by most current systems are either the phone (phoneme), often referred to as the “segment”, or the syllable. Since phone durations are eventually required for acoustic synthesis, all syllable-based models include some kind of mechanism for calculating segment durations from the syllable duration. For example, in Barbosa and Bailly’s model, the basic unit is delimited by the onset of the nuclear vowel and the onset of the following vowel. They are computed by a sequential network constrained by an internal clock (basically the speaking rate).

* Predictor factors: Every model uses a particular vector of input features, which are extracted on the linguistic and phonetic levels. The most commonly employed factors include:

✓ on the syllabic level: the degree of accentuation and the position in a higher-level unit, such as the foot or accent group

✓ on the segmental level: the properties of the phone to be synthesized and its neighboring phones

✓ on the phrase level: the location of a segment with respect to a […]

* […] As shown in Figure 1-1, the statistical approaches are subdivided into parametric and non-parametric regression models. Whereas the structure of a parametric regression model, in terms of how it processes the input factors, is determined a priori, non-parametric regression models are developed by unsupervised training and the model structure is determined automatically (multilayer perceptrons, CARTs). The main difference between rule-based and statistical models is that a rule system can be built on relatively little speech data. The formulation of the rules, however, requires a high amount of expert knowledge and considerable optimization.

[Figure 1-1: Category of methods for predicting syllable duration [6]. Legible labels: multiplicative models, products models, GLM.]

* Pause prediction: Some current approaches incorporate the prediction of speech pauses as part of the model; others treat pauses strictly separately.

* Speech rate: Many current TTS systems produce different speech rates by linearly scaling the durations output by the duration model. As the speech rate affects not only the duration of individual segments but also the overall prosodic structure of an utterance, this kind of modification needs to take place at an earlier step of processing, when the phrasal structure of an utterance is determined.
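The linear scaling mentioned in the speech-rate item can be sketched in a few lines. The function name, the rate convention (values above 1 mean faster speech) and the sample durations are assumptions for this example.

```python
def scale_durations(durations_ms, rate):
    """Linearly rescale segment durations; rate > 1 means faster speech."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return [d / rate for d in durations_ms]


# Hypothetical output of a duration model for three segments (ms).
segment_durations_ms = [60.0, 180.0, 90.0]
faster = scale_durations(segment_durations_ms, 1.5)
print(faster)  # [40.0, 120.0, 60.0]
```

Every segment shrinks by the same factor, so pauses, phrasing and relative prominence stay untouched; this is exactly the limitation the text points out.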

This thesis work approach

Modeling intonation and duration in prosody is a complex field, related to both linguistics and acoustics. There are many different methods to predict the intonation and duration of speech. However, there is currently no method that applies completely to Vietnamese.

Within the scope of this thesis, we use the statistical approach to extract some basic patterns, only for modeling some basic cases in Vietnamese prosody.

The following is some information about our approach to modeling Vietnamese prosody:

• Syllable-level factors: tone of the syllable

• Extra-linguistic factors: male/female voice

This approach is described in more detail in Chapter 4.


Chapter 2: Vietnamese language and prosody


An understanding of the phonetic and phonological characteristics of a language plays an important role in studies on speech processing in general and on prosody analysis in particular. Thus, in this chapter, we give a review of the Vietnamese language and Vietnamese prosody.

2.1 Vietnamese language

2.1.1 Vietnamese characteristics

As is well known, Vietnamese is an amorphous language and a tonal/musical language. It has the following characteristics [21]:

1. Vietnamese words are amorphous: they do not change form to show grammatical categories. For instance, French has masculine and feminine word forms (étudiant - étudiante, nouveau - nouvelle), singular and plural word

“in”, “un”, impolite, unreadable, irregular

word structure uses very few morphemes. Vietnamese has at most twenty thousand syllables to create morphemes; thus Vietnamese does not have the features of inflectional languages.


The morpheme index of Vietnamese (number of morphemes M / number of words W) is about 1.06 [13], the lowest index among the 5000 languages in the world [13]. A language whose morpheme index is less than 2 is an analytic language.

The amorphous feature of our language is an essential characteristic, which influences the other characteristics of Vietnamese.

4. Vietnamese is a tonal/musical language. It has six tones, and each tone can contribute to forming the morpheme and the meaning of a word, e.g. ba, bá, bà, bả, bã, bạ; me, mé, mè, mẻ, mẽ, mẹ. The tones give Vietnamese a musical character and make sentences rhythmic and melodious.

5. A syllable (isolated word) of Vietnamese in its full structure has five parts: initial sound (consonant), medial sound (semi-vowel), nucleus sound (vowel or diphthong), final sound (consonant or semi-vowel) and tone.

6. In Vietnamese, syllable and morpheme boundaries coincide: one syllable is one morpheme. In French, partir (to leave) has two syllables, par-tir, and two morphemes, part-ir; vendeur (seller) has two syllables, ven-deur, and two morphemes, vend-eur. In English, “words” has one syllable and two morphemes. In Vietnamese, the sentence “Đẹp vô cùng tổ quốc ta ơi!” (Tố Hữu) has seven morphemes, seven syllables and five words (three monosyllabic words: đẹp, ta, ơi; and two compound words: vô cùng, tổ quốc). In conclusion, one Vietnamese word unit is one syllable, one morpheme and one real word.


7. Almost all Vietnamese vocabulary is formed from one or two morphemes and is monosyllabic or bisyllabic, sometimes polysyllabic. 80% of words are bisyllabic.

8. The difference between the written and the spoken language in grammatical and phonetic rules is not large.

9. Over the course of its formation and development, Vietnamese has borrowed quite a large number of words from foreign languages. Han words are the most numerous, followed by French words, and some of them have been fully assimilated into Vietnamese. For example, đấu tranh, giai cấp, hoà bình, độc lập, tự do and hạnh phúc are Han words (Chinese words); nhà ga (gare), xà phòng (savon) and cà phê (café) are French words.

2.1.2 Vietnamese phoneme system

The Vietnamese phoneme system includes 14 vowels or vowel combinations and 22 consonants.

The Vietnamese vowels comprise 11 vowels and three diphthongs [21]. All vowels are voiced sounds.

Table 2.1: Vietnamese vowels

Reading | Letter | Example
ia | ia, iê, ya, yê | kia kìa, yêu kiều
ua | ua, uô | tua rua, luôn luôn
ưa | ưa, ươ | lưa thưa, lượt thượt


Vietnamese has 22 consonants [21], as listed in Table 2.2.

Transcription Reading Letter Example

Based on these features, Vietnamese consonants can be arranged as in Table 2.3.

Table 2.3: Arrangement of Vietnamese consonants

Columns: articulation position (labial, dental, apical, dorsal, glottal); rows: articulation method


Besides the initial sound (called the INITIAL part), the rest of the syllable is called the FINAL part. A tone is a fundamental frequency variation spreading over the whole syllable. A tone has the same function as a phoneme: it is always assigned to a syllable and its influence covers the entire syllable. There are a few constraints: if a syllable ends with an unvoiced consonant /p, t, k/, only the “sắc” and “nặng” tones are possible; otherwise, in all varieties of Vietnamese, the whole tonal paradigm can occur.
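
The tonal constraint above can be sketched as a small check on the written form of a syllable. This is a minimal sketch under one stated assumption: the finals /p, t, k/ are taken to be written with the letters p, t, c and ch in standard orthography. The function name and constants are hypothetical.

```python
# Assumption: orthographic finals that correspond to the unvoiced stops /p, t, k/.
STOP_FINALS = ("ch", "p", "t", "c")

ALL_TONES = ["ngang", "huyền", "ngã", "hỏi", "sắc", "nặng"]


def allowed_tones(syllable):
    """Return the tone names that the constraint permits for a syllable.

    Only the last letters of the syllable are inspected, so the syllable
    may be given with or without its tone mark.
    """
    if syllable.lower().endswith(STOP_FINALS):
        return ["sắc", "nặng"]
    return list(ALL_TONES)
```

For example, `allowed_tones("hoc")` yields only the two stop-final tones, while `allowed_tones("ban")` yields the whole paradigm.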

Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of each

As Vietnamese is a tonal language, its prosody is composed of two components, which we call “micro-prosody” and “macro-prosody”.

• Micro-prosody is the variation of pitch, duration and intensity of an individual word or syllable. For a tonal language, micro-prosody is very important for distinguishing the tone of the syllable. Thus, the lexical meaning of the synthesized sound depends heavily on the quality of the micro-prosody.

• Macro-prosody is the application of prosody to a whole phrase or sentence. It depends on the type of sentence and the speaker's intentions or emotions. Therefore, the "naturalness" of synthesized sentences depends heavily on the ability to control macro-prosody during the speech synthesis process.

2.2.1 Micro-prosody and tones system in Vietnamese

In Vietnamese, micro-prosody depends heavily on the tone of the syllable. Each tone can contribute to constructing the morpheme and the meaning of a word; it is also a distinctive signal. A tone has the same function as a phoneme: it is always assigned to a syllable and its influence covers the entire syllable. The tones give Vietnamese a musical character and make sentences rhythmic and melodious.

There are six tones in Vietnamese; they are shown in Table 2.5.

Table 2.5: The six Vietnamese tones

Tone 1 | Tone 2 | Tone 3 | Tone 4 | Tone 5 | Tone 6
ngang | huyền | ngã | hỏi | sắc | nặng

Figure 2-1: Example of the contours of six tones, as described in [21]
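
Because each tone other than “ngang” is written with its own diacritic, the tone of a written syllable can be read off from its Unicode decomposition. The sketch below is an illustration only, not part of the thesis method; it uses the tone numbering of Table 2.5, and the function and table names are hypothetical.

```python
import unicodedata

# Combining marks used for the Vietnamese tones, numbered as in Table 2.5.
TONE_MARKS = {
    "\u0300": (2, "huyền"),  # combining grave accent
    "\u0303": (3, "ngã"),    # combining tilde
    "\u0309": (4, "hỏi"),    # combining hook above
    "\u0301": (5, "sắc"),    # combining acute accent
    "\u0323": (6, "nặng"),   # combining dot below
}


def tone_of(syllable):
    """Return (tone number, tone name) for one written Vietnamese syllable."""
    # NFD splits each letter into a base letter plus combining marks, so the
    # tone mark becomes a separate character even on letters such as ê or ơ.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return (1, "ngang")  # no tone mark: level tone
```

Vowel-quality marks (circumflex, breve, horn) are deliberately absent from the table, so a syllable such as “cô” is still reported as tone 1.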

• Tone 1 - Level tone (“ngang”): a high tone. At the beginning of the syllable, it is the highest tone. The steady state of the level contour is observed consistently. The figure below shows the shape of tone 1 for male and female voices (the two lines represent the maximum and minimum F0 values).

Figure 2-2: The shape of Tone 1 with female and male voice [18]

• Tone 2 - Falling tone (“huyền”): the onset of the falling tone is lower than that of tone 1, tone 5 and tone 3. The low F0 at the onset gradually falls toward the end.

Figure 2-3: The shape of Tone 2 with female and male voice [18]

• Tone 3 - Broken tone (“ngã”): the onset is as high as that of the level [...] than the falling tone. The contour of this tone is characterized by an abrupt dip caused by glottalization. In most cases, the bottom of the dip occurs between the mid-point and the point two-thirds from the onset. A creaky voice is heard during this dip.


Figure 2-4: The shape of Tone 3 with female and male voice [18]

• Tone 4 - Curve tone (“hỏi”): the onset is the lowest among the six tones. The low onset falls further gradually until the point two-thirds from the onset. From this point, the extremely low F0 starts to rise toward the end.

Figure 2-5: The shape of Tone 4 with female and male voice [18]

• Tone 5 - Rising tone (“sắc”): the onset is also high. Starting from the high onset, the F0 rises gradually for the first two thirds of the duration. After this point, the rise becomes more rapid.

With tone 5 ending with stop consonants (t, p, c, k), the onset is higher than that of tone 5a and the F0 rises rapidly over a short duration. We call this tone 5b.

Figure 2-7: The shape of Tone 5b with female and male voice [18]

• Tone 6 - Drop tone (“nặng”): the onset is usually higher than that of the falling or curve tone but considerably lower than that of tone 1, tone 5 and tone 3. This tone is characterized by glottalization at the end and by its considerably shorter duration than the other tones: the duration of this tone is approximately two thirds of that of the other tones. The main body of this tone is almost level or slightly falling.

Figure 2-8: The shape of Tone 6 with female and male voice [18]

Tone 6b (tone 6 ending with stop consonants): the onset is nearly equal to that of tone 2. The F0 falls toward the end over a short duration.


Figure 2-9: The shape of Tone 6b with female and male voice [18]

These descriptions hold only for the Northern dialect, in particular the Hanoi dialect, which is the standard dialect of Vietnamese. They differ in the other dialects in the South and the Center of Vietnam. In these regions, there are only 5 tones instead of 6 as in the Hanoi dialect, because tone 3 and tone 4 are pronounced identically.

In continuous speech, tones seldom reach their target values. They are generally affected by context: stressed vs. unstressed syllables, the influence of neighbouring tones, tempo, and the effect of some phenomena in Vietnamese prosody, which we will discuss later.

2.2.2 Macro-prosody and sentence types in Vietnamese

As discussed above, macro-prosody depends on the type of sentence, speaker

✓ An assertive sentence or declaration: the most common type, commonly makes a statement. Ex: Tôi sẽ về nhà. (I am going home.)


✓ An interrogative sentence or question: commonly used to request information. Ex: Khi nào anh sẽ làm việc? (When are you going to work?)

✓ An imperative sentence or command: ordinarily used to make a demand or request. Ex: Mở cửa ra! (Open the door!)

✓ An exclamatory sentence or exclamation: generally a more emphatic form of statement. Ex: Ngày hôm nay tuyệt quá! (What a wonderful day this is!)

• Classification by structure: sentences can also be classified based on their structure (by the number and types of finite clauses), as in the diagram below.

Figure 2-10: Sentence classification by structure [20]

Within the scope of this thesis, we study only the macro-prosody of assertive, interrogative and imperative sentences with a single structure.

Nguyễn and Boulakia [8] give some prosodic characteristics of three types of sentence (assertive, interrogative and imperative), as follows:

• Duration (Tempo): interrogative sentences (Q) are shorter than assertive sentences (S), and this difference is significant. Imperatives (I) are even shorter, but the differences with Q and S are not significant.


• Intensity: the difference is significant for the assertive/imperative (S/I) pair, but not for the S/Q and Q/I pairs.

• Fundamental frequency: the mean F0 value of interrogative and imperative sentences is higher than that of statements, while there is no difference between interrogative and imperative sentences. There is an obvious difference in the last syllable. The phonologically "level" (high) tone falls in statements and is much higher and rising in questions, while for imperatives the mean value and movement lie halfway between. Rising tones rise even more in interrogative and imperative sentences than in statements. This means that the intonation influences the tone of the sentence-final syllable.

Figure 2-11: The sentence “Lan thích ăn cơm không” in Assertive (S) and Interrogative (Q) mode [8]

Figure 2-12: The sentence “Bảo cố gắng tập đi” in

Figure 2-13: The sentence “Tân bỏ đi chứ” in Interrogative (Q) and Imperative (I) mode [8]

Vu M. Q. et al. [16] found that the main differences in intonation lie at the end of the sentence (the zone located after the vertical bar in Figure 2-14): the contour of the last syllable, or of its second half, tends to rise in interrogative sentences.
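
One simple way to quantify such a final rise is to fit a straight line to the F0 samples of the second half of the last syllable and look at the sign of its slope. The sketch below assumes that F0 has already been extracted and that the last syllable has been segmented; the function names and the sample values are hypothetical and do not come from [16].

```python
def f0_slope(times_s, f0_hz):
    """Least-squares slope (Hz per second) of an F0 contour."""
    n = len(times_s)
    if n < 2 or n != len(f0_hz):
        raise ValueError("need at least two matching time/F0 samples")
    mean_t = sum(times_s) / n
    mean_f = sum(f0_hz) / n
    num = sum((t - mean_t) * (f - mean_f) for t, f in zip(times_s, f0_hz))
    den = sum((t - mean_t) ** 2 for t in times_s)
    if den == 0:
        raise ValueError("time values must not all be equal")
    return num / den


def final_half_slope(times_s, f0_hz):
    """Slope over the second half of the last syllable's F0 samples."""
    half = len(times_s) // 2
    return f0_slope(times_s[half:], f0_hz[half:])


# Hypothetical F0 samples (s, Hz) of a sentence-final syllable.
times = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25]
rising = [200.0, 200.0, 200.0, 210.0, 220.0, 230.0]
falling = [200.0, 198.0, 196.0, 190.0, 180.0, 170.0]
rise_slope = final_half_slope(times, rising)
fall_slope = final_half_slope(times, falling)
```

A positive slope would be consistent with the rising tendency described for interrogative sentences; any decision threshold would have to be estimated from a corpus.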


2.2.3 Some special phenomena in Vietnamese prosody

When researching Vietnamese prosody, we have to take into account some special phenomena of Vietnamese. Two of them are “Glottalization” and “Coarticulation”.

Glottalization

Glottalization is the complete or partial closure of the glottis during the articulation of another sound. Glottalization of vowels and voiced consonants is most often realized as creaky voice (partial closure). Glottalization of voiceless consonants usually involves complete closure of the glottis; another way to describe this phenomenon is to say that a glottal stop is made simultaneously with another consonant [24].

Based on the glottalization feature, the six Vietnamese tones can be classified into two groups: tone 3 (“ngã”) and tone 6 (“nặng”) are glottalized, whereas the other tones are non-glottalized. Tone 3 is accompanied by the rasping voice quality occasioned by tense glottal stricture. In careful speech such syllables are sometimes interrupted completely by a glottal stop (or a rapid series of glottal stops). Its trajectory therefore sometimes shows a characteristic break in the voicing at about half of the total duration of the syllable. Tone 6 syllables have the same rasping voice quality as tone 3, drop very sharply and are almost immediately cut off by a strong glottal stop.

Hence, Vietnamese tones are characterized not only by distinct F0 trajectories, but also by articulatory distinctions and the presence/absence of glottalization.

Coarticulation

Coarticulation is the phenomenon in speech production in which successive sounds overlap, as opposed to being produced as entirely separate sounds. This phenomenon can be explained as follows [23]:

• It takes only about a fifth of a second to produce a syllable,


Date posted: 09/06/2025, 12:31
