
Modeling the prosody of Vietnamese language for speech synthesis


MINISTRY OF EDUCATION AND TRAINING HANOI UNIVERSITY OF TECHNOLOGY

Thesis for the degree of MASTER OF SCIENCE

Modeling the prosody of Vietnamese language for speech synthesis

Speciality: “Information processing and Communication”

Code: 23.04.3898

MẠC ĐĂNG KHOA

Supervisor:

Prof. PHẠM THỊ NGỌC YẾN


Faculty of Information Technology

International research center of Multimedia Information, Communication and Application


Special thanks also go to Mrs. Geneviève Caelen-Haumont, PhD students Trần Đỗ Đạt and Vũ Minh Quang, and all members of MICA’s speech group. I could not have done this thesis without your support. Thank you all for your suggestions and your sincere remarks on my research.

I would like to thank Ms. Đoàn Thị Ngọc Hiền, who guided me in recording the corpus. I would also like to thank the many MICA members who spent much time recording and testing for my research.

I am grateful to Prof. Nguyễn Trọng Giảng and MICA’s directorate for providing me with the best working conditions during my time at the International Research Center MICA.

Finally, I owe a great deal to my parents and my sister for their continued support. I also give very special thanks to my girlfriend for her constant encouragement, which gave me strength and motivation in my work and in my life.


Master thesis

Abstract

A Text-To-Speech (TTS) system is a computer system that produces speech from text. In a TTS system, the naturalness of the produced speech depends greatly on the variation of pitch, duration and energy during speaking. We call this the “prosody controlling ability”. A TTS system with good prosody controlling ability can simulate human speech prosody according to the context of speaking.

In tonal languages such as Vietnamese, the prosody of an utterance results from the combination of two components: "micro-prosody", corresponding to the tone of each syllable in a sentence, and "macro-prosody", corresponding to the whole sentence.

The main goal of this thesis is to model the characteristics of Vietnamese prosody for speech synthesis. It focuses on the influence of macro-prosody on micro-prosody in three types of sentence: assertive, interrogative and imperative. The first task is to set up a “prosody corpus” and extract all possible prosody parameters. Based on the extracted data, we defined seventy-two simple prosody patterns for Vietnamese syllables in the three types of sentence. These patterns were then applied to synthesize some simple sentences. Finally, perception experiments were carried out to evaluate the synthesized sentences. The results show that the proposed patterns can be applied successfully to generate the prosody of simple sentences.

This is our preliminary work on Vietnamese prosody, and it considers only the sentence type and the position of the syllable in the sentence. In the future, we expect to continue this research with more factors of Vietnamese prosody, improve our patterns, and apply them to a Vietnamese TTS system.


List of Figures

Figure 1-1: Category of methods for predicting syllable duration [6]

Figure 2-1: Example of the contours of six tones, as described in [21]

Figure 2-2: The shape of Tone 1 with female and male voice [18]

Figure 2-3: The shape of Tone 2 with female and male voice [18]

Figure 2-4: The shape of Tone 3 with female and male voice [18]

Figure 2-5: The shape of Tone 4 with female and male voice [18]

Figure 2-6: The shape of Tone 5 with female and male voice [18]

Figure 2-7: The shape of Tone 5b with female and male voice [18]

Figure 2-8: The shape of Tone 6 with female and male voice [18]

Figure 2-9: The shape of Tone 6b with female and male voice [18]

Figure 2-10: Sentence classification by structure [20]

Figure 2-11: The sentences “Lan thích ăn cơm không” in

Figure 2-12: The sentences “Bảo cố gắng tập đi” in

Figure 2-13: The sentences “Tân bỏ đi chứ” in

Figure 2-14: The differences of F0 contour between Assertive and Interrogative sentence [16]

Figure 3-1: A general function diagram of TTS system [13]

Figure 3-2: Fujisaki model

Figure 3-3: Fujisaki model for tonal language [19]

Figure 3-4: Function diagram of proposal TTS system

Figure 3-5: Prosody generation module

Figure 4-1: Key-syllable segmentation

Figure 4-2: Extracting F0 contour using PRAAT

Figure 4-3: An example of prosody pattern

Figure 5-1: An example of synthesized non-sense phrase

Figure 5-2: Perception test 1


Figure 5-4: Interface for Perception test 2

Figure 5-5: Correct recognition rate with 8 tones of last syllable

Figure 5-6: Correct recognition rate (%) with other types of sentences

Figure 5-7: Result comparison of three experiments


List of Tables

Table 1.1: Prosody functions

Table 1.2: Links between levels of representation of prosodic phenomena [13]

Table 1.3: Intonation model classification

Table 2.1: Vietnamese vowels

Table 2.2: Vietnamese consonants

Table 2.3: Arrangement of Vietnamese consonants

Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of each phonetic unit [14]

Table 2.5: The six Vietnamese tones

Table 3.1: Comparison between direct pattern and model pattern

Table 4.1: Prosody corpus structure

Table 4.2: Prosody corpus text information

Table 4.3: Recording information of Prosody corpus

Table 5.1: Confusion matrix (in %) for 8 tones with male voice

Table 5.2: Confusion matrix (in %) for 8 tones with female voice

Table 5.3: Confusion matrix (%) of sentence types with male voice

Table 5.4: Confusion matrix (%) of sentence types with female voice

Table 5.5: Test data for Experiment 2

Table 5.6: Confusion matrix (in %) of sentence types (with male voice)

Table 5.7: Confusion matrix (in %) of sentence types (with female voice)

Table 5.8: Confusion matrix (in %) of sentence types (average of Male and Female)

Table 5.9: Correct recognition rate (%) with other types of sentences

Table 5.10: Result of three experiments


Table of contents

Acknowledgment

Abstract

List of Figures

List of Tables

Table of contents

0 INTRODUCTION

1 PROSODY AND PROSODIC MODEL

1.1 Overview of prosody

1.1.1 The concept of prosody

1.1.2 Major components of prosody

1.1.3 The functions of prosody

1.1.4 Levels of representation of prosodic phenomena

1.2 Prosody modeling

1.2.1 Intonation models

1.2.2 Duration modeling

1.2.3 This thesis work approach

2 VIETNAMESE LANGUAGE AND PROSODY

2.1 Vietnamese language

2.1.1 Vietnamese characteristics

2.1.2 Vietnamese phoneme system

2.1.3 Syllable structure

2.2 Vietnamese prosody

2.2.1 Micro-prosody and tones system in Vietnamese

2.2.2 Macro-prosody and sentence types in Vietnamese

2.2.3 Some special phenomena in Vietnamese prosody

3 TTS SYSTEM AND PROSODY GENERATION

3.1 An overview of TTS system

3.2 Prosody generation

3.2.1 Overview of prosody generation

3.2.2 From text to prosody

3.3 Other researches and our proposal

4 PROSODY PATTERNS EXTRACTION

4.1 Prosody corpus


4.1.1 Objectives

4.1.2 Define the corpus text

4.1.3 Recording

4.1.4 Sentence segmentation

4.2 Analysis and extracting prosody parameters

4.2.1 Segmentation

4.2.2 Extracting prosody parameters of key-syllable

4.3 Proposal the patterns for Vietnamese prosody

4.3.1 Methodology

4.3.2 Prosody patterns

4.3.3 Some visual remarks on extracted patterns

5 EXPERIMENTS AND EVALUATION

5.1 Experiment 1: Tone and non-sense phrase

5.1.1 Objectives

5.1.2 Method and Implementation

5.1.3 Results and discussion

5.2 Experiment 2: Multi-type sentences

5.2.1 Objectives

5.2.2 Method and Implementation

5.2.3 Results and discussion

5.3 Comparison and conclusion

6 CONCLUSION AND PERSPECTIVES

REFERENCES

APPENDIX

A: Text for prosody corpus

B: Datasheet of prosody patterns


to reach to human speech

In Vietnam, there are currently some Vietnamese speech synthesis systems, such as VnVoice (developed by the Institute of Information Technology) and HoaSung (developed by the International Research Center MICA). This research has obtained some encouraging results. However, before these systems can be released to the market, the quality of the produced speech has to be improved, especially the naturalness of its prosody.

Thus, this thesis aims to study the characteristics of Vietnamese prosody in order to apply them to speech synthesis. This work is carried out at the International Research Center of Multimedia Information, Communication and Application (MICA) and is part of MICA’s project VN-Synthesis.

Together with the research of PhD student Tran Do Dat at MICA, we have already developed a speech synthesis system based on the concatenation of sound samples. The first version can now produce sound from a detailed text description, which consists of:


Chapter 0: Introduction

• The sequence of phonemes composing the utterance: this can be obtained automatically from the raw text using a "phonetization” module, whose development is currently underway.

• All information related to voice modulation: mostly the pitch, energy and duration variations that constitute the intonation or prosody of the uttered statement. We call this the “prosody description”.

For tonal languages such as Vietnamese, the prosody of speech is composed of two components, which we call “micro-prosody” and “macro-prosody”:

• Micro-prosody is the variation of pitch, duration and intensity of an individual word or syllable. For a tonal language, micro-prosody is very important for distinguishing the syllable’s tone. Thus, the meaning of the synthesized sound depends greatly on the quality of the micro-prosody.

• Macro-prosody is the application of prosody to a whole phrase or sentence. It depends on the type of sentence, the speaker's intentions, emotions, etc. Therefore, the "naturalness" of synthesized speech depends on the ability to control macro-prosody during the speech synthesis process.

Objectives and Tasks

This thesis is part of MICA's speech synthesis research, and its main goal is to extract the characteristics of Vietnamese prosody in order to generate the “prosody description” for speech synthesis.

In this thesis, we focus only on the differences of Vietnamese tones in different positions in the sentence and in different types of sentence. In other words, we study the influence of macro-prosody on micro-prosody.

The first task is to set up a corpus for research on Vietnamese prosody. With this corpus, we extract and analyze the fundamental frequency, duration and intensity of syllables in eight Vietnamese tones, in three positions and in three types of sentence.


After that, using these prosody parameters, we define simple prosody patterns for Vietnamese tones, corresponding to the cases of a syllable in three types of sentence: assertive, interrogative and imperative. By applying these patterns to re-synthesize some simple sentences and conducting perception experiments, we can examine the appropriateness of these prosody patterns.

Thesis outline

This thesis is structured as follows:

• Chapter 1 starts with Section 1.1, which gives some background on prosody along with definitions of the terms used in this thesis. Section 1.2 briefly presents prosody modeling and some prosodic models.

• Chapter 2 gives an overview of the Vietnamese language and Vietnamese prosody.

• Chapter 3 starts with an introduction to Text-to-Speech systems, the general structure of a TTS system, and prosody generation. In the last section of this chapter, we present some related work and propose a simple structure for the prosody generation module of a TTS system.

• Chapter 4: Sections 4.1 and 4.2 describe our work on setting up and analyzing the Vietnamese prosody corpus. In Section 4.3, we propose a set of prosody patterns for Vietnamese syllables.

• Chapter 5 presents a series of perception experiments for evaluating our proposed patterns.

• Chapter 6 concludes the work presented in the thesis and gives suggestions for further work.


Chapter 1: Prosody and Prosodic model

1 Prosody and Prosodic model

In this chapter, we give an overview of prosody and explain some terms used in this thesis. The concept of prosody modeling and some prosodic models are then briefly presented.

1.1 Overview of prosody

1.1.1 The concept of prosody

There is no exact definition of the term “prosody”. We can use the term "prosody" broadly, meaning “a time series of speech-related information that is not predictable from a reasonable window (i.e. word-sized or sentence-sized) applied to the phoneme sequence” [1].

Viewed in the large, prosody is a parallel channel for communication, carrying some information that cannot be simply deduced from the lexical channel. All aspects of prosody are transmitted by muscle motions, and in most of them, the recipient can perceive, fairly directly, the motions of the speaker.

Clearly, with that broad definition of prosody, hand gestures and eyebrow and face motions can be considered prosody, because they carry information that modifies and can even reverse the meaning of the lexical channel. However, in the domain of speech processing, we concentrate on the speech aspect of prosody. Thus, the


signal, the prosody is represented by three components: “Fundamental frequency (F0)”, “Duration” and “Intensity”

“Prosody” and “Intonation”

The term prosody refers to certain properties of the speech signal, such as audible changes in pitch, loudness and syllable length. For some authors, the set of prosodic features also includes other aspects related to speech timing, such as rhythm and speech rate [13].

Some authors use the term intonation as a synonym for prosody; others restrict it to the tonal (melodic) aspects of prosody. In this thesis, intonation refers to pitch variation in speech production and is part of prosody [13]. In other words, we have:

Prosody = Intonation + Duration

1.1.2 Major components of prosody

As discussed above, prosody consists of:

• Pitch (Fundamental frequency): Among prosodic events, the most overt are changes in pitch, which together constitute the pitch contour of the utterance (the F0 contour of the speech signal). Some analyses of sentence-level pitch contours show that the pitch contour of longer utterances can be broken down into a sequence of elementary contours, which can further be divided into syllabic contours [13].

• Duration: duration in prosody concerns the length of a sentence, phrase, word, syllable, voiced part of a syllable, syllabic nucleus, and so on. The duration of syllables and speech sounds depends on several (dependent or interdependent) factors, such as speech rate, rhythm, phonetic nature, etc. In most cases, the absolute duration of an event is easily measured. Sometimes, however, the boundary of an event is not obvious to define.


• Stress (Intensity): stress is a prosodic property that has been described since the very first work on prosody in phonetics. It was said to be related to loudness and phonological force. Both these characterizations refer to the perceptual form of prosody: the syllable carrying stress is prominent with respect to the surrounding syllables, due either to its loudness or to its dynamic properties.

1.1.3 The functions of prosody

Prosody, as expressed in pitch, gives clues to many channels of linguistic and para-linguistic information. Linguistic functions such as stress and tone tend to be expressed as local excursions of pitch movement. Intonation types and para-linguistic functions may affect the global pitch setting, in addition to characteristic local pitch excursions near the edge of the sentence (i.e. boundary tones) [1].

Prosody used to convey lexical meaning: Stress, accentual and tone languages

• Stress language: English is an example of a stress language. Stress location is part of the lexical entry of each English word. For example, "apple" and "orange" both have stress on the first syllable, while "banana" has stress on the second syllable. When an English word is spoken in isolation with declarative intonation, f0 typically peaks on the stressed syllable.

• Accentual language: Japanese is an example of an accentual language. A word is lexically marked as accented (on a particular syllable) or unaccented. A simplified description is that pitch rises near the beginning of an accentual phrase and falls on the accented syllable. For a detailed analysis, see Beckman and Pierrehumbert (1988).

• Tone language: Mandarin and Vietnamese are examples of lexical tone languages. Each syllable is lexically marked with one lexical tone. Tones have distinctive pitch contours. Altering the pitch contour may


perhaps the meaning of a sentence. For example, in Vietnamese the syllables “ta” (we), “tà” (lap of a dress), “tã” (nappy), “tả” (to describe), “tá” (dozen) and “tạ” (quintal) all have different meanings.

Prosody used to convey non-lexical information: Intonation type (Question vs declarative sentences)

Languages may employ prosody in different ways to differentiate declarative sentences from questions. A general trend is that questions are associated with higher pitch somewhere in the sentence, most commonly near the end. This may be manifested as a final rising contour, or as a higher or expanded pitch range near the end of the sentence. In English, declarative intonation is marked by a falling ending, while yes-no question intonation is marked by a rising one, as shown on the last digit "one" in the English examples. Russian questions, on the other hand, use strong emphasis on a key word instead of a rising tail. Chinese questions are manifested by an expanded pitch range near the end of the sentence; however, the speaker preserves the lexical tone shapes [1].

Prosody used to convey discourse functions: Focus, prominence, discourse segments, etc

Topic initialization is typically associated with high pitch. Pitch is typically raised in the discourse-initial section and lowered in the discourse-final section. Also, new information in the discourse structure is typically accented, while old information is de-accented [1].

Prosody used to convey emotion

Most experiments studying emotional speech study stylized emotion, as delivered by actors and actresses. In these acted-out emotions, a few categories of emotions can be reliably identified by listeners, and one can find consistent acoustic correlates of these categories. For example, excitement is expressed by high pitch and fast speed, while sadness is expressed by low pitch and slow speed. Hot anger is characterized by over-articulation, fast, downward pitch movement, and overall elevated pitch. Cold anger shares many attributes with hot anger, but the pitch range is set lower.

The study of emotion in natural speech is a lot more complicated. It is generally recognized that speakers show mixed feelings and ambiguous states of mind, and that emotions do not fall into clear-cut categories [1].

We have the summary of prosody functions in Table 1.1:

Table 1.1: Prosody functions

Linguistic (Lexicon information)

Paralinguistic (non-lexicon information)

Discourse function Extra

- …

In this thesis, we focus only on the functions of prosody that modify meaning, namely tones and sentence types in Vietnamese prosody.

1.1.4 Levels of representation of prosodic phenomena

As for other properties of the speech signal, prosodic events can be studied at various levels of representation (see Table 1.2) [13].

• First, the acoustic level: the acoustic manifestation of prosody (fundamental frequency, amplitude and duration) can be measured directly, using specialized hardware or algorithms (such as pitch determination algorithms).

• Second, the perceptual level represents the prosodic events as heard by the listener. As for the spectral properties of speech sounds, acoustic characteristics that can be measured are not always perceptible. The perceptual representation is accessible to the individual listener, but this mental representation can hardly be measured. Alternatively it can be


• Finally, the linguistic level represents the prosody of an utterance as a sequence of abstract units (signs, symbols), some of which have a communicative function in speech, while others may just fulfill syntactic requirements. The linguistic structure of prosody is not some hidden code that can simply be revealed using some standard procedure.

Table 1.2: Links between levels of representation of prosodic phenomena [13]

Acoustic level | Perceptual level | Linguistic level
Fundamental frequency (F0) | Pitch | Tone, intonation, aspect of stress

Given the different nature of these representations, it is important to keep them apart. It can be helpful to have the terminology reflect the level of representation. For instance, measuring loudness does not equal measuring signal energy. The perception of loudness is not exclusively related to the amplitude at one point of the signal; it also depends on the duration of the speech fragment whose loudness we are measuring, and on the loudness of other parts of the signal.

As one moves away from the acoustic level towards the perceptual and/or linguistic levels, the measurement of a given prosodic property will progressively involve segmentation (for example, into syllables), context (such as relative prominence), and structural information (the linguistic interpretation of a syllabic tone, for example, often depends on whether the related syllable is stressed or not, which requires a prior analysis of the segmental layer).

1.2 Prosody modeling

Prosodic models serve two purposes. On one hand, they can be scientific hypotheses that explain how we communicate with each other, and what we communicate. On the other hand, they can be engineered software systems that are part of a dialog system or speech synthesizer. To a lesser extent, and this is mostly potential, a prosodic model can be the background for a system that recognizes prosody in human speech.


In general, a prosodic model combines two components: an intonation model and a duration model. In this section, we give an overview of some methods for predicting intonation (F0 contours) and duration that have actually been applied in speech synthesis.

1.2.1 Intonation models

1.2.1.1 Intonation model classification

The primary goal of intonation research is to model the natural f0 contours of speech, preferably in relation to a transcription and a description of the prosodic intent of the speaker. The starting point of intonation research is the time series of F0, but the interpretation of the F0 information diverges widely among intonation models. Table 1.3 presents one view of how the various intonation models can be classified.

Table 1.3: Intonation model classification (intonation models classified by the way they describe prosody)

                 | Under-specified | -        | -         | Fully specified
Single component | INTSINT         | ToBI, Xu | Tilt, IPO | Olive, Machine learning
Two components   | Grønnum         | -        | Fujisaki  | -

Under-specified or Fully specified

The shape of an accent may be fully specified (i.e. defined without gaps) or under-specified (defined by disconnected regions or isolated points). Along another dimension, f0 values at any given time may be treated as a single component or as the combination of multiple components.

The advantage of using an under-specified accent shape is that it allows sufficient distance between specified accent targets to allow a smooth f0 transition, typically by way of interpolation. The drawback is that it ignores changes of shape between


On the other hand, a system with fully specified accents leaves little room to resolve conflicting targets. A simple concatenation of fully specified accents will result in a pitch curve with unnatural jumps at the concatenation joints. Many systems, such as Fujisaki (1983, 1988), use filters to smooth out abrupt changes in F0. Alternatively, van Santen (1997, 2000) requires each accent to begin and end at zero to ensure smooth connections between accents.
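The contrast between the two representations can be illustrated with a minimal sketch. It assumes hypothetical accent targets and accent shapes (all F0 values in Hz are invented for illustration) and uses plain linear interpolation, which is only one possible interpolation scheme and is not taken from any of the models cited here.

```python
def interpolate_targets(targets, times):
    """Linearly interpolate F0 between sparse (time, f0) targets."""
    targets = sorted(targets)
    contour = []
    for t in times:
        if t <= targets[0][0]:
            contour.append(targets[0][1])      # hold the first target
        elif t >= targets[-1][0]:
            contour.append(targets[-1][1])     # hold the last target
        else:
            for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
                if t0 <= t <= t1:
                    w = (t - t0) / (t1 - t0)
                    contour.append(f0 + w * (f1 - f0))
                    break
    return contour


def largest_jump(contour):
    """Largest F0 difference between two neighbouring samples."""
    return max(abs(b - a) for a, b in zip(contour, contour[1:]))


# Under-specified: three isolated targets (hypothetical values).
targets = [(0.0, 120.0), (0.5, 180.0), (1.0, 110.0)]
times = [i / 10 for i in range(11)]
smooth = interpolate_targets(targets, times)

# Fully specified: two dense accent shapes simply concatenated
# (hypothetical values); the joint 180 -> 140 is an abrupt jump.
accent_a = [120.0, 150.0, 180.0]
accent_b = [140.0, 125.0, 110.0]
concatenated = accent_a + accent_b
```

Comparing `largest_jump(smooth)` with `largest_jump(concatenated)` shows the discontinuity at the joint that a smoothing filter or a zero-endpoint constraint is meant to remove.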

Single component or many components?

Many intonation models treat surface intonation contours as the superposition of a phrase component and an accent component. Grønnum (1992) and Fujisaki (1983, 1988) are representatives of this view.

A well-defined model that fully specifies accent shape and uses multiple components is van Santen's model (van Santen and Möbius, 1997, 2000; van Santen et al., 1998), where accents are represented by densely populated points, providing a mechanism to describe highly complex accent shapes in detail. We characterize van Santen's system as having multiple components because, in addition to the phrase component, each accent in the phrase also adds a phrase-length component that contributes to the surface f0 contour.

The advantage of multiple components is that they provide a mechanism to separate individual accents from long-term effects. However, if one allows multiple components, then one necessarily faces the problem that there is no unique solution to the decomposition of a single f0 time series into multiple components [1]. Any such decomposition depends on a model of the speech process, and is only as good as the underlying model.

In contrast, Liberman and Pierrehumbert (1984) explicitly reject the notion of a phrase curve and represent intonation contours as a single component. The advantage of representing f0 information as a single component is that the representation of accent heights is then transparent, which lends itself to convenient automatic labeling [1].


1.2.1.2 Some prosody models

The following gives an overview of the intonation models in Table 1.3:

• INTSINT (Hirst et al., 2000) is an under-specified intonation system that defines an accent by a single point. The surface f0 is generated by fitting quadratic spline curves through these points.

• ToBI: The most widely used under-specified accent shape is represented by the ToBI model (Beckman and Ayers, 1997; Silverman et al., 1992), which developed from earlier works such as Pierrehumbert (1980), Liberman and Pierrehumbert (1984), and Pierrehumbert and Beckman (1988). Each accent is represented by no more than two points, which specify abstractly the relative contrast of high (H) and low (L). One goal of the ToBI system is to specify a minimal set of categorical labels for intonation. These labels are usually interpreted as phonological distinctions between accent types.

• Xu: Xu et al. (1999) represent Chinese tones with under-specified, static or dynamic targets. The surface f0 contours are generated with a model that approaches these targets asymptotically within the domain of a syllable.

• Tilt (Taylor, 2000; Taylor, 1998) allows more samples than ToBI near the peak of an accent and leaves the other regions unspecified, hence its status halfway to a fully specified system. Tilt considers all accent types to be continuous variations of a single class. Surface variations are accounted for by changes in the continuous parameters.

• IPO (de Pijper, 1983) prepares a piecewise-linear approximation to the pitch contour. The slope and height of these lines are then associated with various types of accents.

• Olive: Olive (1975) described a very early fully specified system, following work by Levitt and Rabiner (1970). His model stored the surface pitch vs. time contour as a function of the


approximated by polynomial splines attached to words, to allow for duration variations.

• Machine learning: Several works using machine-learning techniques generate densely sampled f0 values, including Chen et al. (1992) and Malfrère et al. (1998). We classify these works as fully specified systems, even though in some cases the concept of accent may not be clear. Ross and Ostendorf (1999) described an interesting machine-learning system in which a discrete learning system predicts vectors attached to phonemes and syllables, and these vectors in turn drive a (learned) dynamical system to predict f0.

• Fujisaki: Fujisaki’s phonetic intonation model (Fujisaki and Kawai, 1982) was developed from the filter method first proposed by Öhman (1967). Fujisaki states that intonation contours are composed of two types of components, the phrase and the accent. The production process is represented by a glottal oscillation mechanism which takes phrase and accent information as input and produces a continuous F0 contour as output. The input to the mechanism is in the form of impulses, used to produce phrase shapes, and step functions, which produce accent shapes [10]. The Fujisaki model has been successfully applied to decomposing F0 contours in many languages, such as Japanese, German and Finnish, and in some tonal languages, such as Chinese and Thai. Research on applying the Fujisaki model to Vietnamese is currently under way [11]. We will return to this model in Chapter 3.
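To make the impulse and step-function description more concrete, the following is a minimal sketch of the commonly cited formulation of the Fujisaki model, in which the logarithm of F0 is the sum of a base value, the responses to phrase commands (impulses) and the responses to accent commands (steps). The function names, the constants `alpha`, `beta` and `gamma`, and all command timings and amplitudes are assumptions chosen for illustration; they are not values from this thesis.

```python
import math


def phrase_response(t, alpha=2.0):
    """Response of the phrase control mechanism to an impulse at t = 0."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Response of the accent control mechanism to a step at t = 0,
    limited to the ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, base_f0, phrase_cmds, accent_cmds):
    """F0 (Hz) at time t (s).

    phrase_cmds: list of (onset_time, amplitude) impulses.
    accent_cmds: list of (start_time, end_time, amplitude) steps.
    """
    log_f0 = math.log(base_f0)
    for onset, amp in phrase_cmds:
        log_f0 += amp * phrase_response(t - onset)
    for start, end, amp in accent_cmds:
        log_f0 += amp * (accent_response(t - start) - accent_response(t - end))
    return math.exp(log_f0)


# Hypothetical commands: one phrase impulse and one accent step.
phrase_cmds = [(0.0, 0.5)]
accent_cmds = [(0.2, 0.5, 0.4)]
contour = [fujisaki_f0(i / 100, 100.0, phrase_cmds, accent_cmds)
           for i in range(150)]
```

Before any command starts, the contour sits at the base frequency; each command then raises F0 smoothly and lets it decay, so no extra smoothing step is needed at the joints.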

1.2.2 Duration modeling

We now give a general overview of modeling the duration component of prosody. Common methods for predicting duration in speech synthesis differ in the following aspects [6]:


• Durational unit predicted: the temporal unit predicted by most current systems is either the phone (phoneme), often referred to as the “segment”, or the syllable. Since phone durations are eventually required for the acoustic synthesis, all syllable-based models include some kind of mechanism for calculating segment durations from the syllable duration. For example, in Barbosa and Bailly’s model, the basic unit is delimited by the onset of the nuclear vowel and the onset of the following vowel. These units are computed by a sequential network constrained by an internal clock (basically the speaking rate).

• Predictor factors: Every model uses a particular vector of input features, which are extracted on the linguistic and phonetic levels. The most commonly employed factors include:

- on the syllabic level: the degree of accentuation and the position in a higher-level unit, such as the foot or accent group

- on the segmental level: the properties of the phone to be synthesized and its neighboring phones

- on the phrase level: the location of a segment with respect to a minor or major boundary and the position of the phrase in a sentence

• The prediction method: The algorithms used for calculating a numerical duration value from the vector of input features can be roughly divided into rule systems and statistical approaches. In Figure 1-1, the statistical approaches are subdivided into parametric and non-parametric regression models. Whereas the structure of a parametric regression model, in terms of how it processes the input factors, is determined a priori, non-parametric regression models are developed by unsupervised training and the model structure is determined

Trang 26

Chapter 1: Prosody and Prosodic model

on relatively little speech data The formulation of the rules, however, require a high amount of expert knowledge and considerable optimization effort by trial-and-error In contrast, statistical approaches are built from a process is relatively effortless Furthermore, the importance of individual factors can be easily assessed by the way the statistical models prioritize them

Figure 1-1: Category of methods for predicting syllable duration [6]

• Pause Prediction: Some current approaches incorporate the prediction of speech pauses as part of the model; others treat pauses strictly separately.

• Speech Rate: Many current TTS systems produce different speech rates by linearly scaling the durations output by the duration model. As the speech rate affects not only the duration of individual segments but also the overall prosodic structure of an utterance, this kind of modification needs to take place at an earlier step of processing, when the phrasal structure of an utterance is determined.
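The linear scaling mentioned in the last item can be illustrated with a minimal sketch. The function name, the `rate_factor` convention and the duration values are hypothetical and are not taken from any particular TTS system.

```python
def scale_durations(durations_ms, rate_factor):
    """Linearly scale segment durations for a target speech rate.

    rate_factor > 1 means faster speech (shorter segments);
    rate_factor < 1 means slower speech (longer segments).
    This convention is an assumption made for the sketch.
    """
    if rate_factor <= 0:
        raise ValueError("rate_factor must be positive")
    return [d / rate_factor for d in durations_ms]


# Hypothetical segment durations (ms), standing in for a duration model's output.
predicted = [80.0, 120.0, 60.0, 200.0]
faster = scale_durations(predicted, 1.25)
print(faster)
```

Every segment is shortened by the same proportion, so the phrasing and the relative timing within the utterance stay unchanged. This is exactly the limitation pointed out above.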

1.2.3 The approach of this thesis

Modeling intonation and duration in prosody is a complex field that relates to both linguistics and acoustics. There are many different methods for predicting the intonation and duration of speech. However, no method is currently fully applicable to Vietnamese.

Within the scope of this thesis, we use the statistical approach to extract some basic patterns, in order to model some basic cases of Vietnamese prosody.

The following summarizes our approach to modeling Vietnamese prosody:

• Method: Statistical approach: calculate the average values of F0, intensity and duration from a corpus

• Syllable level factors: Tone of syllable

• Extra-linguistic factors: Male/Female voice

This approach will be described in more detail later in this thesis.
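As a rough illustration of the averaging step listed above, the sketch below groups per-syllable measurements by tone and speaker gender and takes the mean of each quantity. The record layout, field names and values are hypothetical; the corpus format actually used in this thesis is not shown here.

```python
from collections import defaultdict
from statistics import mean


def average_by_tone_and_gender(records):
    """Average F0, intensity and duration per (tone, gender) group.

    Each record is a dict with the hypothetical keys
    'tone', 'gender', 'f0_hz', 'intensity_db' and 'duration_ms'.
    """
    groups = defaultdict(list)
    for record in records:
        groups[(record["tone"], record["gender"])].append(record)

    averages = {}
    for key, items in groups.items():
        averages[key] = {
            "f0_hz": mean(item["f0_hz"] for item in items),
            "intensity_db": mean(item["intensity_db"] for item in items),
            "duration_ms": mean(item["duration_ms"] for item in items),
        }
    return averages


# Toy measurements, invented for illustration only.
toy_corpus = [
    {"tone": 1, "gender": "female", "f0_hz": 250.0, "intensity_db": 70.0, "duration_ms": 300.0},
    {"tone": 1, "gender": "female", "f0_hz": 270.0, "intensity_db": 72.0, "duration_ms": 320.0},
    {"tone": 2, "gender": "male", "f0_hz": 110.0, "intensity_db": 68.0, "duration_ms": 310.0},
]
averages = average_by_tone_and_gender(toy_corpus)
print(averages[(1, "female")])
```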

Chapter 2: Vietnamese language and prosody

An understanding of the phonetic and phonological characteristics of a language plays an important role in studies of speech processing in general and of prosody analysis in particular. Thus, in this chapter, we review the Vietnamese language and Vietnamese prosody.

2. Vietnamese word structure does not use affixes (prefixes, suffixes, and infixes); Vietnamese is a non-affix language. By contrast, in French or in English, the antonym of a word can be formed by adding a prefix such as “im-”, “ir-” or “un-”: impolite, unreadable, irregular…

3. Vietnamese word structure uses very few morphemes. Vietnamese has at most twenty thousand syllables from which to create morphemes; thus Vietnamese does not have the features of inflectional languages.

The morpheme index of Vietnamese (number of morphemes M / number of words W) is about 1.06 [13]; this is the lowest index among the 5000 languages in the world [13]. A language whose morpheme index is less than 2 is an analytic language.

This amorphous feature is an essential characteristic of Vietnamese, and it influences the other characteristics of the language.

4. Vietnamese is a tonal (musical) language. It has six tones, and each tone can contribute to forming the morpheme and the meaning of a word, e.g. ba, bá, bà, bả, bã, bạ; me, mé, mè, mẻ, mẽ, mẹ. The tones give Vietnamese a musical character and make sentences rhythmic and melodious.

5. A syllable (isolated word) of Vietnamese in full structure has five parts: initial sound (consonant), medial sound (semi-vowel), nucleus sound (vowel or diphthong), final sound (consonant or semi-vowel) and tone. Within a word, the consonant and the vowel play an essential role; they are the core of the word. They can create one syllable by themselves. Apart from the initial consonant, the rest of the word is called a final (vần). Vietnamese has 155 basic finals [13].

6. In Vietnamese, syllable boundaries and morpheme boundaries coincide: one syllable is one morpheme. In French, partir (to leave) has two syllables, par-tir, and two morphemes, part-ir; vendeur (seller) has two syllables, ven-deur, and two morphemes, vend-eur. In English, “words” has one syllable and two morphemes. In Vietnamese, the sentence “Đẹp vô cùng tổ quốc ta ơi!” (Tố Hữu) has seven morphemes, seven syllables, and five words (three monosyllabic words: đẹp, ta, ơi; and two compound words: vô cùng, tổ quốc). In conclusion, one Vietnamese word unit is one syllable, one morpheme and one real word.

7. Almost all Vietnamese vocabulary is formed from one or two morphemes and is monosyllabic or bisyllabic, sometimes polysyllabic. 80% of words are bisyllabic.

8. The difference between the written and the spoken language in grammatical and phonetic rules is not large.

9. Throughout its formation and development, Vietnamese has absorbed quite a large number of words from foreign languages. Han words are the most numerous, followed by French words, and some of them have been fully assimilated into Vietnamese. For example, đấu tranh, giai cấp, hoà bình, độc lập, tự do, hạnh phúc are Han words (Chinese words); nhà ga (gare), xà phòng (savon), cà phê (café) are French words.
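The morpheme index M/W introduced under item 3 can be worked through on the example sentence from item 6. The sketch below assumes, as stated in item 6, that each syllable is one morpheme. The word segmentation is the one given in the text, and the function name is hypothetical.

```python
def morpheme_index(words):
    """Return M / W for a list of words, each given as a list of syllables.

    Assumes one syllable corresponds to one morpheme.
    """
    if not words:
        raise ValueError("at least one word is required")
    morpheme_count = sum(len(word) for word in words)
    return morpheme_count / len(words)


# "Đẹp vô cùng tổ quốc ta ơi!" segmented into five words as in the text.
sentence = [["đẹp"], ["vô", "cùng"], ["tổ", "quốc"], ["ta"], ["ơi"]]
index = morpheme_index(sentence)
print(index)  # 7 morphemes / 5 words = 1.4
```

This single sentence gives 1.4. The figure of about 1.06 quoted above is a language-wide value and would not be expected from one short sentence.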

2.1.2 Vietnamese phoneme system

The Vietnamese phoneme system includes 14 vowels or vowel combinations and 22 consonants.

The Vietnamese vowels include 11 vowels and three diphthongs [21]. All vowels are voiced sounds.

Table 2.1: Vietnamese vowels

Transcription   Reading   Letters           Example
/ie/            ia, yê    ia, iê, ya, yê    kia kìa, yêu kiều

Vietnamese has 22 consonants [21], as listed in Table 2.2.

Table 2.2: Vietnamese consonants

Transcription   Reading           Letter   Example
/z/             dê, giê and dờ    d, gi    duyên dáng, giữ gìn

Table 2.3: Arrangement of Vietnamese consonants

Articulation position (columns): labial, apical, dental, laminal, palatal, dorsal, glottal
Articulation method (surviving row): stop, noise, non-aspirated, voiced: b, d


2.1.3 Syllable structure

Vietnamese grammarians and linguists have long considered the syllable a fundamental unit of Vietnamese. A syllable in full structure (a tonal syllable) has five parts: initial sound, medial sound, nucleus sound, final sound and tone [21]. For instance, the syllable “toán” has the following components: initial sound /t/, medial sound /o/, nucleus sound /a/, final sound /n/, and tone “sắc” (or rising tone). Every syllable must have a nucleus sound; the other components are optional. A nucleus sound alone can form a syllable, for instance a, ô, ê… Apart from the initial sound (called the INITIAL part), the rest of the syllable is called the FINAL part. A tone is a fundamental frequency variation spreading over the whole syllable. A tone has the same function as a phoneme: it is always assigned to a syllable and its influence covers the entire syllable. There are a few constraints: if a syllable ends with one of the unvoiced consonants /p, t, k/, only the “sắc” and “nặng” tones are possible; otherwise, in all varieties of Vietnamese, the whole tonal paradigm can occur.
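The five-part structure and the tone constraint described above can be sketched as a small data structure. The class and field names are hypothetical, phonemes are written as plain strings, and only the two rules stated in the text are checked (a nucleus is required; a final /p, t, k/ allows only “sắc” and “nặng”).

```python
from dataclasses import dataclass
from typing import Optional

TONES = {"ngang", "huyền", "ngã", "hỏi", "sắc", "nặng"}
STOP_FINALS = {"p", "t", "k"}
TONES_AFTER_STOP = {"sắc", "nặng"}


@dataclass(frozen=True)
class Syllable:
    nucleus: str                   # required
    tone: str                      # one of TONES
    initial: Optional[str] = None  # optional
    medial: Optional[str] = None   # optional
    final: Optional[str] = None    # optional

    def is_valid(self) -> bool:
        """Check the constraints stated in the text, nothing more."""
        if not self.nucleus or self.tone not in TONES:
            return False
        if self.final in STOP_FINALS:
            return self.tone in TONES_AFTER_STOP
        return True


# The example from the text: "toán" = /t/ + /o/ + /a/ + /n/ + tone "sắc".
toan = Syllable(initial="t", medial="o", nucleus="a", final="n", tone="sắc")
# A made-up combination that violates the stop-final constraint.
blocked = Syllable(initial="t", nucleus="a", final="t", tone="huyền")
print(toan.is_valid(), blocked.is_valid())
```

A real validator would also need the inventories of initials, nuclei and endings, which are not reproduced in full here.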

Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of each phonetic unit [14]

TONAL SYLLABLE (6492)
    BASE SYLLABLE (2376)
        Initial (22)
        Final (155)
            Medial (1)
            Nucleus (16)
            Ending (8)
    TONE (6)

2.2 Vietnamese prosody

Vietnamese is a tonal language, and its prosody is composed of two components, which we call “micro-prosody” and “macro-prosody”:

• Micro-prosody is the variation of pitch, duration and intensity of an individual word or syllable. For a tonal language, micro-prosody is very important for distinguishing the syllable’s tone. Thus, the lexical meaning of the synthesized sound depends heavily on the quality of the micro-prosody.

• Macro-prosody is the application of prosody to a whole phrase or sentence. It depends on the type of sentence and the speaker's intentions or emotions. Therefore, the "naturalness" of synthesized sentences depends heavily on the ability to control macro-prosody during the speech synthesis process.

2.2.1 Micro-prosody and tones system in Vietnamese

In Vietnamese, micro-prosody depends heavily on the tone of the syllable. Each tone can contribute to forming the morpheme and the meaning of a word; it is also a distinctive signal. The tone has the same function as a phoneme: it is always assigned to a syllable and its influence covers the entire syllable. The tones give Vietnamese a musical character and make sentences rhythmic and melodious.

There are six tones in Vietnamese; they are shown in Table 2.5.

Table 2.5: The six Vietnamese tones

Tone 1   Tone 2       Tone 3     Tone 4     Tone 5     Tone 6
ngang    huyền ‘\’    ngã ‘~’    hỏi ‘?’    sắc ‘/’    nặng ‘.’

Figure 2-1: Example of the contours of six tones, as described in [21]

• Tone 1 - Level tone (“ngang”): a high tone. At the beginning of the syllable, it is the highest tone. The steady state of the level contour is […]

[…] 1 for male and female voice (the two lines present the maximum and minimum F0 values).

Figure 2-2: The shape of Tone 1 with female and male voice [18]

• Tone 2 - Falling tone (“huyền”): the onset of the falling tone is lower than that of tone 1, tone 5 and tone 3. The low F0 at the onset gradually falls toward the end.

Figure 2-3: The shape of Tone 2 with female and male voice [18]

• Tone 3 - Broken tone (“ngã”): the onset is as high as the level of tone 5 and higher than that of the falling tone. The second third of the contour of this tone is characterized by an abrupt dip caused by glottalization. In most cases, the bottom of the dip occurs between the mid-point and the point two-thirds from the onset. A creaky voice is heard during this dip.


Figure 2-4: The shape of Tone 3 with female and male voice [18]

• Tone 4 - Curve tone (“hỏi”): the onset is the lowest among the six tones. The low onset falls further, gradually, until the point two-thirds from the onset. From this point, the extremely low F0 starts to rise toward the end.

Figure 2-5: The shape of Tone 4 with female and male voice [18]

• Tone 5 - Rising tone (“sắc”): the onset is also high. Starting from the high onset, the F0 gradually rises for the first two-thirds of the duration. After this point, the rise becomes more rapid.

• When tone 5 occurs on a syllable ending with a stop consonant (t, p, c, k), the onset is higher than that of tone 5a and the F0 rises rapidly over a short duration. We call this tone 5b.

Figure 2-7: The shape of Tone 5b with female and male voice [18]

• Tone 6 - Drop tone (“nặng”): the onset is usually higher than that of the falling or curve tone but considerably lower than that of tone 1, tone 5 and tone 3. This tone is characterized by glottalization at the end and by a considerably shorter duration than the other tones: approximately two-thirds of their duration. The main body of this tone is almost level or slightly falling.

• Tone 6b (tone 6 ending with stop consonants): the onset is nearly equal to that of tone 2. The F0 falls toward the end over a short duration.


Figure 2-9: The shape of Tone 6b with female and male voice [18]
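The qualitative onset relations in the tone descriptions above can be collected into a small lookup table. This is only one reading of the text: the integer levels are ordinal ranks, not F0 values; placing “ngang” above “ngã” and “sắc” follows the statement that tone 1 is the highest at the start of the syllable; and the names are hypothetical. Tones 5b and 6b are left out.

```python
# Ordinal onset levels (0 = lowest). Only the ordering comes from the text.
ONSET_LEVEL = {
    "hỏi": 0,    # tone 4: lowest onset of the six tones
    "huyền": 1,  # tone 2: lower than tones 1, 3 and 5
    "nặng": 2,   # tone 6: above tones 2 and 4, below tones 1, 3 and 5
    "ngã": 3,    # tone 3: as high as tone 5
    "sắc": 3,    # tone 5: high onset
    "ngang": 4,  # tone 1: highest at the beginning of the syllable
}

# Duration relative to the other tones; tone 6 is about two-thirds as long.
RELATIVE_DURATION = {tone: 1.0 for tone in ONSET_LEVEL}
RELATIVE_DURATION["nặng"] = 2 / 3


def sort_by_onset(tones):
    """Return the given tones ordered from lowest to highest onset."""
    return sorted(tones, key=lambda tone: ONSET_LEVEL[tone])


ordered = sort_by_onset(["ngang", "huyền", "hỏi", "nặng"])
print(ordered)
```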

These descriptions apply only to the Northern dialect, in particular the Hanoi dialect, which is the standard dialect of Vietnamese. They would differ for the other dialects in the South and the Center of Vietnam. In these regions, there are only 5 tones instead of the 6 of the Hanoi dialect, because tone 3 and tone 4 are pronounced identically.

In continuous speech, tones seldom reach their target values. They are generally affected by context (stressed vs. unstressed syllable, influence of neighbouring tones, tempo…) and by some phenomena in Vietnamese prosody, which we will discuss later.

2.2.2 Macro-prosody and sentence types in Vietnamese

As discussed above, macro-prosody depends on the type of sentence and on the speaker's intentions or emotions. In this thesis, we discuss only the role of sentence types in Vietnamese prosody.

The classification of Vietnamese sentences

• Classification by purpose: Sentences can be classified based on their purpose [8]:

◦ An assertive sentence or declaration: the most common type, commonly makes a statement. Ex: Tôi sẽ về nhà. (I am going home.)

◦ An interrogative sentence or question: is commonly used to request information. Ex: Khi nào anh sẽ làm việc? (When are you going to work?)

◦ An imperative sentence or command: is ordinarily used to make a demand or request. Ex: Mở cửa ra! (Open the door!)

◦ An exclamatory sentence or exclamation: is generally a more emphatic form of statement. Ex: Ngày hôm nay tuyệt quá! (What a wonderful day this is!)

• Classification by structure: Sentences can also be classified based on their structure (by the number and types of finite clauses), as in the diagram below.

Figure 2-10: Sentence classification by structure [20]

Within the scope of this thesis, we study only the macro-prosody of assertive, interrogative and imperative sentences with a single structure.

In their research, Nguyễn and Boulakia [8] gave some prosodic characteristics of three types of sentence (assertive, interrogative and imperative), as follows:

• Duration (Tempo): Interrogative sentences (Q) are shorter than assertive sentences (S), and this difference is significant. Imperatives (I) are even shorter, but their differences from Q and S are not significant.

• Intensity: The difference is significant for the S/I pair (assertive vs. imperative), but not for the S/Q and Q/I pairs.

• Fundamental frequency: The mean F0 value of interrogative and imperative utterances is higher than that of statements, while there is no difference between interrogative and imperative sentences. There is an obvious difference in the last syllable. The phonologically "level" (high) tone falls in statements and is much higher and rising in questions, while for imperatives the mean value and movement are halfway between. The rising tones rise even more in interrogative and imperative sentences than in statements. This means that intonation influences the tone of the sentence-final syllable.

Figure 2-11: The sentence “Lan thích ăn cơm không” in Assertive (S) and Interrogative (Q) mode [8]

Figure 2-12: […] Assertive (S) and Imperative (I) mode [8]

Figure 2-13: The sentence “Tân bỏ đi chứ” in Interrogative (Q) and Imperative (I) mode [8]

Vu M.Q. et al. [16] found that the main differences in intonation lie at the end of the sentence (the zone after the vertical bar in Figure 2-14): the contour of the last syllable, or of its second half, tends to rise in interrogative sentences.

Figure 2-14: The differences in F0 contour between assertive and interrogative sentences [16]
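One simple way to quantify "the second half of the last syllable tends to rise" is the slope of a least-squares line fitted to the F0 samples of that half. The sketch below is an illustration under that assumption, not the analysis procedure of [16]. The function names are hypothetical and the F0 values are invented.

```python
def linear_slope(times, values):
    """Least-squares slope of values against times (e.g. Hz per second)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    denominator = sum((t - mean_t) ** 2 for t in times)
    if denominator == 0:
        raise ValueError("time stamps must not all be equal")
    return numerator / denominator


def final_half_slope(times, f0):
    """F0 slope over the second half of the last syllable's samples."""
    half = len(times) // 2
    return linear_slope(times[half:], f0[half:])


# Invented F0 samples (Hz) for the last syllable of two utterances.
times = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25]
question_f0 = [200.0, 202.0, 205.0, 215.0, 230.0, 250.0]
statement_f0 = [200.0, 198.0, 195.0, 190.0, 183.0, 175.0]

question_slope = final_half_slope(times, question_f0)
statement_slope = final_half_slope(times, statement_f0)
print(question_slope > 0, statement_slope < 0)
```

A positive slope on the toy question contour and a negative slope on the toy statement contour match the tendency described above. Real recordings would need F0 extraction and unvoiced-frame handling first.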

Posted: 28/02/2021, 00:01

References

[1]. Chilin Shih, Greg Kochanski, “Prosody and Prosodic Models”, www.prosodies.org
[2]. Do T.D., Tran T.H., et al. (1998), “Intonation system - A survey of twenty languages”, chap. 22, Cambridge University Press
[3]. Dung Tien Nguyen, Hansjörg Mixdorff, Mai Chi Luong, Huy Hoang Ngo, Bang Kim Vu (2005), “Fujisaki Model based F0 contours in Vietnamese TTS”, Eurospeech proceeding
[4]. H. Fujisaki, S. Ohno, C. Wang (1974), “A command-response model for F0 contour generation in multilingual speech synthesis”, Journal of Phonetics, vol. 2, pp 223-232
[5]. Mixdorff H. (1998), “Intonation patterns of German - Model-based quantitative analysis and synthesis of F0 contours”, PhD thesis, TU Dresden
[6]. Mixdorff H. (2001), “An Integrated Approach to Modeling German Prosody”, TU Dresden
[7]. Mixdorff H., Nguyen Hung Bach, Hiroya Fujisaki and Mai Chi Luong (2003), “Quantitative Analysis and Synthesis of Syllabic Tones in Vietnamese”, Eurospeech proceeding
[8]. Nguyen T.T.H. and Boulakia G. (1999), “Another look at Vietnamese intonation”, ICPhS'99
[9]. Ninh Khanh Duy (2005), “Characterization of Vietnamese intonation for questions”, Master Thesis, Hanoi University of Technology
[11]. Sami Lemmetty (1999), “Review of Speech Synthesis Technology”, MSc thesis, Helsinki University of Technology
[12]. Thierry Dutoit (1993), “High Quality Text-To-Speech Synthesis of the French Language”, PhD thesis, Faculte Polytechnique de Mons, TCTS Lab, Belgium
[13]. Thierry Dutoit (1997), “An Introduction to Text-to-Speech Synthesis”, Kluwer Academic Publishers
[14]. Tran D.D., Castelli E., et al. (2005), “Influence of F0 on Vietnamese syllable perception”, Interspeech
[15]. Tran Do Dat (2003), “Building a large Vietnamese Speech Database”, Master Thesis, Hanoi University of Technology
[16]. Vu M.Q., Tran D.D., Castelli E. (2006), “Prosody of Interrogative and Affirmative Sentences in Vietnamese Language: Analysis and Perceptive Results”
[19]. Bạch Hưng Nguyên, Nguyễn Tiến Dũng (2005), “Mô hình Fujisaki và áp dụng trong phân tích thanh điệu tiếng Việt” [The Fujisaki model and its application to the analysis of Vietnamese tones]
[20]. Mai Ngọc Chừ, Vũ Đức Nghiệu, Hoàng Trọng Phiến (2005), “Cơ sở ngôn ngữ học và tiếng Việt” [Foundations of linguistics and the Vietnamese language], NXB Giáo dục
