Nghiên cứu tổng hợp tiếng nói cho ngôn ngữ ít nguồn tài nguyên theo hướng thích nghi, ứng dụng với tiếng Mường (Speech Synthesis for Low-Resourced Languages based on Adaptation Approach: Application to Muong Language)
DECLARATION OF AUTHORSHIP
I, Pham Van Dong, declare that the dissertation titled "Speech Synthesis for Low-Resourced Languages based on Adaptation Approach: Application to Muong Language" has been entirely composed by myself. I affirm the following points:
This work was done wholly or mainly while in candidature for a Ph.D research degree at Hanoi University of Science and Technology.
The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.
Appropriate acknowledgment has been given within this dissertation where reference has been made to the published work of others.
The dissertation submitted is my own, except where work done in collaboration has been included. The collaborative contributions have been clearly indicated.
Hanoi, December 8, 2023
Ph.D Student
Pham Van Dong
ADVISORS
1. Dr. Mac Dang Khoa
2. Assoc. Prof. Tran Do Dat
ACKNOWLEDGMENT
Foremost, I would like to express my most sincere and deepest gratitude to my thesis advisors, Dr. Mạc Đăng Khoa (Speech Communication Department, MultiLab at MICA) and Assoc. Prof. Trần Đỗ Đạt (Ministry of Science and Technology, Vietnam), for their continuous support and guidance during my Ph.D program, and for providing me with such a rigorous and inspiring research environment. I am grateful to Dr. Mạc Đăng Khoa for his excellent mentorship, caring, patience, and immense Text-To-Speech (TTS) knowledge. His advice helped me in all the research and writing of this thesis. I am very thankful to Assoc. Prof. Trần Đỗ Đạt for shaping my thesis at the beginning and for his enthusiasm and encouragement. He substantially facilitated my Ph.D research, especially when I was new to speech processing and TTS, with his valuable comments on Vietnamese and Muong TTS.
I thank all MICA members for their help during my Ph.D study. My sincere thanks to Dr. Nguyen Viet Son, Assoc. Prof. Dao Trung Kien, and Dr. Do Thi Ngoc Diep for giving me much support and valuable advice. Thanks to Nguyen Van Thinh, Nguyen Tien Thanh, Dang Thanh Mai, and Vu Thi Hai Ha for their help. I want to thank my colleagues at the Hanoi University of Mining and Geology for all their support during my Ph.D study. Special thanks to my family for understanding my hours glued to the computer screen.
Hanoi, December 8, 2023
Ph.D Student
ABSTRACT
Text-to-speech (TTS) synthesis is the automatic conversion of text into speech. Typically, building a high-quality voice requires collecting tens of hours of speech from a professional speaker recorded with a high-quality microphone. About 7,000 languages are spoken worldwide, but only a few, such as English, Spanish, Mandarin, and Japanese, have good TTS systems. So-called "low-resourced languages", and even languages that are not yet written, have no TTS at all. Thus, to apply TTS technology to a low-resourced language, it is necessary to study other TTS methods.
In Vietnam, Vietnamese is the mother tongue and the most widely used language. Muong is a group of dialects spoken by the Muong people of Vietnam. It belongs to the Austroasiatic language family and is closely related to Vietnamese, and the Muong are one of the five ethnic groups with the largest populations. However, Muong still lacks an official script, making it a typical representative of the low-resourced languages of Vietnam. Therefore, researching TTS technologies to create a TTS for the Muong language is challenging.
In the first part of this thesis, we give an overview of TTS. While researching the phonetics of the Vietnamese and Muong languages, the thesis has also developed and published some tools to support TTS technology for the Vietnamese and Muong languages. In the rest of the thesis, we conduct various experiments in creating TTS for a low-resourced language; specifically, we experiment with the Muong language. We focus on two main low-resourced language groups:
 Written: We use input emulation to simulate the reading of the Muong language with a Vietnamese TTS, and cross-lingual adaptation via transfer learning.
 Unwritten: We experiment with adaptation in two directions. The first is to create Muong speech synthesis directly from Vietnamese text and Muong voice. The second is to create Muong speech synthesis from translation through an intermediate representation.
We hope our findings can serve as an impetus to develop speech synthesis for low-resourced languages worldwide and contribute to the basis for speech synthesis development for 53 ethnic minority languages in Vietnam.
Hanoi, December 8, 2023
Ph.D Student
CONTENT
DECLARATION OF AUTHORSHIP I
ACKNOWLEDGMENT II
ABSTRACT III
CONTENT IV
ABBREVIATIONS VIII
LIST OF TABLES X
LIST OF FIGURES XII
INTRODUCTION 1
PART 1 : BACKGROUND AND RELATED WORKS 5
CHAPTER 1 OVERVIEW OF SPEECH SYNTHESIS AND SPEECH SYNTHESIS FOR LOW-RESOURCED LANGUAGE 6
1.1 Overview of speech synthesis 6
1.1.1 Overview 6
1.1.2 TTS architecture 8
1.1.3 Evolution of TTS methods over time 9
1.1.3.1 TTS using unit-selection method 10
1.1.3.2 Statistical parameter speech synthesis 11
1.1.3.3 Speech synthesis using deep neural networks 13
1.1.3.4 Neural speech synthesis 14
1.2 Speech synthesis for low-resourced languages 19
1.2.1 TTS using emulating input approach 20
1.2.2 TTS using the polyglot approach 22
1.2.3 Speech synthesis for low-resourced language using the adaptation approach 25
1.3 Machine translation 27
1.3.1 Neural translation model 28
1.3.2 Attention in neural machine translation 29
1.3.3 Statistical machine translation based on phrase 30
1.3.3.1 Statistical machine translation problem based on phrase 30
1.3.3.2 Translation model and language model 31
1.3.3.3 Decode the input sentence in the translation system 32
1.3.3.4 Model for building a statistical translation system 34
1.3.4 Machine translation through intermediate representation 34
1.3.5 Speech translation for unwritten low-resourced languages 36
1.4 Speech synthesis evaluation metrics 38
1.4.1 Mean Opinion Score (MOS) 38
1.4.1.1 Definition 38
1.4.1.2 Formula 38
1.4.1.3 Significance 38
1.4.1.4 Confidence Interval (CI) 39
1.4.2 Mel Cepstral Distortion (MCD) 39
1.4.2.1 Concept 39
1.4.2.2 Formula 39
1.4.2.3 Significance 40
1.4.2.4 MCD with Dynamic Time Warping (MCD – DTW) 40
1.4.3 Analysis of variance (ANOVA) 40
1.4.4 Intelligibility 42
1.5 Conclusion 42
CHAPTER 2 VIETNAMESE AND MUONG LANGUAGE 44
2.1 Vietnamese language 44
2.1.1 History of Vietnamese 44
2.1.2 Vietnamese phonetic system 45
2.1.2.1 Vietnamese syllable structure 46
2.1.2.2 Vietnamese phonetic system 47
2.1.2.3 Vietnamese tone system 49
2.2 Muong language 50
2.2.1 Overview of Muong people and Muong language 50
2.2.1.1 Muong history 50
2.2.1.2 Viet Muong group 51
2.2.1.3 Muong dialects 53
2.2.1.4 Muong written script 54
2.2.2 Muong phonetics system 55
2.2.2.1 Muong syllable structure 55
2.2.2.2 Muong phoneme system 55
2.2.2.3 Muong tone system 57
2.3 Comparison between Vietnamese and Muong 57
2.4 Discussion and proposed approach 60
PART 2 : SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE 61
CHAPTER 3 EMULATING THE MUONG TTS BASED ON INPUT TRANSFORMATION OF THE VIETNAMESE TTS 62
3.1 Proposed method 63
3.1.1 Muong G2P module 64
3.1.2 Muong emulating IPA module 65
3.2 Experiment 65
3.2.1 Testing materials 66
3.2.2 Experiment protocol 67
3.2.3 Results 68
3.2.4 Analysis by ANOVA method 72
3.2.4.1 MOS analysis by ANOVA 72
3.2.4.2 Intelligibility analysis by ANOVA 75
3.3 Conclusion 77
CHAPTER 4 CROSS-LINGUAL TRANSFER LEARNING FOR MUONG SPEECH SYNTHESIS 78
4.1 Proposed method 78
4.2 Experiment 82
4.2.1 Dataset 82
4.2.1.1 Vietnamese data 82
4.2.1.2 Muong Project's data 84
4.2.1.3 Muong fine-tuning data 84
4.2.2 Graphemes to phonemes 85
4.2.3 Training the pretrained model using Vietnamese dataset 86
4.2.4 Finetuned TTS model on Muong datasets 87
4.3 Evaluation 88
4.4 MOS analysis by ANOVA 91
4.5 Conclusion 94
PART 3 : SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN LANGUAGE 96
CHAPTER 5 GENERATE UNWRITTEN LOW-RESOURCED LANGUAGE’S SPEECH DIRECTLY FROM RICH-RESOURCE LANGUAGE’S TEXT 97
5.1 Introduction 97
5.2 Proposed method 98
5.2.1 Model architecture 98
5.2.2 Database 99
5.2.3 Training the speech synthesis system 100
5.2.4 Evaluation 100
5.2.5 MOS analysis by ANOVA 105
5.2.5.1 ANOVA analysis in Muong Bi speech synthesis 105
5.2.5.2 ANOVA analysis in Muong Tan Son speech synthesis 108
5.3 Conclusion 111
CHAPTER 6 SPEECH SYNTHESIS FOR UNWRITTEN LOW-RESOURCED LANGUAGE USING INTERMEDIATE REPRESENTATION 112
6.1 Proposed method 112
6.2 Experiment 114
6.2.1 Database building 114
6.2.2 System development 114
6.2.2.1 Text to phone translation 115
6.2.2.2 Phone to Sound Conversion 117
6.3 Evaluation 119
6.3.1 Evaluation in Muong Bi and Muong Tan Son 119
6.3.2 MOS analysis by ANOVA 122
6.3.2.1 ANOVA analysis in Muong Bi speech synthesis 122
6.3.2.2 ANOVA analysis in Muong Tan Son speech synthesis 125
6.4 Conclusion and comparison 128
Conclusions 135
Future work 136
PUBLICATIONS 138
REFERENCES 139
APPENDIX A 1
A.1 Vietnamese and Muong phonetic 1
A.2 Muong G2P 4
A.3 Muong Vietnamese phone mapping 6
A.4 Information of Muong volunteers who participated in the assessment 9
A.5 Speech signal samples of the Muong TTS in chapter 5 12
ABBREVIATIONS
G2P Grapheme to Phoneme
HTK Hidden Markov Model Toolkit: a portable toolkit for building and manipulating hidden Markov models
IPA International Phonetic Alphabet
MARY The MARY text-to-speech synthesis platform
TD-PSOLA Time-Domain Pitch Synchronous OverLap and Add
WEKA Waikato Environment for Knowledge Analysis: a collection of machine learning algorithms for data mining tasks
argmax The argument that gives the maximum value of a target function
p(e | f) Conditional probability of e given f
LIST OF TABLES
Table 2.1 Vietnamese syllable structure [94] 46
Table 2.2 Vietnamese syllable structure [96] 46
Table 2.3 Vietnamese syllables based on structure 47
Table 2.4 Hanoi Vietnamese initial consonants 48
Table 2.5 The letter of initial consonant 48
Table 2.6 Hanoi Vietnamese final consonant 49
Table 2.7 Tone of Hanoi Vietnamese [108] 49
Table 2.8 Muong syllabic structure 55
Table 2.9 Muong final sound system 56
Table 2.10 Muong Hoa Binh tone system [115] 57
Table 2.11 Muong Bi and Muong Tan Son Tone 57
Table 2.12 Muong and Vietnamese phonetic comparison (orthography in normal, IPA in italic; Vi: Vietnamese; Mb: Muong Bi; Mts: Muong Tan Son) 59
Table 2.13 Comparing the tone of Vietnamese with Muong Tan Son and Muong Bi 60
Table 3.1 Muong G2P Result Sample 64
Table 3.2 Examples of applying transformation rules to convert the Muong text into input text for Vietnamese TTS 65
Table 3.3 Testing material for emulating tone 66
Table 3.4 Testing material for emulating phone (the concerning phonemes in bold) 67
Table 3.5 Testing material for remaining phonemes 67
Table 3.6 ANOVA Results for MOS Test 73
Table 3.7 ANOVA Results for Intelligibility Test 75
Table 4.1 Parameters of acoustic model 80
Table 4.2 Vietnamese dataset information 83
Table 4.3 Muong recorded data 85
Table 4.4 The Muong split data set 85
Table 4.5 Parameter for optimizer 86
Table 4.6 Parameter values when training the HiFi-GAN model 86
Table 4.7 The specifications of the in-domain and out-domain test sets 89
Table 4.8 Test set samples 89
Table 4.9 Evaluation results 90
Table 4.10 ANOVA Results for in-domain MOS Test 92
Table 4.11 ANOVA Results for out-domain MOS Test 93
Table 4.12 ANOVA Results for in/out domain MOS Test 94
Table 5.1 Evaluation Score 102
Table 5.2 TTS evaluation with in-domain test set 103
Table 5.3 TTS evaluation with out-domain test set 104
Table 5.4 ANOVA Results for in-domain MOS Test for Muong Bi 106
Table 5.5 ANOVA Results for out-domain MOS Test for Muong Bi 107
Table 5.6 ANOVA Results for Muong Bi in/out domain MOS Test 107
Table 5.7 ANOVA Results for in-domain MOS Test for Muong Tan Son 109
Table 5.8 ANOVA Results for out-domain MOS Test for Muong Tan Son 110
Table 5.9 ANOVA Results for Muong Tan Son in/out domain MOS Test 110
Table 6.1 Examples of labeling Vietnamese text into an intermediate representation of Muong Bi and Muong Tan Son phonemes 117
Table 6.2 Text information of Muong language datasets 118
Table 6.5 ANOVA Results for in-domain MOS Test for Muong Bi 123
Table 6.6 ANOVA Results for out-domain MOS Test for Muong Bi 124
Table 6.7 ANOVA Results for Muong Bi in/out domain MOS Test 124
Table 6.8 ANOVA Results for in-domain MOS Test for Muong Tan Son 126
Table 6.9 ANOVA Results for out-domain MOS Test for Muong Tan Son 127
Table 6.10 ANOVA Results for Muong Tan Son in/out domain MOS Test 127
Table A.1 Vietnamese vowels 1
Table A.2 The Muong initial consonant 1
Table A.3 Muong vowels system 2
Table A.4 The correspondences between Vietnamese and Muong in 12 words refer to the human body parts [137] 4
Table A.7 Muong G2P 4
Table A.8 Muong Vietnamese phone mapping 7
Table A.9 Muong Hoa Binh volunteers 9
Table A.10 Muong Phu Tho volunteers 10
LIST OF FIGURES
Figure 1.1 Basic system architecture of a TTS system [22] 8
Figure 1.2 Neural TTS architecture [3] 9
Figure 1.3 General and clustering-based unit-selection scheme: Solid lines represent target costs and dashed lines represent concatenation costs [13] 10
Figure 1.4 Core architecture of HMM-based speech synthesis system [25] 11
Figure 1.5 General HMM-based synthesis scheme [13, p 5] 12
Figure 1.6 A speech synthesis framework based on a DNN [29] 13
Figure 1.7 Encoder and Decoder diagram in Seq2Seq model 14
Figure 1.8 Char2Wav model [23] 17
Figure 1.9 Model of the Tacotron synthesis system [24] 18
Figure 1.10 Block diagram of the Tacotron 2 system architecture [25] 19
Figure 1.11 Scheme of an HMM-based polyglot synthesizer [48] 23
Figure 1.12 Approaches to transfer TTS model from source language to target language [32] 26
Figure 1.13 Examples of sequence to sequence transformation [55] 28
Figure 1.14 The location of the attention model in neural machine translation 29
Figure 1.15 Example of translating an English input sentence into Chinese based on the phrase 31
Figure 1.16 Illustration of the process of translating a Spanish sentence into an English sentence [63] 33
Figure 1.17 Deploying a statistical translation system [67] 34
Figure 1.18 An ordinary voice translation system [11] 35
Figure 1.19 Model of the speech-to-speech machine translation system using intermediate representation for unwritten language 36
Figure 1.20 Voice-to-text translation system [83] 37
Figure 2.1 Mon-Khmer branch of the Austroasiatic family [109, pp 175–176] 51
Figure 2.2 Viet-Muong Group [110] 52
Figure 2.3 The distribution of the Muong dialects [114, p 299] 53
Figure 3.1 Emulating TTS for Muong 63
Figure 3.2 Muong G2P Module 64
Figure 3.3 Intelligibility Results for Muong emulating tones 69
Figure 3.4 Intelligibility Test Result for emulating close phonemes 70
Figure 3.5 Intelligibility Test Result for Equivalent phonemes 71
Figure 3.6 MOS Emulating Test Result 72
Figure 4.1 Low-resourced L2 TTS transfer learning from rich resource L1 79
Figure 4.2 Block diagram of the speech synthesis system architecture 80
Figure 4.3 Duration histogram 83
Figure 4.4 Duration distribution across the M_15m, M_30m, and M_60m datasets 85
Figure 4.5 Training loss and validation loss of pretrained TTS model 87
Figure 4.6 Training loss and validation error of the HiFi-GAN model 87
Figure 4.7 Training loss and validation loss of M_15m 88
Figure 4.8 Training loss and validation loss of M_30m and M_60m 88
Figure 5.1 System architecture 99
Figure 5.2 WaveGlow model architecture [136] 99
Figure 5.3 Muong Phu Tho training loss and validation loss after training acoustic model 100
Figure 5.5 Testing interface 102
Figure 6.1 Training phase of the L1 text to L2 speech TTS system using a phoneme-level intermediate representation 113
Figure 6.2 Decoding phase of the L1 text to L2 speech TTS system using a phoneme-level intermediate representation 113
Figure 6.3 The result after manual annotation 114
Figure 6.4 Phone to sound module, as a speech synthesis from phone sequence 117
Figure 6.5 Muong Hoa Binh Training loss and validation loss after training acoustic model 118
Figure 6.6 Muong Phu Tho Training loss and validation loss after training acoustic model 118
Figure 6.7 Testing interface 119
Figure 6.8 Comparing the synthesized speech results on Muong Hoa Binh using three methods 129
Figure 6.9 Comparing the synthesized speech results on Muong Hoa Binh using three methods 130
Figure 6.10 Comparing the synthesized speech results on Muong Phu Tho using two methods 131
Figure 6.11 Comparing the synthesized speech results on Muong Phu Tho using two methods 132
Figure 6.12 Summary of directions for low-resourced language speech synthesis 133
Figure A.1 Raw Muong Hoa Binh: ban vận động thành lập hội trí thức tỉnh ra mắt 13
Figure A.2 Muong Hoa Binh synthesis: ban vận động thành lập hội trí thức tỉnh ra mắt 13
Figure A.3 Muong Phu Tho raw: ban vận động thành lập hội trí thức tỉnh ra mắt 14
Figure A.4 Muong Phu Tho synthesis: ban vận động thành lập hội trí thức tỉnh ra mắt 14
Figure A.5 Muong Hoa Binh raw - Bố cháu ở nhà hay đi đâu 15
Figure A.6 Muong Hoa Binh synthesis: Bố cháu ở nhà hay đi đâu 15
Figure A.7 Muong Phu Tho raw: Bố cháu ở nhà hay đi đâu 16
Figure A.8 Muong Phu Tho synthesis: Bố cháu ở nhà hay đi đâu 16
INTRODUCTION
Motivation
Today's speech-processing technology is essential in many aspects of human-machine interaction. Many recent voice interaction systems have been introduced, allowing users to communicate with devices on various platforms, such as smartphones (Apple Siri, Google Cloud, Amazon Alexa, etc.), intelligent cars (BMW, Ford, etc.), and smart homes. In these systems, one of the essential components is speech synthesis, or Text-to-Speech (TTS), which converts input text into speech. Developing a TTS system for a language requires not only the implementation of speech processing techniques but also linguistic studies covering phonetics, phonology, syntax, and grammar.
According to statistics in the 25th edition of Ethnologue (regarded as the most comprehensive source of information on linguistic statistics), there are 7,151 living languages in the world, belonging to 141 language families, of which 2,982 languages are not written. Some languages have not even been described in the academic literature, such as the dialects of ethnic minorities. Machine learning methods based on big data do not immediately apply to low-resourced languages, especially unwritten ones. The field of low-resourced and unwritten language processing has attracted attention only in the past few years and does not yet have many results. However, research in this field is essential: beyond bringing voice communication technologies to ethnic minority communities, it also contributes to the conservation of endangered languages. Regarding Vietnamese language and speech processing, domestic research units have given the field comprehensive attention, addressing various aspects ranging from natural language processing problems, such as text processing, syntactic analysis, and semantics, to speech processing problems, such as synthesis and recognition. However, language and speech processing in general, including TTS systems for minority languages without a writing system in Vietnam, has not received much attention due to the scarcity of data sources, such as bilingual text data and speech data, as well as a lack of related linguistic studies.
The Muong language presents unique linguistic characteristics that make it challenging to develop a TTS system, such as tonality and complex phonetic structures. Therefore, this thesis aims to fill this gap by focusing on developing a TTS system for the Muong language, a minority language spoken in Vietnam that does not have a writing system (only the Muong Hoa Binh dialect has had a writing system, introduced in 2016). This research area is novel not only in Vietnam but also worldwide, and the development of a Muong TTS system can contribute to preserving and promoting this endangered language.
Context and constraints
This thesis will classify low-resourced languages into two categories: written and unwritten The Muong language will be the object of study in both cases:
Written: The Muong dialect of Hoa Binh will be examined, as it possesses a written form
Unwritten: The Muong dialect of Phu Tho will be investigated, as it lacks a written form
In other regions, the Muong people currently do not use a written language. They often read directly from Vietnamese text and convert it into Muong speech for broadcasting and communication purposes. This research aims to address these challenges and improve the accessibility of TTS technology for both written and unwritten Muong dialects.
Moreover, this thesis is conducted within the scope of, and in collaboration with, the project DLCN.20/17: "Research and development automatic translation system from Vietnamese text to Muong speech, apply to unwritten minority languages in Vietnam"
(Nghiên cứu xây dựng hệ dịch tự động văn bản tiếng Việt ra tiếng nói tiếng Mường, hướng đến áp dụng cho các ngôn ngữ dân tộc thiểu số chưa có chữ viết ở Việt Nam). Specific components of this project include:
Recorded speech from both Muong Hoa Binh and Muong Phu Tho dialects
A machine translation tool that converts Vietnamese text to an intermediate representation of the Muong language
Conversely, the research findings of this thesis have been successfully applied and integrated into the project above, demonstrating the practical value of the work undertaken in this thesis.
Challenges
Challenges Faced by Current Research:
 Data Scarcity: The foremost challenge is the paucity of training data. TTS models demand substantial text-speech pairs for effective training. However, for low-resourced languages, acquiring such data can be exceedingly difficult, if not impossible.
 Limited Linguistic Knowledge: Inadequate linguistic knowledge hinders TTS system development. Understanding language structure, vocabulary, and prosody is crucial, but this knowledge is frequently absent for low-resourced languages.
 Lack of Linguistic Studies: Linguistic research serves as the backbone for building TTS systems. Unfortunately, languages with limited resources often lack comprehensive linguistic studies, making it arduous to capture essential linguistic characteristics.
To address these challenges, this work proposes an adaptive TTS approach that efficiently utilizes limited resources to synthesize high-quality speech for Muong, a low-resourced language. The approach leverages transfer learning techniques from related languages and applies unsupervised learning methods to reduce the need for extensive labelled data. In addition, emulating the input of a rich-resource TTS is a promising option for a written low-resourced language. For an unwritten low-resourced language, one adaptation is to use text or an intermediate representation of another language to help build a better TTS.
The proposed approach demonstrates the effectiveness of adaptive TTS in synthesizing low-resourced languages. However, further research and investment in linguistic studies for low-resourced languages are necessary to improve the quality of TTS systems. With continued efforts, we can develop more robust TTS systems that provide access to speech synthesis for all languages, regardless of their resource availability.
Objectives & approaches
This thesis aims to develop a Text-to-Speech (TTS) system for low-resourced languages, focusing on the Muong language, by utilizing adaptation techniques. We categorize low-resourced languages into two groups and, for each group, employ suitable methods:
 Written low-resourced languages: Using input emulation and an adaptive approach to enhance the available linguistic resources
Unwritten low-resourced languages: Employing intermediate representations or leveraging text from rich-resourced languages to bridge the gap in linguistic resources
In this way, the thesis aims to make TTS technology more accessible to low-resourced languages, thus expanding its applications and fostering communication across diverse linguistic communities. By focusing on the Muong language as a specific case study, this research not only contributes to the broader field of low-resourced languages but also opens doors for practical applications. For instance, it paves the way for the development of applications catering to the Muong community, including Muong radio broadcasts and Muong-speaking newspapers, all generated from Vietnamese text. This demonstrates the real-world impact of the research, showcasing its potential to empower minority languages like Muong and preserve their cultural heritage.
Contributions
The thesis presents the following key contributions:
 First contribution: A method for synthesizing speech from written text for a language with limited data, using the Muong language as a specific application case. This includes (1) an adaptation technique that utilizes the input of a Vietnamese speech synthesis system (without requiring training data) and (2) fine-tuning the Vietnamese speech synthesis model with a small amount of Muong language data.
 Second contribution: A method for synthesizing speech for an unwritten language using a closely related language with available resources (generating Muong speech from Vietnamese text). This approach treats the Muong language as if it were unwritten. The two proposed methods are (1) employing an intermediate representation and (2) directly converting Vietnamese text into Muong speech.
In addition to the two main contributions mentioned above, we also compared the Vietnamese and Muong languages, drawing several valuable conclusions for phonetic studies and natural language processing. We have published various educational materials and tools for processing text and vocabulary in Vietnamese and Muong.
 Chapter 2, titled "Vietnamese and Muong language": This chapter presents research on the phonology of the Vietnamese and Muong languages. Computational linguistic resources for Vietnamese speech processing are described in detail as applied in Vietnamese TTS.
PART 2: SPEECH SYNTHESIS FOR MUONG AS A WRITTEN LANGUAGE
 Chapter 3, titled "Emulating the Muong TTS based on input transformation of the Vietnamese TTS": This chapter presents a method that transforms Muong text into input for an existing Vietnamese TTS so that it emulates Muong speech, without requiring Muong training data or modification of the Vietnamese systems. This approach can be experimentally applied to create TTS systems for other Vietnamese ethnic minority languages quickly.
 Chapter 4, titled "Cross-lingual transfer learning for Muong speech synthesis": In this chapter, we use and experiment with approaches for Muong TTS that leverage Vietnamese resources. We focus on transfer learning by creating a Vietnamese TTS, further training it with different Muong datasets, and evaluating the resulting Muong TTS.
PART 3: SPEECH SYNTHESIS FOR MUONG AS AN UNWRITTEN LANGUAGE
 Chapter 5, titled "Generate unwritten low-resourced language's speech directly from rich-resource language's text", presents our approach for addressing speech synthesis challenges for unwritten low-resourced languages by synthesizing L2 speech directly from L1 text. The proposed system is built using end-to-end neural network technology for text-to-speech. We use Vietnamese as L1 and Muong as L2 in our experiments.
 Chapter 6, titled "Speech synthesis for unwritten low-resourced language using intermediate representation": This chapter proposes using a phoneme representation due to its close relationship with speech within a single language. The proposed method is applied to the Vietnamese and Muong language pair. Vietnamese text is translated into an intermediate representation of two unwritten dialects of the Muong language: Muong Bi - Hoa Binh and Muong Tan Son - Phu Tho. The evaluation reveals relatively high translation quality for both dialects.
In conclusion, speech synthesis for low-resourced languages is a significant research area with the potential to positively impact the lives of speakers of these languages. Despite challenges posed by limited data and linguistic knowledge, advancements in speech synthesis technology and innovative approaches enable the development of high-quality speech synthesis systems for low-resourced languages. The work presented in this dissertation contributes to this field by exploring novel methods and techniques for speech synthesis in low-resourced languages.
For future work, there is a need to continue developing innovative approaches to speech synthesis for low-resourced languages, particularly in response to the growing demand for accessible technology. This can be achieved through ongoing research in transfer learning, unsupervised learning, and data augmentation. Additionally, there is a need for further investment in collecting and preserving linguistic data for low-resourced languages and in developing phonological studies for these languages. With these efforts, we can ensure that speech synthesis technology is accessible to everyone, regardless of their language.
PART 1 : BACKGROUND AND RELATED WORKS
Chapter 1 Overview of speech synthesis and speech synthesis for low-resourced language
This chapter presents a concise overview of Text-to-Speech (TTS) synthesis and its application to low-resourced languages. It highlights the challenges faced in developing TTS systems for languages with limited resources and data. Additionally, it introduces various approaches and techniques to address these challenges and improve TTS quality for low-resourced languages.
1.1 Overview of speech synthesis
This section offers a brief introduction to the field of speech synthesis. It highlights the key concepts and techniques in converting written text into spoken language. It also provides
a foundation for understanding the complexities and challenges of developing speech synthesis systems
1.1.1 Overview
Speech synthesis is the artificial generation of human speech using technology. A computer system designed for this purpose, known as a speech computer or speech synthesizer, can be realized through software or hardware implementations. A text-to-speech (TTS) system specifically converts standard written language text into audible speech, whereas other systems transform symbolic linguistic representations, such as phonetic transcriptions, into speech [1]. TTS technology has evolved significantly, incorporating advanced algorithms and machine learning techniques to produce more natural-sounding and intelligible speech output. By simulating various aspects of human speech, including pitch, tone, and intonation, TTS systems strive to provide a seamless and user-friendly listening experience.
The development of TTS technology has undergone remarkable progress over time:
 In the 1950s, pioneers like Homer Dudley, with his "VODER", and Franklin S. Cooper, with his "Pattern Playback", laid the foundation for modern TTS systems
The 1960s brought forth formant-based synthesis, utilizing models of vocal tract resonances to produce speech sounds
The 1970s introduced linear predictive coding (LPC), enhancing speech signal modeling and producing more natural synthesized speech
The 1980s saw the emergence of concatenative synthesis, a method that combined pre-recorded speech segments for the final output
During the 1990s, unit selection synthesis became popular, using extensive databases to select the best-fitting speech units for more natural output
 The 2000s experienced the rise of statistical parametric synthesis techniques, such as Hidden Markov Models (HMMs), providing a data-driven and adaptable approach to TTS
The 2010s marked the beginning of deep learning-based TTS with models like Google's WaveNet, revolutionizing speech synthesis by generating raw audio waveforms instead of relying on traditional signal processing
End-to-end neural TTS systems like Tacotron streamlined the TTS process by directly converting text to speech without intermediate stages
Transfer learning and multilingual TTS models have recently enabled the development of high-quality TTS systems for low-resourced languages, expanding the reach of TTS technology
Today, TTS plays a vital role in everyday life, powering virtual assistants, accessibility tools, and various digital content types
Some current applications of text-to-speech (TTS) technology include:
Assistive technology for the visually impaired: TTS systems help blind and visually impaired individuals by reading text from books, websites, and other sources, converting it into audible speech
Learning tools: TTS systems are used in computer-aided learning programs, aiding language learners and students with reading difficulties
or dyslexia by providing auditory reinforcement
Voice output communication aids: TTS technology assists individuals with severe speech impairments by enabling them to communicate through synthesized speech
Public transportation announcements: TTS provides automated announcements for passengers on buses, trains, and other public transportation systems
E-books and audiobooks: TTS systems can read electronic books and generate audiobooks, making content accessible to a broader audience
Entertainment: TTS technology is utilized in video games, animations, and other forms of multimedia entertainment to create realistic and engaging voiceovers
Email and messaging: TTS systems can read emails, text messages, and other written content aloud, helping users stay connected and informed
Call center automation: TTS is employed in automated phone systems, allowing users to interact with voice-activated menus and complete transactions through spoken commands
Virtual assistants: TTS is a crucial component of popular voice-activated virtual assistants like Apple's Siri, Google Assistant, and Amazon's Alexa, enabling them to provide spoken responses to user queries
 Voice search applications: By integrating TTS with speech recognition, users can use speech as a natural input method for searching and retrieving information through voice search apps
In conclusion, TTS technology has come a long way since its inception, with continuous advancements in algorithms, machine learning, and deep learning techniques. As a result, TTS systems now provide more natural-sounding and intelligible speech, enhancing the user experience across various applications such as assistive technology, learning tools, entertainment, virtual assistants, and voice search. The ongoing development and integration
of TTS into our daily lives will continue to shape the future of human-computer interaction and digital accessibility
1.1.2 TTS architecture
The architecture of a TTS system is generally composed of several components, as depicted in Figure 1.1. The Text Processing component is responsible for preparing the input text for speech synthesis. The G2P Conversion component converts the written words into their corresponding phonetic representations. The Prosody Modeling component adds appropriate intonation, duration, and other prosodic features to the phonetic sequence. Lastly, the Speech Synthesis component generates the speech waveform based on the parameters derived from the fully tagged phonetic sequence [2].
Text processing is crucial for identifying and interpreting all textual or linguistic information that falls outside the realms of phonetics and prosody. Its primary function is to transform non-orthographic elements into words that can be spoken aloud. Through text normalization, symbols, numbers, dates, abbreviations, and other non-orthographic text elements are converted into a standard orthographic transcription, facilitating subsequent phonetic conversion. Additionally, analyzing whitespace, punctuation, and other delimiters is vital for determining document structure and providing context for all subsequent steps. Certain text structure elements may also directly impact prosody. Advanced syntactic and semantic analysis can be achieved through effective text-processing techniques [2, p 682].
Figure 1.1 Basic system architecture of a TTS system [22]
The phonetic analysis aims to transform orthographic symbols of words into phonetic representations, complete with any diacritic information or lexical tones present in tonal languages. Although future TTS systems might rely on word-sounding units and possess increased storage capacity, homograph disambiguation and grapheme-to-phoneme (G2P) conversion for new words remain essential for the accurate pronunciation of every word. G2P conversion is relatively straightforward in languages with a clear relationship between written and spoken forms. A small set of rules can effectively describe this direct correlation, which is characteristic of phonetic languages such as Spanish and Finnish. Conversely, English is not a phonetic language due to its diverse origins, resulting in less predictable letter-to-sound relationships. In these cases, employing general letter-to-sound rules and dictionary lookups can facilitate the conversion of letters to sounds, enabling the correct pronunciation of any word [2, p 683].
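To illustrate how such a small rule set can look in practice, here is a minimal Python sketch of rule-based G2P for a hypothetical, highly phonetic orthography; the rule table and phone symbols are invented for illustration and are not the thesis's Muong or Vietnamese rules.

```python
# Minimal rule-based G2P sketch for a highly phonetic orthography.
# The rule table below is a toy, illustrative set (not the thesis's rules).

RULES = {          # grapheme -> phone, multi-letter entries tried first
    "ch": "tʃ",
    "ll": "ʝ",
    "qu": "k",
    "c": "k",
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
    "n": "n", "s": "s", "t": "t",
}

def g2p(word: str) -> list[str]:
    """Greedy longest-match conversion of a word to a phone sequence."""
    phones, i = [], 0
    while i < len(word):
        for size in (2, 1):                 # try digraphs before single letters
            chunk = word[i:i + size]
            if chunk in RULES:
                phones.append(RULES[chunk])
                i += size
                break
        else:
            i += 1                          # skip letters with no rule
    return phones

print(g2p("chica"))   # ['tʃ', 'i', 'k', 'a']
```

A dictionary lookup for exceptions (homographs, loanwords) would be consulted before falling back to rules of this kind, which is the hybrid strategy described above for less phonetic orthographies such as English.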
In TTS systems, prosodic analysis involves examining prosodic features within the text input, such as stress, duration, pitch, and intensity. This information is then utilized to generate more natural and expressive speech. Prosodic analysis helps determine the appropriate stress, intonation, and rhythm for the synthesized speech, resulting in a more human-like output. Prosodic features can be predicted through rule-based or machine-learning methods, including acoustic modeling and statistical parametric speech synthesis. By adjusting the synthesized speech, TTS systems can convey various emotions or speaking styles, enhancing their versatility and effectiveness across diverse applications.
Speech synthesis employs the predicted information from the fully tagged phonetic sequence to generate the corresponding speech waveform. Broadly, the two traditional speech synthesis techniques are concatenative and source/filter synthesizers. Concatenative synthesizers assemble pre-recorded human speech components to produce the desired utterance. In contrast, source/filter synthesizers create synthetic voices using a source/filter model based on a parametric description of speech. The first method struggles to generate high-quality speech from a parametric representation of the input text and speech parameters. Meanwhile, the second approach requires a combination of algorithms and signal-processing adjustments to ensure smooth and continuous speech, particularly at junctures.
Several improvements have been proposed for high-quality text-to-speech (TTS) systems, drawing from the two fundamental speech synthesis techniques. Among the most prominent state-of-the-art methods are statistical parametric speech synthesis and unit-selection techniques, which have been the subject of extensive debate among researchers in the field.
Figure 1.2 Neural TTS architecture [3]
With the advancement of deep learning, neural network-based TTS (neural TTS) systems have been proposed, utilizing (deep) neural networks as the core model for speech synthesis. A neural TTS system comprises three fundamental components: a text analysis module, an acoustic model, and a vocoder. As illustrated in Figure 1.2, the text analysis module transforms a text sequence into linguistic features. The acoustic model then generates acoustic features from these linguistic features, and finally, the vocoder synthesizes the waveform from the acoustic features.
1.1.3 Evolution of TTS methods over time
The evolution of TTS methods has progressed significantly over time, with advancements in technology and research contributing to more natural and intelligible speech synthesis. Early TTS systems relied on rule-based methods and simple concatenation techniques, which have since evolved into sophisticated machine learning approaches, including neural network-based TTS systems. These modern systems offer improved speech quality, prosody, and adaptability, resulting in more versatile applications across various industries.
1.1.3.1 TTS using unit-selection method
The unit-selection approach allows for the creation of new genuine-sounding utterances by picking relevant sub-word units from a natural speech database [4], based on how well a chosen unit matches a specification (a target unit) and how well two chosen units join together. During synthesis, an algorithm chooses one unit from the available options to discover the best overall sequence of units that meets the specification [1]. The specification and the units are described by a feature set that includes linguistic and speech elements. The feature set is used to perform a Viterbi-style search to determine the sequence of units with the lowest total cost.
Although they are theoretically quite similar, the review of Zen [4] suggests that there are two fundamental methods in unit-selection synthesis: (i) the selection model [5], shown in Figure 1.3a; and (ii) the clustering approach [6], shown in Figure 1.3b, which effectively enables the target cost to be pre-calculated. The second method asks questions about features available at the time of synthesis and groups units of the same type into a decision tree.
In the selection model for TTS synthesis, speech units are chosen based on a cost function calculated in real time during the synthesis process. This cost function considers the acoustic and linguistic similarity between the target text and the available speech units in the database, selecting the unit with the lowest cost for synthesis. Conversely, the clustering approach pre-calculates the cost for each speech unit, grouping similar units into a decision tree. This tree allows for rapid speech-unit selection during synthesis based on available features, reducing real-time computation and resulting in faster, more efficient TTS synthesis. Both methods have their advantages and disadvantages, with the selection model offering greater flexibility for adapting to different languages and voices and the clustering approach providing enhanced speed and efficiency. The choice between these methods depends on the specific needs of the TTS system being developed.
Figure 1.3 General and clustering-based unit-selection scheme: solid lines represent target costs and dashed lines represent concatenation costs [13]
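To make the cost formulation concrete, the following Python sketch implements the Viterbi-style search described above: dynamic programming over candidate units that minimizes the sum of target costs and concatenation costs. The cost functions and the numeric "units" in the usage example are toy stand-ins, not the costs of any cited system.

```python
# Viterbi-style unit selection sketch: choose one candidate unit per target
# position so that the total target + concatenation cost is minimal.

def select_units(targets, candidates, target_cost, concat_cost):
    """targets: list of target specifications.
    candidates: per-target lists of candidate units.
    Returns the minimal-cost unit sequence via dynamic programming."""
    # paths[t][j] = (cost of best path ending in candidates[t][j], backpointer)
    paths = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for t in range(1, len(targets)):
        cur = []
        for u in candidates[t]:
            tc = target_cost(targets[t], u)
            cost, back = min(
                (paths[t - 1][j][0] + concat_cost(prev, u) + tc, j)
                for j, prev in enumerate(candidates[t - 1])
            )
            cur.append((cost, back))
        paths.append(cur)
    # Backtrack from the cheapest final state.
    j = min(range(len(paths[-1])), key=lambda k: paths[-1][k][0])
    seq = []
    for t in range(len(targets) - 1, -1, -1):
        seq.append(candidates[t][j])
        j = paths[t][j][1] if t > 0 else j
    return list(reversed(seq))

# Toy usage with numeric "units": target cost = distance to the target value,
# concatenation cost = small penalty on jumps between consecutive units.
targets = [1.0, 2.0, 3.0]
cands = [[0.5, 1.2], [1.8, 2.6], [2.9, 3.5]]
seq = select_units(targets, cands,
                   target_cost=lambda t, u: abs(t - u),
                   concat_cost=lambda a, b: 0.1 * abs(b - a))
print(seq)  # [1.2, 1.8, 2.9]
```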
1.1.3.2 Statistical parameter speech synthesis
In a typical statistical parametric speech synthesis system, a set of generative models is used to model the parametric speech representations extracted from a speech database, including spectral and excitation parameters (also known as vocoder parameters, as they are used as inputs to the vocoder). The model parameters are frequently estimated using the Maximum Likelihood (ML) criterion. Then, to maximize their output probabilities, speech parameters are generated from the estimated models for a specific word sequence to be synthesized. Finally, a speech waveform is built from the parametric representations of speech [4]. Any generative model can be employed; however, HMMs are the best known. In HMM-based speech synthesis (HTS) [7], context-dependent HMMs statistically model and produce the speech parameters of a speech unit, such as the spectrum and excitation parameters (for example, the fundamental frequency, F0). The core architecture of a typical HMM-based speech synthesis system, shown in Figure 1.4 [8], consists of two main processes: training and synthesis.
The Expectation-Maximization (EM) algorithm is used to perform the ML estimation (MLE) during training, as in speech recognition. The primary distinction is that the excitation and spectrum parameters are extracted from a database of natural speech that is modeled by a collection of multi-stream context-dependent HMMs. Excitation parameters include log F0 and its dynamic properties.
Another distinction is the addition of prosodic and linguistic contexts to the phonetic contexts (called contextual features). The state-duration distribution for each HMM is also used to describe the temporal structure of speech. The Gamma distribution and the Gaussian distribution are options for state-duration distributions. To estimate them, the forward-backward method uses statistical data gathered during the previous iteration.
Figure 1.4 Core architecture of HMM-based speech synthesis system [25]
An inverse speech recognition procedure is carried out during the synthesis process. First, a given word sequence is transformed into a context-dependent label sequence, and the utterance HMM is built by concatenating the context-dependent HMMs according to this label sequence. Second, the speech parameter generation algorithm creates spectral and excitation parameter sequences from the utterance HMM. The obtained spectral and excitation parameters are then used to create a speech waveform using a speech synthesis filter and a vocoder with a source-excitation/filter model [4].
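In compact form, the two processes can be summarized as follows; this is a standard SPSS formulation stated here for clarity rather than an equation quoted from the thesis:

$$\hat{\lambda} = \arg\max_{\lambda}\, p(\mathbf{o} \mid \mathbf{w}, \lambda), \qquad \hat{\mathbf{o}} = \arg\max_{\mathbf{o}}\, p(\mathbf{o} \mid \mathbf{w}_{\text{syn}}, \hat{\lambda})$$

where $\mathbf{o}$ is the sequence of vocoder parameters extracted from the speech database, $\mathbf{w}$ the context-dependent labels of the training sentences, $\mathbf{w}_{\text{syn}}$ the label sequence derived from the text to be synthesized, and $\lambda$ the set of HMM parameters.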
Figure 1.5 General HMM-based synthesis scheme [13, p 5]
Figure 1.5 illustrates the general scheme of HMM-based synthesis [4, p 5]. An HMM-based TTS system defines a feature system and trains a different model for each feature combination. Because each feature combination has its own context dependency, the spectrum, excitation, and duration are all modeled simultaneously in an integrated framework of HMMs. Due to the combinatorial explosion of contextual information, their parameter distributions are clustered independently and contextually using phonetic decision trees. The models corresponding to the entire context label sequence, which is predicted from the text, are concatenated to create the speech parameters. The duration model is used to select a state sequence prior to producing parameters; this establishes the number of frames that will be produced from each model state. Actual speech, where the fluctuations in speech characteristics are smoother, does not fit this piecewise structure well.
1.1.3.3 Speech synthesis using deep neural networks
A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers between the input and output layers [9], [10]. DNNs can model complex non-linear relationships. DNN architectures generate compositional models where the object is expressed as a layered composition of primitives [11]. The extra layers enable the composition of features from lower layers, potentially modeling complex data with fewer units than a similarly performing shallow network [9].
Deep architectures contain numerous variations on a few fundamental ideas. Each architecture has achieved success in particular fields. Unless they have been tested on the same data sets, comparing the performance of different architectures is seldom feasible. DNNs are typically feedforward networks in which information moves straight from the input layer to the output layer.
Figure 1.6 illustrates a speech synthesis framework based on a DNN. A given text to be synthesized is first converted to a sequence of input features {x_t^n}, where x_t^n denotes the n-th input feature at frame t. The input features include binary answers to questions about linguistic contexts and numeric values [12].
Figure 1.6 A speech synthesis framework based on a DNN [29]
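As a concrete sketch of the framework in Figure 1.6, the short PyTorch example below maps a frame of linguistic features to a frame of acoustic features with a feedforward network; the layer sizes and feature dimensions are illustrative assumptions, not the configuration of [12].

```python
import torch
import torch.nn as nn

# Feedforward DNN acoustic model sketch: per-frame linguistic features in,
# per-frame acoustic (vocoder) features out. All sizes are illustrative.
LINGUISTIC_DIM = 300   # binary context questions + numeric features
ACOUSTIC_DIM = 187     # e.g., spectral + excitation parameters per frame

model = nn.Sequential(
    nn.Linear(LINGUISTIC_DIM, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, ACOUSTIC_DIM),  # linear output layer for regression
)

# One training step on random stand-in data (frames x features).
x = torch.randn(64, LINGUISTIC_DIM)   # 64 frames of linguistic features
y = torch.randn(64, ACOUSTIC_DIM)     # matching acoustic targets
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```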
1.1.3.4 Neural speech synthesis
As deep learning evolves, neural network-based TTS (neural TTS for short) has been proposed, which uses (deep) neural networks as the backbone of the speech synthesis model, as shown in Figure 1.2. Early neural models were incorporated into SPSS to replace HMMs for acoustic modeling. Later, WaveNet [13] was proposed to produce waveforms directly from linguistic information, making it the first contemporary neural TTS model. Other models, such as DeepVoice 2 [14], adhere to the three components of statistical parametric synthesis but enhance them with neural network-based models. Moreover, several end-to-end models (e.g., Tacotron 2 [15, p 2], Deep Voice 3 [16], and FastSpeech 2 [17]) have been proposed to simplify text analysis modules, directly accept character/phoneme sequences as input, and simplify acoustic features with Mel-spectrograms. Later, end-to-end TTS systems such as ClariNet [18], FastSpeech 2 [17], and EATS [19] were created to generate waveforms directly from text. The advantages of neural network-based speech synthesis over prior TTS systems based on concatenative synthesis and statistical parametric synthesis include high voice quality in terms of both intelligibility and naturalness, and a reduced need for human preprocessing and feature engineering.
The End-to-End [15] method proposed by Google in 2017 is based on the Seq2Seq model widely used in machine translation. Seq2Seq includes two components: an encoder and a decoder. Both components are neural networks. The encoder converts the input data (input sequence) into an internal representation, while the decoder is responsible for generating the output sound from the linguistic characteristics created by the encoder. This is the best speech synthesis method today; a typical example is the Tacotron system, which produces voices closest to natural human voices.
The End-to-End method has the advantage of having fewer processing modules, so the discrepancy between the predicted results and the input is small, resulting in voice quality closest to natural. However, the downside of this method is that the amount of data needed to train the model is enormous, along with training times of tens of hours or even weeks, and it requires tremendous computing power. Therefore, the cost of building these systems is enormous [19].
Figure 1.7 Encoder and Decoder diagram in Seq2Seq model
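The following PyTorch sketch makes the encoder/decoder split of Figure 1.7 concrete in its simplest recurrent form. It is a generic Seq2Seq skeleton with illustrative dimensions; it deliberately omits the attention mechanism that practical systems such as Tacotron add on top.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Encodes a character/phoneme id sequence into hidden representations."""
    def __init__(self, vocab_size=64, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)

    def forward(self, ids):
        out, state = self.rnn(self.embed(ids))
        return out, state            # per-step features and final state

class Decoder(nn.Module):
    """Autoregressively emits acoustic frames from the encoder state."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, prev_frame, state):
        out, state = self.rnn(prev_frame, state)
        return self.proj(out), state

enc, dec = Encoder(), Decoder()
ids = torch.randint(0, 64, (1, 20))      # 20 input symbols
_, state = enc(ids)
frame = torch.zeros(1, 1, 80)            # all-zero "go" frame
mels = []
for _ in range(100):                     # generate 100 mel frames
    frame, state = dec(frame, state)
    mels.append(frame)
mel = torch.cat(mels, dim=1)             # (1, 100, 80)
print(mel.shape)
```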
Fully end-to-end TTS models can generate the speech waveform straight from a sequence of characters or phonemes, which, among other benefits, can cut training, development, and deployment costs [3].
However, training TTS models end-to-end is challenging, primarily due to the differences in modalities between text and speech waveforms and the significant length disparity between character/phoneme sequences and waveform sequences. For a 5-second utterance with approximately 20 syllables, the length of the phoneme sequence is around 100, while the length of the waveform sequence is 80,000 (5 s × 16,000 samples/s, assuming a 16 kHz sample rate). Memory constraints make it difficult to include the waveform points of an entire utterance during model training. Additionally, capturing context representations is problematic when using only a short audio clip for end-to-end training.
In end-to-end models [3], only text normalization and grapheme-to-phoneme conversion are preserved to transform characters into phonemes, or the entire text analysis module can be omitted by directly taking characters as input. Acoustic features are simplified, with complex characteristics such as MGC, BAP, and F0 employed in SPSS consolidated into Mel-spectrograms. Additionally, two or three modules can be replaced with a single end-to-end model. For instance, acoustic models and vocoders can be substituted by a unified vocoder model, like WaveNet.
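As an illustration of this simplification, a Mel-spectrogram target can be computed from a waveform in a few lines with librosa; the frame settings below (1024-point FFT, 256-sample hop, 80 mel bands) are common choices in neural TTS rather than values prescribed by the thesis.

```python
import numpy as np
import librosa

# Load a waveform at 22.05 kHz and compute an 80-band log-Mel-spectrogram,
# a typical acoustic target for neural TTS acoustic models.
y, sr = librosa.load(librosa.ex("trumpet"), sr=22050)   # any audio file works
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel, ref=np.max)          # shape: (80, n_frames)
print(log_mel.shape)
```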
Numerous advanced vocoders have emerged in neural TTS systems to enhance speech synthesis quality. One prominent example is WaveNet [13], created by Google's DeepMind. WaveNet is a sophisticated generative model that employs convolutional neural networks (CNNs) to generate raw audio waveforms directly. By modeling temporal dependencies in audio data using a large receptive field, it attains high-quality and natural-sounding speech. The success of WaveNet has generated significant interest in the field of speech synthesis, laying the groundwork for future advancements.
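The central WaveNet ingredient, dilated causal convolution whose receptive field grows exponentially with depth, can be sketched as follows; this toy stack only demonstrates the mechanics and is not DeepMind's implementation (which adds gated activations, residual connections, and skip connections).

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """Toy stack of dilated causal 1-D convolutions (WaveNet-style).
    With dilations 1, 2, 4, ..., the receptive field grows exponentially."""
    def __init__(self, channels=16, n_layers=8, kernel_size=2):
        super().__init__()
        self.layers = nn.ModuleList()
        self.pads = []
        for i in range(n_layers):
            d = 2 ** i
            self.pads.append((kernel_size - 1) * d)   # left-pad => causal
            self.layers.append(
                nn.Conv1d(channels, channels, kernel_size, dilation=d))

    def forward(self, x):                             # x: (batch, C, T)
        for pad, conv in zip(self.pads, self.layers):
            x = conv(nn.functional.pad(x, (pad, 0)))  # pad the past only
        return x

net = DilatedCausalStack()
x = torch.randn(1, 16, 1000)
print(net(x).shape)   # (1, 16, 1000): same length, strictly causal context
# Receptive field: 1 + sum((k - 1) * 2**i for i in 0..7) = 256 samples.
```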
Another prevalent approach to vocoding in neural TTS involves the use of generative adversarial networks (GANs), which has given rise to the GAN-based vocoder family. HiFi-GAN [20], a distinguished example within this group, produces high-fidelity speech from input acoustic features. By employing a multi-scale generative network and a multi-resolution discriminator, it captures both local and global structures in the generated audio, yielding high-quality and natural-sounding speech. The adversarial training process in GAN-based vocoders contributes to refining the synthesized speech, making it more authentic and expressive.
Both WaveNet and GAN-based vocoders, such as HiFi-GAN, have significantly contributed to the advancement of neural TTS. They offer more natural and higher-quality synthesized speech, enabling TTS systems to be more versatile and effective across various applications, including virtual assistants, audiobook narration, and accessibility services for visually impaired users.
Fully end-to-end TTS models, which generate speech waveforms directly from character
or phoneme sequences, offer several advantages over traditional cascaded approaches:
- They require less human annotation and feature development, such as alignment information between text and speech, reducing the need for labor-intensive manual work.
- Joint, end-to-end optimization can prevent the error propagation common in cascaded pipelines of text analysis, acoustic models, and vocoders. End-to-end models can achieve more accurate and efficient speech synthesis by streamlining the process.
- These models can reduce training, development, and deployment costs, making them a more attractive option for various applications.
Overall, fully end-to-end TTS models present a promising direction for the future of speech synthesis technology.
Some notable examples of fully end-to-end TTS models include:
- WaveNet [13]: A Generative Model for Raw Audio. Developed by DeepMind, WaveNet introduced a novel deep learning architecture for speech synthesis, generating speech waveforms directly from the input. Based on PixelCNN, it employs a dilated causal convolutional network to model temporal dependencies in the audio signal, producing high-quality, natural-sounding speech that has been widely adopted in applications such as Google Assistant.
- VITS [21]: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. VITS is a state-of-the-art end-to-end TTS model that combines the strengths of conditional variational autoencoders (CVAEs) and adversarial learning. It leverages variational inference and generative adversarial networks (GANs) to produce high-quality, natural-sounding speech directly from text input. By jointly optimizing both components, VITS prevents the error propagation common in cascaded models and balances speech quality with training efficiency.
- NaturalSpeech [22]: End-to-End Text-to-Speech Synthesis with Human-Level Quality. Developed by Microsoft, NaturalSpeech aims to generate human-level-quality speech using a fully end-to-end approach, synthesizing speech waveforms directly from character or phoneme sequences. By eliminating intermediate steps such as text analysis, acoustic models, and vocoders, NaturalSpeech streamlines the synthesis process, leading to more accurate and efficient results. This advanced TTS model has the potential to transform the way human-like speech is generated from text, opening up new possibilities for various applications.
These fully end-to-end TTS models exemplify the ongoing advances in speech synthesis technology, showcasing the potential for more natural and efficient generation of human-like speech in the future.
Identifying suitable features for synthesis can be accomplished using deep learning with neural networks, which can derive appropriate features starting from the character level of the input text. In end-to-end speech synthesis, the network accepts a string of characters as input, processes it through its hidden layers, and generates an output audio signal. Several architectures have been proposed for such end-to-end synthesis networks. The advantages of these methods include a lightweight processing pipeline, ease of adaptation to new data, and increased robustness compared with systems of multiple interconnected modules, since errors do not accumulate. However, direct text-to-speech conversion is challenging because the same text can correspond to multiple pronunciations or speech patterns, so training should cover as many signal-level variations for a given input text as possible.
a) Char2Wav speech synthesis system
The Char2Wav speech synthesis system, developed by Sotelo et al. [23], aims to construct an end-to-end system trained with a character string as input. Char2Wav consists of two components: an encoder-decoder model with an attention mechanism serving as the reader, and a neural vocoder (Figure 1.8). The encoder is a bidirectional recurrent neural network that takes text or phoneme strings as input, and the decoder is an attention-based recurrent neural network. The decoder's outputs are intermediate feature representations that feed a SampleRNN neural vocoder; this intermediate representation comprises the vocoder features used in the WORLD vocoder. The SampleRNN network is extended to accept previously generated audio samples and the feature frames from the decoder as input, and the system's final output consists of raw acoustic waveform samples. Consequently, Char2Wav also generates audio directly from text. However, it still relies on WORLD vocoder features, and the sequence-to-sequence and SampleRNN models require separate pre-training.
Figure 1.8 Char2Wav model [23]
b) Tacotron synthesis system
Tacotron is an end-to-end speech synthesis system that takes text input directly [24]. The input is a character string, while the output is the spectrogram of the signal. The model is trained from scratch on <text, speech> pairs. Moreover, Tacotron generates speech at the frame level, which is faster than the sample-level generation methods mentioned previously.
Figure 1.9 Model of the Tacotron synthesis system [24]
Figure 1.9 depicts the system's architecture, including an encoder based on CBHG (1-D Convolution Bank + Highway network + bidirectional Gated Recurrent Unit) and a decoder based on the attention mechanism; each part connects several architectural components. The encoder output is transformed through a CBHG-based post-processing network to predict the sampled spectral magnitudes on a linear frequency scale, and the Griffin-Lim algorithm is then used to reconstruct the output audio from this spectrogram. This synthesis does not require features from previous TTS systems as WaveNet does; technical details are presented in [24]. However, spectrograms represent speech but carry no phase information. Therefore, the Griffin-Lim algorithm is used to estimate the phase, after which an inverse short-time Fourier transform produces the waveform. The audio quality obtained this way is limited, and the Tacotron team acknowledged that the spectrogram-to-waveform converter still needed improvement.
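The phase-reconstruction step just described can be sketched with librosa, whose `griffinlim` function implements the algorithm. The STFT parameters are illustrative assumptions; here the magnitude spectrogram is computed from a bundled example signal (downloaded on first use) rather than predicted by a network.

```python
import numpy as np
import librosa

# Stand-in for a predicted linear-frequency magnitude spectrogram:
# computed from a real signal purely for demonstration.
y, sr = librosa.load(librosa.example('trumpet'), sr=16000)
magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Griffin-Lim iteratively estimates the phase, then an inverse STFT
# converts the complex spectrogram back to a waveform.
y_rec = librosa.griffinlim(magnitude, n_iter=60,
                           n_fft=1024, hop_length=256)
print(y_rec.shape)
```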
By the end of 2017, Tacotron had been developed into version 2 [25], overcoming the drawback of the spectrogram-to-waveform conversion. The system consists of (1) a recurrent sequence-to-sequence feature prediction network that maps character embeddings to spectrogram representations, and (2) a modified WaveNet model that generates the waveform from these spectrogram representations. Using a spectrogram instead of the traditional WaveNet input features, such as linguistic information, duration, and F0, significantly reduces the size of the WaveNet network. The architecture of the system is depicted in Figure 1.10.
Figure 1.10 Block diagram of the Tacotron 2 system architecture [25]
Tacotron 2 is trained directly on normalized character sequences and the corresponding acoustic waveforms. It is considered to produce synthetic voices of natural, human-like quality. In addition, Tacotron 2 can handle out-of-domain and complex words, learn pronunciation from sentence semantics, pronounce misspelled input words correctly, and learn sentence intonation.
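A schematic sketch of the two-stage pipeline described above follows: a feature-prediction network maps character embeddings to spectrogram frames, and a neural vocoder turns those frames into a waveform. Both stages are stubbed with tiny placeholder modules (the class names `FeaturePredictionNet` and `Vocoder` are invented here); only the interface between the stages reflects the real system.

```python
import torch
import torch.nn as nn

class FeaturePredictionNet(nn.Module):   # stands in for seq2seq + attention
    def __init__(self, vocab=64, n_mels=80):
        super().__init__()
        self.emb = nn.Embedding(vocab, n_mels)

    def forward(self, chars):            # (batch, chars) -> (batch, frames, mels)
        return self.emb(chars)           # toy: one frame per character

class Vocoder(nn.Module):                # stands in for the modified WaveNet
    def __init__(self, n_mels=80, upsample=256):
        super().__init__()
        self.up = nn.ConvTranspose1d(n_mels, 1, kernel_size=upsample,
                                     stride=upsample)

    def forward(self, mels):             # (batch, frames, mels) -> waveform
        return self.up(mels.transpose(1, 2)).squeeze(1)

text = torch.randint(0, 64, (1, 30))     # a toy character sequence
mels = FeaturePredictionNet()(text)      # stage 1: spectrogram prediction
wave = Vocoder()(mels)                   # stage 2: waveform generation
print(mels.shape, wave.shape)            # 30 frames -> 7680 samples
```

The key design point is the spectrogram interface between the two stages: the vocoder no longer needs linguistic features, durations, or F0, only frames it can upsample to audio rate.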
1.2 Speech synthesis for low-resourced languages
The development of interactive systems for under-resourced languages [26] faces challenges due to the scarcity of data and the limited research in this area. The SLTU-CCURL2 workshops and SIGUL3 meetings aim to gather researchers working on speech and NLP for these languages to exchange ideas and experiences. These events foster innovation and encourage cross-disciplinary collaboration between fields such as computer science, linguistics, and anthropology. The focus is on promoting the development of spoken language technologies for low-resourced languages, covering topics such as speech recognition, text-to-speech synthesis, and dialogue systems. By bringing together academic and industry researchers, these meetings help address the challenges of under-resourced language processing.
Many investigations of low-resourced languages have been conducted recently using a variety of methods, including applying speaker characteristics [27], modifying phonemic features [28], [29], and cross-lingual text-to-speech [30], [31]. Yuan-Jui Chen et al. introduced end-to-end TTS with cross-lingual transfer learning [32]. The authors proposed a method to learn a mapping between source and target linguistic symbols, because a model trained on the source language cannot be applied directly to the target language due to input space mismatches. Using this learned mapping, pronunciation information can be preserved throughout the transfer process.
2 http://sltu-ccurl-2020.ilc.cnr.it/
3 https://sigul-2022.ilc.cnr.it/
Sahar Jamal et al. [33] used transfer learning to cope with the low-resourced scenario: knowledge from pre-trained Tacotron models of English and Arabic, used as parent models, is transferred, and the model is then trained with a significantly smaller collection of Urdu data, producing standalone Urdu systems. Marlene Staib et al. [34] improved on or matched the performance of many baselines, including a resource-intensive expert mapping technique, by swapping out Tacotron 2's character input for a manageably small set of IPA-inspired features. This architecture also enables the automatic approximation of sounds unseen in training, and they demonstrated that a model trained on one language can produce intelligible speech in a target language even in the absence of acoustic training data. A similar transfer-learning approach [35] fine-tunes a high-resource English source model with either 15 minutes or 4 hours of transcribed German data. Data augmentation is another approach that researchers apply to the low-resourced language challenge [36]–[38]. An innovative three-step methodology has been developed for building expressive-style voices with as little as 15 minutes of recorded target data, circumventing the costly operation of capturing large amounts of target recordings. First, Goeric Huybrechts et al. [36] augment the data using recordings of other speakers whose speaking styles match the desired one; next, they use the synthetic data together with the available recordings to train a TTS model; finally, the model is fine-tuned to improve quality. Muthukumar and colleagues developed a technique for automatically constructing phonetic representations for unwritten languages [39]; synthesis may be improved by switching to a representation closer to spoken language than written language.
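The fine-tuning recipe recurring in these studies can be sketched as follows: take a model pre-trained on a rich-resource language, freeze part of it, and fine-tune the rest on a small target-language set. The model here is a two-layer stand-in, the checkpoint file name is hypothetical, and the choice of which part to freeze is an illustrative assumption.

```python
import torch
import torch.nn as nn

model = nn.ModuleDict({
    'encoder': nn.Linear(128, 128),          # placeholder text encoder
    'decoder': nn.Linear(128, 80),           # placeholder acoustic decoder
})
# Hypothetical checkpoint from rich-resource pre-training:
# model.load_state_dict(torch.load('pretrained_source_language.pt'))

for p in model['encoder'].parameters():      # keep source-language knowledge
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

# Tiny synthetic stand-in for "15 minutes" of target-language data.
inputs, targets = torch.randn(64, 128), torch.randn(64, 80)
for step in range(100):                      # fine-tuning loop
    pred = model['decoder'](model['encoder'](inputs))
    loss = nn.functional.mse_loss(pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```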
The main challenges to address when developing TTS for under-resourced languages are: (1) synthesizing speech for languages with a writing system but limited data; and (2) synthesizing speech for languages without a writing system, using input text or speech from another language. Key research directions for tackling these challenges, such as the adaptation and polyglot approaches, are discussed in detail in the following sections.
1.2.1 TTS using the emulating input approach
The rationale behind this approach is to leverage an existing TTS system for a base language (BL) to simulate TTS for an unsupported target language (TL). This strategy aims to assist speakers of unsupported languages when communicating in another language is inconvenient, such as when new immigrants visit a doctor. While TTS plays a role in translating doctor-patient conversations, text-based communication is also essential in healthcare. Consequently, TTS becomes necessary for users with limited English proficiency or literacy skills in their native language, enabling them to access and understand vital information [40].
The first emulating idea was given by Evans et al. [41], who developed the emulator to fit a screen reader. They describe a method that enables the production of text-to-speech synthesizers for new languages in assistive applications. The method employs a straightforward rule-based text-to-phoneme step, and the phonemes are then passed to a phoneme-to-speech system for another language. They demonstrate that the correspondence between the language to be synthesized and the language on which the phoneme-to-speech system is based is crucial for the perceived quality of speech, but not necessarily for speech comprehension. They report experiments in Greek but note that the same method can be applied with equal success to Albanian, Czech, Welsh, and other languages.
Three primary challenges exist in simulating a target language (TL) using a base language (BL). First, it is essential to choose BL phonemes that closely resemble the TL's phonemes. Second, discrepancies in the text-to-phoneme mapping must be minimized. Lastly, a BL must be selected whose linguistic features closely align with those of the TL. These three challenges can lead to different approaches, and the balance ultimately achieved is significantly influenced by the decisions made regarding the BL [40].
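The first challenge, substituting each TL phoneme with its closest BL phoneme, amounts to a lookup table plus a rewrite pass, as in the toy sketch below. The mapping is invented purely for illustration and does not describe any real language pair.

```python
# Invented TL -> BL substitutions; real tables come from phonetic expertise.
TL_TO_BL = {
    'ɗ': 'd',     # TL implosive approximated by a plain BL stop
    'ɲ': 'n j',   # TL palatal nasal split into two BL phonemes
    'ɯ': 'u',     # TL unrounded back vowel mapped to its rounded neighbour
}

def emulate(tl_phonemes):
    """Rewrite a TL phoneme sequence so a BL synthesizer can speak it."""
    return ' '.join(TL_TO_BL.get(p, p) for p in tl_phonemes)

print(emulate(['ɗ', 'a', 'ɲ']))   # -> "d a n j"
```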
In the study by Evans et al. [41], the evaluation process has a unique aspect compared with conventional TTS assessment. This distinction is essential to understand, as it highlights the tailored approach needed for evaluating TTS systems in under-resourced languages. The MRT (Modified Rhyme Test) differs from the traditional MOS (Mean Opinion Score) assessment: the conventional MOS is a subjective evaluation method used to gauge the overall quality of speech synthesis systems, whereas the MRT focuses on the intelligibility and usability of the synthesized speech. This shift in focus makes the MRT a suitable evaluation method for under-resourced languages. The study used nonsensical words and simple sentence structures as test cases to evaluate the Greek TTS system. This approach was chosen because, for under-resourced languages, it is crucial to ensure that the TTS system can generate clear and understandable speech even when faced with unusual or uncommon linguistic structures. By using these "fake" cases, the evaluation can better assess the system's performance and robustness in challenging situations.
Harold Somers and his colleagues proposed an "emulating" approach for developing TTS systems for under-resourced languages, as explored in their publications [40] and [42]. They aimed to create a TTS system for Somali, an under-resourced language, by leveraging an existing TTS system for a well-resourced language. The researchers also discussed various experimental designs for assessing TTS systems developed with this approach, emphasizing the evaluation of speech quality, intelligibility, and usefulness. The method reuses existing resources from well-resourced languages, showing potential for developing TTS systems for under-resourced languages. By investigating different experimental designs and evaluation methods, researchers can better understand the challenges, opportunities, and limitations of this approach.
The advantages and disadvantages of the "emulating" approach for low-resourced languages, as well as its applicability, can be summarized as follows:
Advantages:
- Resource efficiency: by leveraging existing TTS systems for rich-resourced languages, the need for extensive data collection and development effort is reduced.
- Faster development: utilizing existing resources accelerates the development of TTS systems for low-resourced languages.
- Cross-disciplinary collaboration: the "emulating" approach fosters collaboration among researchers in fields such as computer science, linguistics, and anthropology.
Disadvantages:
- Speech quality: synthesized speech quality may be compromised by the mismatch between the base and target languages.
- Intelligibility: depending on the similarity between the base and target languages, the intelligibility of the generated speech might be limited.
- Customizability: the "emulating" approach might not suit every low-resourced language, especially when no closely related rich-resourced language is available as a base.
Applicability:
- Languages with similar phonetic or linguistic characteristics: the "emulating" approach is most applicable when the target low-resourced language shares phonetic or linguistic features with a well-resourced language.
- Situations requiring rapid TTS system development: when a TTS system is urgently needed for a low-resourced language, the "emulating" approach can provide a quicker solution than traditional methods.
- Initial system development: the "emulating" approach can serve as a starting point for a more refined TTS system for low-resourced languages, allowing researchers to identify specific challenges and opportunities for improvement.
In summary, the "emulating" approach presents a promising direction for developing TTS systems for low-resourced languages. However, its success depends on selecting a suitable base language and overcoming the limitations inherent in this method.
1.2.2 TTS using the polyglot approach
Polyglot TTS and multilingual TTS are often used interchangeably, but they can have slightly different meanings depending on the context:
- Polyglot TTS: a single TTS model is trained to handle multiple languages simultaneously. The model can synthesize speech in various languages using the same architecture and shared parameters. The polyglot approach aims to leverage commonalities among languages and transfer knowledge from rich-resourced languages to low-resourced languages, and it can be more resource-efficient and scalable than building separate TTS models for each language.
- Multilingual TTS: a broader term referring to any TTS system capable of handling multiple languages, regardless of the specific architecture or method used. A multilingual TTS system can comprise separate TTS models for each language or use a shared model as in the polyglot approach. The main goal of multilingual TTS systems is to support speech synthesis in various languages.
In summary, polyglot TTS is a specific approach to building multilingual TTS systems in which a single model serves multiple languages. Multilingual TTS, on the other hand, is a more general term encompassing any TTS system capable of handling multiple languages, whether it uses separate models for each language or a shared model as in the polyglot approach.
Below are a few notable examples of this approach. The first considers one language as the primary language for building cross-linguistic polyglot TTS, as researched by Samsudin [43], [44]; any system using this framework can synthesize different languages with the same collection of recorded or trained voices. Next is synthesizing speech from mixed-language text, as described by H. Romsdorfer et al. [45]. This technique is advantageous when multiple languages appear within a single text, such as texts containing foreign-word inclusions. In these situations, swapping corpora (the datasets used to train TTS systems) for each language in the text would be impractical; polyglot speech synthesis resolves this issue by enabling TTS systems to synthesize speech from text containing multiple languages seamlessly and coherently. Polyglot speech synthesis relies on text analysis and language identification to discern the different languages present in the text and select the appropriate TTS voice for each, allowing the system to produce coherent and natural-sounding speech even when the text mixes languages. Overall, polyglot speech synthesis is a promising approach to the challenges of synthesizing speech from mixed-language text and can potentially enhance the quality and effectiveness of TTS systems for low-resourced languages.
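The identify-then-route pipeline just described can be sketched in a few lines. Both `detect_language` and `synthesize` are hypothetical placeholders standing in for a trained language identifier and per-language voices; they are not a real library API.

```python
def detect_language(span):
    """Placeholder: a real system would use a trained language identifier."""
    return 'fr' if span in {'rendez-vous', 'déjà'} else 'en'

def synthesize(span, lang):
    """Placeholder: route the span to the TTS voice trained for `lang`."""
    return f'<audio:{lang}:{span}>'

def polyglot_tts(text):
    audio = []
    for span in text.split():          # toy word-level segmentation
        lang = detect_language(span)
        audio.append(synthesize(span, lang))
    return ' '.join(audio)

print(polyglot_tts('our rendez-vous is at noon'))
```

A shared-parameter polyglot model replaces the routing step with a single network conditioned on a language identity, which is what allows knowledge transfer across languages.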
For a more detailed description of polyglot TTS, the architecture is given in Figure 1.11. In this figure, a speaker-adaptable polyglot voice synthesizer has two phases: training and synthesis. Section (1) shows the basic scheme of an HMM-based polyglot synthesizer and the cross-language case, where the speaker-independent (SI) model is adapted to a speaker of one of the languages included in the training data and text in any of these languages is synthesized. Sections (2) and (3) show the adaptation and synthesis of extrinsic languages using phone mapping: (2) adaptation to speakers of extrinsic languages, and (3) synthesis of extrinsic languages. During training, speech collections in the target languages are analyzed, and their spectral features are modeled with Hidden Markov Models (HMMs). The system creates speaker-independent HMMs and then uses speaker adaptation to improve the consistency of the synthesized speech quality. The second section of the diagram shows how the system adapts the phoneme mapping when the target language is not present in the training data. This architecture requires voice recordings and the participation of native speakers in creating the language materials.
Figure 1.11 Scheme of an HMM-based polyglot synthesizer [48]
Recently, with the explosion of neural technology, the approach to creating polyglot TTS systems has undergone significant changes, yielding better results. This is evident in the development of voice-cloning-based polyglot neural TTS (NTTS) systems. A common technique involves training multilingual NTTS models using only monolingual datasets. In training these models, it is crucial to understand how the composition of the training corpus affects the quality of multilingual speech synthesis. For example, given the close relationship between Spanish and Italian, a typical question is, "Would adding more Spanish data improve my Italian synthesis?" Ziyao Zhang et al. [46] carried out an extensive ablation study to determine how training corpus characteristics, such as language family affiliation, gender composition, and the number of speakers, affect the quality of polyglot synthesis. Their findings include the observation that most cases favour female speaker data, and that having more speakers of the target language variety in the training corpus is not always advantageous. These insights are informative for data acquisition and corpus development.
In summary, polyglot TTS systems have shown great potential in addressing the challenges of synthesizing speech for multilingual and mixed-language texts. These systems use a single model or shared architecture and parameters across multiple languages to exploit commonalities and facilitate knowledge transfer from rich-resourced to low-resourced languages. This approach is more resource-efficient and scalable than creating separate TTS models for each language.
Research by Samsudin [43], [44], H. Romsdorfer et al. [45], and Ziyao Zhang et al. [46] demonstrates that polyglot TTS systems can generate coherent and natural-sounding speech, even when dealing with mixed-language texts. Furthermore, the development of voice-cloning-based polyglot NTTS systems and their use of monolingual datasets showcase the potential of neural technology to enhance the quality and effectiveness of TTS systems for low-resourced languages.
Polyglot TTS systems offer several advantages:
- Resource efficiency and scalability, resulting from a shared architecture and parameters.
- The ability to exploit similarities between languages and transfer knowledge from rich-resourced to low-resourced languages.
- Seamless and coherent handling of mixed-language texts.
However, polyglot TTS systems also have some drawbacks:
- Achieving natural-sounding speech for specific languages can be challenging, especially when the target language is underrepresented in the training corpus.
- A limited understanding of the optimal dataset composition for certain language pairs or families may lead to subpar synthesis quality.
Despite these challenges, polyglot TTS systems are highly applicable to low-resourced languages. The capability to synthesize speech in various languages using a shared architecture and parameters makes this approach particularly appealing for languages with limited resources. Insights from studies on training corpus characteristics and the potential of neural technology in polyglot TTS further underline the importance of this approach. Ultimately, polyglot TTS systems hold considerable promise for addressing the challenges of synthesizing speech for multilingual and mixed-language texts and for improving the quality and effectiveness of TTS systems for low-resourced languages.
1.2.3 Speech synthesis for low-resourced language using the adaptation approach
The adaptation approach for TTS systems, incorporating cross-lingual transfer, aims to improve speech synthesis for low-resourced languages by leveraging resources and knowledge from rich-resourced languages. Cross-lingual transfer can provide more natural-sounding speech with relatively limited data by adapting existing TTS models and parameters to accommodate the target low-resourced language. This method is particularly beneficial for languages that lack extensive training data, as it allows the development of TTS systems without requiring large, language-specific datasets. Cross-lingual transfer in the adaptation technique enhances the scalability and efficiency of creating TTS systems for low-resourced languages and promotes greater inclusivity in the realm of speech synthesis technology.
According to a survey by Xu Tan [47], although paired text and speech data is limited for low-resource languages, it is abundant for rich-resource languages. Since human languages share similar vocal organs, pronunciations [48], and semantic structures [49], pre-training TTS models on rich-resource languages can assist in mapping text to speech in low-resourced languages [32], [50]–[53]. Typically, the phoneme sets of rich-resource and low-resourced languages differ; as a result, Chen et al. [32] propose a method to map the embeddings between phoneme sets of different languages, helping to bridge the gap between them.
Figure 1.12 Approaches to transfer a TTS model from the source language to the target language [32]
Figure 1.12 illustrates the approaches to transferring TTS models from source languages to target languages: (a) separate symbol space, (b) unified symbol space, and (c) learned symbol space; (d) depicts the Phonetic Transformation Network (PTN) training scheme for obtaining the learned symbol space. The TTS model is first pre-trained on data from a rich-resource (source) language and is then adapted to low-resourced (target) languages. To address the input space mismatch across languages, a PTN model is employed to discover a mapping between source and target linguistic symbols based on their pronunciation; this learned mapping allows pronunciation information to be preserved throughout the transfer process. Objective and subjective tests indicate that only a small amount of paired data in the target language is needed for this transfer learning approach to generate intelligible speech. When the input linguistic symbols of both source and target languages are phonemes, the approach is competitive with transfer learning that uses a handcrafted mapping based on the International Phonetic Alphabet (IPA). Moreover, the symbol mapping remains applicable even when lexicons of the target languages are unavailable, allowing TTS to transfer from source languages with phoneme input to target languages with character input. Analytical studies demonstrate that the automatically discovered mapping correlates well with phonetic expertise, and experimental results reveal that the method produces significantly more natural-sounding speech than a model trained solely on target data, achieving promising results compared with methods that rely on strong linguistic expertise.
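The symbol-mapping idea of Figure 1.12(c) can be sketched as learning a transformation from target-language phoneme embeddings into the source model's embedding space, so the pre-trained TTS can be reused. This is a deliberate simplification of the PTN, not the authors' implementation; the vocabulary sizes and the invented symbol correspondences are assumptions for illustration only.

```python
import torch
import torch.nn as nn

src_emb = nn.Embedding(60, 256)              # frozen, from the pre-trained model
tgt_emb = nn.Embedding(45, 256)              # new target-language symbols
mapping = nn.Linear(256, 256)                # learned transformation

src_emb.weight.requires_grad = False         # keep the source space fixed

# Suppose some alignment procedure yields pairs of
# (target symbol, phonetically similar source symbol); invented here.
tgt_ids = torch.tensor([0, 1, 2, 3])
src_ids = torch.tensor([5, 9, 17, 30])

optim = torch.optim.Adam(list(tgt_emb.parameters()) +
                         list(mapping.parameters()), lr=1e-3)
for _ in range(200):
    # Pull mapped target embeddings toward their source counterparts.
    loss = nn.functional.mse_loss(mapping(tgt_emb(tgt_ids)),
                                  src_emb(src_ids))
    optim.zero_grad()
    loss.backward()
    optim.step()
```

Once trained, target-language symbols can be fed through `tgt_emb` and `mapping` into the frozen source model, which is the sense in which pronunciation knowledge is carried across languages.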
In summary, the adaptation approach for TTS in low-resourced languages offers several advantages and disadvantages, as well as various application scenarios:
Advantages:
- Reduced data requirements: the adaptation approach requires only a small amount of paired data from the target language, making it suitable for languages with limited resources.