
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER THESIS

Expressive Speech Synthesis

NGUYEN THI NGOC ANH

Anh.NTN211269M@sis.hust.edu.vn

School of Information and Communication Technology

Supervisor: Dr Nguyen Thanh Hung

Supervisor’s signature

School: Information and Communication Technology

18th May 2023


Graduation Thesis Assignment

Name: Nguyen Thi Ngoc Anh

Phone: +84342612379

Email: Anh.NTN211269M@sis.hust.edu.vn; ngocanh2162@gmail.com

Class: CH2021A

Affiliation: Hanoi University of Science and Technology

The work presented in this thesis was performed by myself under the supervision of Dr. Nguyen Thanh Hung. All the results presented in this thesis are truthful and are not copied from any other works. All references in this thesis, including images, tables, figures, and quotes, are clearly and fully documented in the bibliography. I will take full responsibility for even one copy that violates school regulations.

Student

Signature and Name


I’d like to take this opportunity to thank everyone who has been so supportive of me throughout my academic career. To begin, I’d like to thank Dr. Nguyen Thanh Hung for his unwavering support and encouragement throughout my master’s studies. His support and guidance have been instrumental in helping me achieve my academic goals.

In addition, I’d like to thank Dr. Nguyen Thi Thu Trang and her colleagues in Lab 914 for their assistance in completing the experiments. Their willingness to share their knowledge and skills has been very helpful to me, and I’ve learned a lot from them. Her knowledge, advice, and support were very important to my academic career, and I will always be grateful to her.

I also want to thank Dr. Do Van Hai and my other coworkers at Viettel Cyberspace Center for their constant help and support during my master’s studies. Their willingness to lend a hand and help me out when I needed it has been very important to me. I would not have been able to achieve academic success without their assistance.

Aside from my academic mentors, I’m thankful to my family and friends for their constant support and encouragement. Their never-ending love and support have given me strength and pushed me to do well in school.

Finally, I would like to thank myself for persevering and not giving up. The journey was difficult, but I am proud of myself for overcoming the challenges and reaching my academic goals.


Text-to-speech technology, also known as TTS, is a type of assistive technology that converts written text into spoken words. The overall goal of the speech synthesis research community is to create natural-sounding synthetic speech. Currently, there are many speech synthesis engines available on the market, each with its own strengths and weaknesses. Some engines focus on generating natural-sounding speech, while others focus on generating expressive speech. To increase naturalness, researchers have recently identified synthesizing emotional speech as a major research focus for the speech community. Expressive speech synthesis is the ability to convey emotions and attitudes through synthesized speech. This is achieved by adding prosodic features like intonation, stress, and rhythm to the speech waveform. To my knowledge, Vietnamese expressive speech research is scarce, and no datasets from the existing articles have been released. However, significant work remains in this field. A large, high-quality dataset is needed to investigate Vietnamese expressive speech.

This thesis (1) publishes two Vietnamese emotional speech datasets, (2) proposes a method for automatically building data, and (3) develops a model for synthesizing emotional speech. The proposed method for automatically building data helps reduce costs and time by extracting and labeling data from available data sources. Simultaneously, the applicability of the presented data is illustrated using the proposed emotional speech synthesis model.

Keywords: Speech Synthesis, Text To Speech, Expressive Speech Synthesis, Corpus Building

Student

Signature and Name


TABLE OF CONTENTS

INTRODUCTION 1

CHAPTER 1 THEORETICAL BACKGROUND 3

1.1 Speech Features 3

1.1.1 Non-emotional Features 3

1.1.2 Emotional Features 5

1.2 Speech Synthesis 6

1.2.1 Overview 6

1.2.2 Traditional Speech Synthesis Techniques 6

1.2.3 Modern Speech Synthesis Techniques 7

1.3 Expressive Speech Synthesis 10

1.3.1 Introduction 10

1.3.2 ESS Techniques 11

CHAPTER 2 BUILDING VIETNAMESE EMOTIONAL SPEECH DATASET 13

2.1 Surveys 13

2.1.1 Existing Emotion Datasets 13

2.1.2 Data Processing Techniques 14

2.2 Pipeline For Building Emotional Speech Dataset 17

2.2.1 Data Selection 18

2.2.2 Target Speech Segmentation 19


2.2.3 Text Scripting 20

2.2.4 Emotional Labeling 20

2.2.5 Post-Processing 21

2.2.6 Data Augmentation 22

2.3 Label Processing 23

2.3.1 Manual Annotation 23

2.3.2 Automatic Annotation 24

2.4 Dataset Analysis 26

2.4.1 Analysis of Pipeline Errors 26

2.4.2 Text Analysis 27

2.4.3 Emotion Analysis 28

2.5 Released Datasets 31

CHAPTER 3 EMOTIONAL SPEECH SYNTHESIS SYSTEM 32

3.1 Acoustic Model 32

3.1.1 Baseline Acoustic Model 32

3.1.2 Proposed Acoustic Model 36

3.2 Vocoder 39

3.2.1 HifiGAN Vocoder 39

3.2.2 Denoiser Module 40

CHAPTER 4 EXPERIMENTS 42

4.1 Evaluation Strategy 42

4.1.1 Evaluation Metrics 42


4.1.2 Evaluation Design 43

4.1.3 Scheme Design 44

4.2 Experimental Setup 45

4.2.1 Model Configuration 45

4.2.2 Training Settings 47

4.3 Result and Discussion 47

4.3.1 Dataset Evaluation 48

4.3.2 Model Evaluation 49

4.3.3 Discussion 50

CONCLUSION 52


LIST OF FIGURES

1.1 An example of waveform, spectrogram, and mel-spectrogram 4

1.2 Russell’s (1980) circumplex model [8] 5

1.3 An example of modern TTS architecture 7

1.4 Typical acoustic models 8

1.5 Some expressive speech synthesis techniques 11

2.1 Pipeline for building an emotional speech dataset 17

2.2 Audio post-processing 21

2.3 SER model [56] 25

2.4 F0 means in the TTH and LMH datasets 29

2.5 t-SNE visualizations of emotion embeddings in the TTH dataset 30

2.6 t-SNE visualizations of emotion embeddings in the LMH dataset 30

3.1 Pipeline for training baseline acoustic model 32

3.2 Baseline acoustic model architecture 33

3.3 Proposed acoustic model architecture 37

3.4 Detail of Emotion Encoder module 38

3.5 Proposed vocoder 39

3.6 HifiGAN model architecture [66] 39


LIST OF TABLES

2.1 Some emotional datasets 14

2.2 Pipeline errors in the LMH dataset 26

2.3 LMH dataset before and after normalization 27

2.4 Compare manual pipeline and proposed pipeline processing times 27

2.5 Syllable coverage in two datasets 28

2.6 TTH dataset 28

2.7 LMH dataset 28

4.1 Scheme setup 45

4.2 Acoustic model configuration 46

4.3 MOS score of data evaluation 48

4.4 EIR score of data evaluation 48

4.5 SUS score of data evaluation 49

4.6 MOS score in model evaluation 49

4.7 EIR score in model evaluation 50

4.8 SUS score in model evaluation 50


Notation Description

AI Artificial Intelligence


In recent years, TTS has become increasingly popular for general use, as it saves time and makes communication more accessible. One promising direction is the use of expressive speech synthesis, which aims to generate speech that conveys emotional nuances through prosody and other vocal cues. Expressive speech synthesis has the potential to revolutionize our interaction with technology by making it more natural and human-like. It is a rapidly developing field, and there have been many recent advancements in this area worldwide.

There are also a lot of technical difficulties related to expressive speech synthesis. One of the most significant issues is the requirement for huge amounts of training data. Generating speech that sounds human-like and conveys expressiveness requires a large amount of data, and collecting and classifying this data can be time-consuming and expensive. Another problem is the requirement for strong algorithms that can deal with variability in speech patterns. For example, expressive speech might differ depending on characteristics such as age, gender, and culture. Algorithms employed for expressive speech synthesis must be able to accommodate this variability and provide context-appropriate speech. It is certainly possible to record large expressive datasets and apply the same complicated models, but the huge range of languages, speakers, and expressive and affect intensities makes this an ineffective experiment. Besides that, modern expressive TTS models must make better use of the limited data they can train on and have integrated mechanisms that aid in generating expressive speech, as well as easy-to-interpret controls that are applicable in a variety of circumstances.

In Vietnam, there has also been some progress in developing expressive speech synthesis, but it is still in the early stages of development. One of the main challenges facing researchers in Vietnam is the lack of high-quality speech datasets, which can make it difficult to train accurate models for emotional speech synthesis. To the best of my knowledge, there has been little research on Vietnamese expressive speech, such as [1]–[3]; however, none of the datasets contained in these papers have been made public. Despite this, there is still room for improvement in this area. It is important to acquire a large, high-quality dataset for studying Vietnamese expressive speech.

In this thesis, when referring to expressive speech, emotional speech is specifically focused on. Emotional speech refers to the emotional state of the speaker and is conveyed through variations in tone, pitch, and volume. Examples of emotions conveyed through emotional speech include joy, sadness, anger, and fear. This type of speech helps communicate the speaker’s feelings and can also be used to elicit emotional responses from the listener.

The main contributions of this thesis include:

• Propose a semi-automatic pipeline to build a Vietnamese emotional speech dataset.

• Release two datasets of Vietnamese emotional speech built using the described pipeline. Analyze these Vietnamese emotional corpora and provide viewpoints.

• Develop a model for emotional speech synthesis that is suitable for the specified data objectives.

The thesis is organized as follows:

Chapter 1 provides an overview of speech features, speech synthesis, and expressive speech synthesis, focusing mostly on the technical side.

Chapter 2 describes existing expressive datasets and some basic data processing steps. It then presents TTH and LMH - two Vietnamese emotional speech datasets - and describes the strategy for the emotional corpus-building pipeline.

Chapter 3 presents a baseline and a proposed emotional speech synthesis model.

Chapter 4 provides experimental results on various instances. Additionally, the effectiveness of the proposed corpus-building pipeline is examined.

The Conclusion concludes the thesis and outlines future works.


CHAPTER 1 THEORETICAL BACKGROUND

This chapter provides an overview of speech features, speech synthesis, and expressive speech synthesis, focusing mostly on the technical side.

1.1 Speech Features

1.1.1 Non-emotional Features

Speech is a signal that contains a lot of information. Depending on the purpose of the analysis, useful information will be extracted and analyzed. There are various feature spaces that characterize speech data. This section provides a simple overview of the features that are commonly used in Deep Learning architectures.

a, Spectrogram

A spectrogram is a visual representation of a signal’s frequency spectrum as it changes over time [4]. In the field of speech, spectrograms are often used to analyze and modify speech data.

The spectrogram represents the frequency content of a spoken signal over time. The x-axis indicates time, while the y-axis indicates frequency. At each location on the spectrogram, the color intensity corresponds to the amplitude or strength of the corresponding frequency component.

Spectrograms are useful for judging how speech sounds, identifying its phonetic qualities, and studying how speech is produced. They are also used in speech synthesis to modify speech signals and create synthetic speech.

b, Mel Spectrogram

Based on the idea that the human ear is more sensitive to some frequencies than others, this representation compresses the higher-frequency region of the speech spectrum. The mel scale is an empirical function that describes how sensitive the human ear is to different frequencies.

The mel-spectrogram, which is based on the auditory mel-frequency scale, provides a frequency representation that better matches human perception than the linear spectrogram [5].


Figure 1.1: An example of waveform, spectrogram, and mel-spectrogram.
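As a concrete illustration (not taken from the thesis), the three representations in Figure 1.1 can be computed with the librosa library; the file name and parameter values below are arbitrary examples.

import librosa
import numpy as np

# Load a speech waveform (file name is a placeholder).
wav, sr = librosa.load("example.wav", sr=22050)

# Linear-frequency spectrogram: magnitude of the short-time Fourier transform.
spectrogram = np.abs(librosa.stft(wav, n_fft=1024, hop_length=256))

# Mel-spectrogram: the same energy mapped onto 80 mel-scale filter banks.
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Log compression is usually applied before feeding the features to a TTS model.
log_mel = librosa.power_to_db(mel)
print(spectrogram.shape, log_mel.shape)  # e.g. (513, T) and (80, T)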

c, Acoustic/Prosodic Features

Acoustic features are the physical qualities of the sound waves that the vocal tract produces [6]. These include parameters such as pitch, loudness, and duration of phonemes (units of sound). They provide information about the speaker’s emotions, attitudes, and intentions. Pitch describes the intonation of a sentence, while energy features cover the intensity of the uttered words. Duration stands for the speed of talking and the number of pauses. Two more classes that do not directly belong to prosody are articulation (formants and bandwidths) and zero crossing rate. These deduced features are obtained by measuring statistical values of their corresponding extracted contours, such as mean, median, minimum, maximum, range, and variance.

On the other hand, prosodic features are the patterns of stress, tone, and rhythm in speech [7]. These features play an important role in conveying meaning and emotion in human communication. In speech synthesis, prosodic features need to be carefully modeled and synthesized in order to create realistic-sounding synthetic speech.
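For illustration only, the pitch and energy contours and their statistics mentioned above could be extracted as in the sketch below (librosa-based; the file name, frequency range, and chosen statistics are assumptions, not the thesis’s exact feature set).

import librosa
import numpy as np

wav, sr = librosa.load("utterance.wav", sr=22050)  # placeholder file name

# Fundamental frequency (F0) contour via pYIN; unvoiced frames are returned as NaN.
f0, _, _ = librosa.pyin(wav, fmin=librosa.note_to_hz("C2"),
                        fmax=librosa.note_to_hz("C7"), sr=sr)

# Frame-level energy (RMS) contour.
energy = librosa.feature.rms(y=wav)[0]

# Deduced features: statistics of the extracted contours.
f0_voiced = f0[~np.isnan(f0)]
stats = {
    "f0_mean": float(np.mean(f0_voiced)),
    "f0_range": float(np.max(f0_voiced) - np.min(f0_voiced)),
    "energy_mean": float(np.mean(energy)),
    "energy_var": float(np.var(energy)),
}
print(stats)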


1.1.2 Emotional Features

Valence and arousal are two key features used to describe emotions in speech [8]. Arousal refers to the intensity or level of activation of an emotion, while valence refers to the positive or negative nature of the emotion.

Figure 1.2: Russell’s (1980) circumplex model [8]

Arousal is a measure of how strong or active an emotion is, and it can range from low to high. Excitement is an example of a high-arousal emotion, while calmness is an example of a low-arousal emotion. Arousal plays a significant role in various areas, including cognitive performance and decision-making. Moderate levels of arousal can enhance cognitive performance, while high levels of arousal can impair it. High arousal can also promote risk-taking behavior.

Valence ranges from positive to negative, with positive emotions being pleasant and desirable and negative emotions being unpleasant and undesirable. Examples of positive emotions include happiness and love, while examples of negative emotions include sadness and fear.

A combination of arousal and valence can be used to describe a wide range of emotions. Excitement is an example of high arousal and positive valence, while sadness is an example of low arousal and negative valence.

Understanding the arousal and valence of emotions is useful in various contexts, such as psychology research and the development of products or services that aim to elicit specific emotional responses from users.


1.2 Speech Synthesis

1.2.1 Overview

Text-to-Speech (TTS) technology, often known as Speech Synthesis, is the process of transforming written text into spoken words. The earliest TTS systems were created in the 1950s [9]; therefore, this technology has been present for several decades. Recent advancements in Artificial Intelligence (AI) and Natural Language Processing (NLP) technology have made TTS more advanced and popular.

TTS systems work by analyzing written text and translating it into spoken words that a listener can hear. The process involves several steps, such as text analysis, phonetic transcription, and speech synthesis. Firstly, the text analysis step breaks down the written text into smaller chunks, such as sentences or phrases, which can then be analyzed for their meaning and context. Secondly, the phonetic transcription step converts the written words into their corresponding sounds, using a set of rules or a pre-defined dictionary. Finally, the speech synthesis step uses these sounds to produce audibly spoken words for a listener.

Speech synthesis, or TTS, has many applications across various domains. One of the most important is accessibility, providing speech output for people with visual impairments or reading difficulties. TTS is also used in education to create audio textbooks and other educational materials. Additionally, TTS is used in entertainment for voiceovers and character dialogue in video games, animations, and other media. Personal assistants such as Siri and Google Assistant also use TTS technology to provide voice responses to user queries. TTS is even used in customer service, in interactive voice response systems that provide automated voice responses to customer queries, reducing the burden on human resources and speeding up the process.

1.2.2 Traditional Speech Synthesis Techniques

The early techniques of speech synthesis were based on rule-based systems. These systems were designed to convert text into speech by following a set of predefined rules. The resulting speech was often robotic and lacked naturalness.

In the 1980s, a new technique called concatenative synthesis was introduced [10], [11]. Concatenative synthesis involved breaking down recordings of human speech into small units, such as phonemes or diphones, and then recombining them to create new speech. The lack of high-quality speech recordings limited this method, even though it produced more natural-sounding speech.

In the 1990s, the use of neural networks changed the way speech synthesis was done. Neural networks are a set of computer algorithms that model the inner workings of the human brain. They are designed to find patterns and links in data, and they can be taught to do many things, such as speech synthesis [12].

In the early 2000s, researchers began exploring the use of neural networks for speech synthesis [13]. Neural TTS is a technique for speech synthesis based on deep learning that employs neural networks to synthesize speech. The neural network is trained on a huge corpus of speech recordings and text transcripts. The network then learns how to map text to speech by predicting the acoustic characteristics of speech from the corresponding text. The potential of neural TTS to generate more natural and expressive speech is one of its advantages. By training on a large corpus of speech data, neural TTS is able to capture differences in speech patterns and intonation. Additionally, it can generate speech with various emotions and speaking styles.

1.2.3 Modern Speech Synthesis Techniques

Speech synthesis has come a long way since the time when it was associated with robotic and mechanical speech. Due to improvements in deep learning, deep learning-based speech synthesis has been developed to generate more human-like speech. In deep learning-based synthesis, huge amounts of data are given to artificial neural networks to create speech.

Typically, modern speech synthesis models consist of two major components: the acoustic model and the vocoder. The acoustic model derives acoustic features from linguistic data or directly from phonemes or characters, whereas the vocoder generates acoustic signals based on the acoustic features received from the acoustic model.

Figure 1.3: An example of modern TTS architecture (text → acoustic model → speech features → vocoder → synthetic speech)
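To make the two-stage architecture concrete, the following minimal sketch shows how inference typically flows from text to waveform. The classes, shapes, and G2P function here are placeholders, not the thesis’s implementation; a real system would load trained acoustic-model and vocoder checkpoints instead.

import numpy as np

# Placeholder components standing in for a trained acoustic model and vocoder.
class AcousticModel:
    def infer(self, phoneme_ids):
        # Predict an 80-band mel-spectrogram (here: 4 frames per phoneme, zeros only).
        return np.zeros((80, 4 * len(phoneme_ids)), dtype=np.float32)

class Vocoder:
    def infer(self, mel):
        # Convert the mel-spectrogram into a waveform (256 samples per frame here).
        return np.zeros(mel.shape[1] * 256, dtype=np.float32)

def synthesize(text, g2p, acoustic, vocoder):
    """Front end -> acoustic model -> vocoder, as in Figure 1.3."""
    phoneme_ids = g2p(text)            # text analysis and G2P
    mel = acoustic.infer(phoneme_ids)  # acoustic feature prediction
    return vocoder.infer(mel)          # waveform generation

wav = synthesize("xin chào", lambda t: list(range(len(t.split()))), AcousticModel(), Vocoder())
print(wav.shape)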


a, Acoustic Model

As TTS development progressed, various acoustic models were adopted, including HMM, DNN, and sequence-to-sequence models. The most recent feed-forward networks are used for parallel generation.

Acoustic models generate the acoustic features that are consumed by vocoders. TTS pipeline designs depend on the acoustic features selected, such as mel-cepstral coefficients (MCC), bark-frequency cepstral coefficients (BFCC), mel-generalized coefficients (MGC), band aperiodicity (BAP), voiced/unvoiced (V/UV) flags, fundamental frequency (F0), and mel-spectrograms [14].

Figure 1.4: Typical acoustic models (SPSS models and neural-based models: RNN-based, CNN-based, and Transformer-based)

Accordingly, the acoustic models can be divided into two categories: (1) acoustic models in statistical parametric speech synthesis (SPSS) that predict acoustic features such as MGC, BAP, and F0 from linguistic features, and (2) acoustic models in neural-based end-to-end TTS that predict acoustic characteristics such as mel-spectrograms from phonemes or characters.

SPSS models use statistical models like HMM and RNN to turn linguistic features into speech parameters. The generated speech parameters are converted into speech waveforms using vocoders like STRAIGHT [15] and WORLD [16]. The development of these acoustic models is directed by different aspects, including integrating more contextual input information, modeling the correlation between output frames, and better handling the over-smoothing prediction problem [14].

Compared to SPSS models, acoustic models in neural-based end-to-end TTS have significant advantages. Neural models implicitly learn alignments through attention or jointly predict the duration, making them more comprehensive and requiring less preprocessing. Furthermore, as the modeling power of neural networks increases, linguistic features are reduced to character or phoneme sequences, and acoustic features have shifted to high-dimensional mel-spectrograms or even higher-dimensional linear spectrograms.


• RNN-based models (e.g., the Tacotron series [17], [18]) use an encoder-attention-decoder structure to generate linear spectrograms from characters. This approach significantly improves speech quality compared to previous methods such as concatenative TTS, parametric TTS, and neural TTS.

• CNN-based models (like the DeepVoice series [19]) use convolutional neural networks to generate waveforms. DeepVoice 3 [20] employs a more compact sequence-to-sequence model and directly predicts mel-spectrograms, rather than complicated linguistic features like DeepVoice 1 [19] and 2 [21].

• Transformer-based models, such as the FastSpeech series [22], [23] and TransformerTTS [24], use a Transformer-based encoder-attention-decoder architecture to create mel-spectrograms from phonemes. However, the Transformer’s encoder-decoder attentions are less robust than those of RNN-based models like Tacotron, which use stable attention techniques like location-sensitive attention.

Earlier neural-based acoustic models like Tacotron 1/2, DeepVoice 3, and TransformerTTS used autoregressive generation but had several problems. Firstly, generating mel-spectrograms auto-regressively was slow, especially for lengthy speech sequences. Secondly, encoder-attention-decoder-based auto-regressive generation often resulted in significant word skipping and repeating due to erroneous attention alignments between text and mel-spectrograms.

FastSpeech [22] is a text-to-speech solution that quickly generates mel-spectrograms in parallel using a feed-forward Transformer network. It avoids word skipping and repetition issues by removing the attention mechanism and using a length regulator. This allows for fast and robust speech synthesis. FastSpeech 2 [23] improves on FastSpeech by using ground-truth mel-spectrograms as training targets and including additional variance information. This results in higher voice quality while maintaining speed and controllability.
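As an aside (not code from the thesis), the length regulator at the heart of FastSpeech can be sketched in a few lines of PyTorch: each phoneme’s hidden vector is simply repeated for its predicted number of mel frames. The sizes below are toy values.

import torch

def length_regulator(hidden, durations):
    """Expand phoneme-level hidden states (num_phonemes, dim) to frame level
    according to per-phoneme durations (num_phonemes,)."""
    # repeat_interleave copies each phoneme state durations[i] times along dim 0.
    return torch.repeat_interleave(hidden, durations, dim=0)

# Toy example: 3 phonemes, hidden size 4, predicted durations of 2/1/3 frames.
hidden = torch.randn(3, 4)
durations = torch.tensor([2, 1, 3])
print(length_regulator(hidden, durations).shape)  # torch.Size([6, 4])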

b, Vocoder

There have been two main stages in the development of vocoders: those used in SPSS and those based on neural networks. Prominent SPSS vocoders include STRAIGHT and WORLD. Early neural vocoders such as Char2Wav [25], WaveNet [26], and WaveRNN [27] generate waveforms using linguistic features as input. Later, WaveGlow [28], FloWaveNet [29], MelGAN [30], and Parallel WaveGAN [31] take mel-spectrograms as input and generate waveforms. Due to the length of the speech waveform, auto-regressive waveform generation requires considerable inference time. Thus, waveform generation employs generative models such as Flow, VAE, GAN, and DDPM [32]. Accordingly, the neural vocoders can be categorized as follows: auto-regressive, flow-based, GAN-based, VAE-based, and diffusion-based.

1.3 Expressive Speech Synthesis

1.3.1 Introduction

Expressive speech synthesis (ESS) is the process of creating speech that conveys emotion. Expressive speech synthesis has many applications. In the entertainment industry, it can create more realistic and engaging virtual characters that become more relatable by speaking emotionally. This can improve user immersion and revolutionize the gaming industry. In healthcare, expressive speech synthesis helps speech-impaired people express their emotions, which can improve mental health. It also allows speech-impaired people to communicate with family, doctors, and caregivers. In education, students can learn to recognize and interpret spoken expressive cues through generated expressive speech. This boosts expressive intelligence and communication, particularly for second-language learners.

Over the years, this technology has evolved significantly. It is now possible to generate speech that sounds very human-like, with a Mean Opinion Score (MOS) of 4.43 [33]. However, artificial speech still has only one tone and cannot be applied to some applications that require changes in tone, such as those mentioned above. Not many ESS publications have been widely applied to actual products. For Vietnamese, a few studies can be mentioned, such as [2], [34], [35].

There are also several technical challenges associated with emotional speech synthesis. One challenge is the need for robust algorithms that can handle variability in speech patterns. For instance, expressive speech can vary based on factors such as age, gender, and culture. Algorithms used for expressive speech synthesis must be able to handle this variability and generate speech that is appropriate for the context. Another challenge is the need for large amounts of training data. Generating speech that sounds human-like and conveys expressiveness requires a lot of data, and collecting and labeling this data can be time-consuming and expensive. Some other papers refer to expressive Vietnamese speech datasets, such as [1], [3], but these datasets are not publicly available, making Vietnamese ESS more difficult to access.


Expressive speech clearly conveys an idea or message. It conveys meaning through tone, volume, and emphasis. Expressive speech is used in public speaking, presentations, and instructions. Emotional speech, on the other hand, expresses emotions. Emotional speech is used in social situations to express happiness, sadness, anger, or excitement. Emotional speech conveys the speaker’s emotions and connects with the audience. When referring to expressive speech in this thesis, emotional speech is focused on.

There are several emotions in the domain of speech, including but not limited to: Happiness, Sadness, Anger, Fear, Disgust, and Surprise. These six emotions are commonly used in the study of speech. The number of emotions that can be synthesized in speech varies depending on the technology and the level of complexity involved [36].

1.3.2 ESS Techniques

The purpose of TTS is to generate natural and understandable speech. The expressiveness of the synthetic speech, which depends on a number of variables including the speech’s content, timbre, prosody, emotion, and style, is crucial to the naturalness of the speech. Research on expressive TTS includes modeling, disentangling, manipulating, and transferring content, timbre, prosody, style, and emotion, among other issues [37].

To enhance the expressiveness of synthesized speech, it is crucial to input and model variation information. This information, which includes timbre, style, accent, speaking rate, etc., can be altered during inference to control the synthesized speech [38]. By providing information related to a different style, the speech can be adapted to that style.

Figure 1.5: Some expressive speech synthesis techniques

One way to do expressive speech synthesis is to train a TTS model with a labeled database that includes emotional categories. The style label gives the TTS model conditional information to produce speech with the same style. However, this approach can only learn an average representation for each style, making it impossible to develop different speech styles for texts in the same category. Humans can transmit the same style of speech with subtle variations, but models cannot. Another technique to accomplish expressive synthesis is to use additional knowledge as input to improve the models. Users can collect this information in various ways, such as retrieving language and speaker identifiers, style, and prosody from labeled data, or extracting pitch and energy from speech and duration from paired text and speech data.

In some situations, explicit labels may not be provided, or labeling may require a lot of human effort and may not include detailed or fine-grained variation information. In such instances, implicit modeling techniques can be used. Here are some common examples [14]:

• Reference Encoder: [39] proposed a method to model prosody using a reference encoder that extracts prosody embeddings from reference audio and feeds them into the decoder (a minimal sketch is given at the end of this subsection). The reference audio is used to synthesize speech with identical prosody. Style tokens can also be utilized to synthesize speech.

• Variational Autoencoder: [40] used the VAE to model variance information in the latent space, with a Gaussian prior for regularization. This approach enables expressive modeling and control of synthesized styles. Other works have also leveraged the VAE framework to better model variance information for expressive synthesis.

• Advanced Generative Models: Advanced generative models can help solve the one-to-many mapping problem and prevent over-smooth predictions. These models implicitly learn variation information and better model multi-modal distributions.

An alternative method is predicting speech style from input text. Compared to label-based and reference audio-based approaches, directly predicting speaking style from input text is more practical and adaptable. This method allows the TTS system to avoid relying on manual labels or reference audio during inference. Text-Predicted Global Style Token, a token extension proposed in [41], predicts style rendering from text alone. This enables automatic style speech generation without explicit labels or reference audio. Recent studies have attempted to predict styles at a finer level, such as the phoneme or word level, instead of the global level.
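To make the reference-encoder idea above more tangible, the following sketch (illustrative only; the layer types and sizes are my own assumptions, not the architecture of [39]) compresses a reference mel-spectrogram into a fixed-size prosody embedding that is added to the text encoder outputs.

import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Compress a reference mel-spectrogram into a fixed-size prosody embedding."""

    def __init__(self, n_mels: int = 80, embed_dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(input_size=n_mels, hidden_size=embed_dim, batch_first=True)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, ref_mel):
        # ref_mel: (batch, frames, n_mels); the final GRU state summarizes prosody.
        _, last_state = self.gru(ref_mel)
        return torch.tanh(self.proj(last_state.squeeze(0)))  # (batch, embed_dim)

# The embedding is broadcast over the phoneme axis and added to the encoder outputs.
encoder_out = torch.randn(2, 37, 128)                  # (batch, phonemes, hidden)
prosody = ReferenceEncoder()(torch.randn(2, 200, 80))  # reference audio features
conditioned = encoder_out + prosody.unsqueeze(1)
print(conditioned.shape)  # torch.Size([2, 37, 128])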


CHAPTER 2 BUILDING VIETNAMESE EMOTIONAL SPEECH DATASET

This chapter first discusses expressive datasets and basic data processing methods. Then, it presents an approach for creating an emotional corpus development pipeline and highlights the first two public Vietnamese emotional speech datasets, TTH and LMH, built using this pipeline. These datasets have two main purposes: to create a Vietnamese speech synthesis system based on an emotional corpus and to serve as a resource for other multi-speaker emotional Vietnamese speech datasets.

2.1 Surveys

2.1.1 Existing Emotion Datasets

There are numerous methods for creating emotional datasets, depending on their intended application. In general, the speech corpora used to generate emotional speech can be split into three categories: acted (simulated), elicited (evoked/induced), and natural (spontaneous) speech, each of which has its own advantages and disadvantages.

Natural speech datasets can be obtained most readily from television chat shows, interviews, films, vlogs, and call centers. This speech data has the advantage of containing the most realistic and authentic emotions, since the speaker is in a natural state. Unfortunately, this method’s recording quality is problematic, and emotional displays are typically spontaneous (the phonetic coverage would be limited).

Recording with professional actors or voice talents makes it easier to control the performed speech than natural speech. This type of recording could inflate the emotions, and it could be noticed that they are not real; however, the listener can perfectly recognize them. Most of the emotional databases collected are of this type.

In elicited speech, speakers are placed in a specific emotional situation, and their speech is recorded afterward. Different speakers react differently to the same situation, so one cannot be sure of the type of emotion that will be recorded.

Table 2.1 details some popular emotional datasets used in emotional speech research.

Table 2.1: Some emotional datasets

1. IEMOCAP (English) - neutral, angry, happy, sad, surprise - 12.5 hours, 10,039 utterances, 10 speakers (5M, 5F) - Acted and Evoked - Not public (restricted). The dataset includes a variety of utterances, a rich vocabulary, multi-modal input, and easy-to-interpret emotional annotations.

2. EMO-DB (German) - neutral, angry, anxiety, boredom, disgust, happy, sad - 535 utterances, 10 speakers (5M, 5F) - Acted - Public.

3. RAVDESS (English) - neutral, angry, calm, disgust, fearful, happy, sad, surprise - 7,356 utterances, 24 speakers (12M, 12F) - Acted - Public. Ryerson Audio-Visual Database of Emotional Speech and Song. Each emotion (except for neutral) has two levels of intensity: normal and strong. Emotions are easily recognizable, and the same sentence can be spoken in different tones. The vocabulary is limited, and the recorded sentences are not sufficient for expressive TTS.

4. SAVEE (English) - neutral, angry, disgust, fearful, happy, sad, surprise - 480 utterances, 4 speakers (all M) - Acted - Public. The Surrey Audio-Visual Expressed Emotion Database. Not suitable for expressive TTS because the number of recorded sentences is limited (12 sentences per speaker per emotion).

5. CREMA-D (English) - neutral, angry, disgust, fearful, happy, sad - 7,442 utterances, 91 speakers (48M, 43F) - Acted - Public. Crowd-sourced Emotional Multimodal Actors Dataset.

6. TESS (English) - neutral, angry, disgust, fearful, happy, pleasant, sad, surprise - 2,800 utterances, 2 speakers (all F) - Acted - Public. Toronto Emotional Speech Set.

7. CHEAVD (Chinese) - neutral, angry, disgust, fearful, happy, sad, surprise - 2.3 hours, 238 speakers - Natural - Not public. Audio-visual database extracted from films, TV plays, and talk shows. Speakers range in age from children to the elderly. Includes multi-emotion labels and fake/suppressed emotion labels.

8. ShEMO (Persian) - neutral, angry, fearful, happy, sad, surprise - 3.5 hours, 3,000 utterances, 87 speakers - Natural - Public. Sharif Emotional Speech Database, extracted from online radio.

9. EMOVO (Italian) - neutral, angry, disgust, fearful, happy, sad, surprise - 1 hour, 588 utterances, 6 speakers - Acted - Public.

Although there are numerous open-source emotional datasets available, none of them are sufficient for expressive speech synthesis, to the best of my knowledge. Some of the limitations of emotional speech databases are briefly mentioned below:

• Most emotional speech databases do not simulate emotions in a natural and clear way, as evidenced by the relatively low recognition rates of human subjects.

• In some databases, the quality of the recorded utterances is poor, and the sampling frequency is low.

• Phonetic transcriptions are not provided with some databases, making it difficult to extract linguistic content from the utterances.

2.1.2 Data Processing Techniques

Processing data is an important part of any NLP project, and speech synthesis models are a big part of that. To clarify the data processing techniques used in the dataset construction process, this subsection will describe some common stages involved in processing TTS data.

a, Text Processing

• Text cleaning: Text cleaning is an important step in text-to-speech processing that ensures accurate and high-quality output. Some common steps for text cleaning include removing unnecessary characters or symbols, correcting spelling and grammatical errors, and converting text to the appropriate format for speech synthesis. Additionally, it is important to consider the intended audience and adjust the text accordingly for factors such as tone and formality.

• Sentence segmentation: Sentence segmentation is a crucial step in speech synthesis that involves breaking down a long text into shorter utterances or sentences. This ensures that the synthesized speech sounds more natural and is easier to understand. In this step, the original text is broken up into short phrases based on the punctuation marks at the end of each sentence.

• Text normalization: Text normalization involves converting non-standard written text into spoken-form words, making them easier to pronounce for TTS models. For example, the year "2023" is normalized to "hai không hai ba," and "22/04" is normalized to "hai hai tháng tư" (a toy sketch covering this step and G2P lookup follows this list). Early text normalization techniques were rule-based, but later neural networks were used to model text normalization as a sequence-to-sequence task, with the source and target sequences being non-standard words and spoken-form words. Recently, some works have proposed combining the advantages of both rule-based and neural-based models to further improve the performance of text normalization.

• G2P conversion: Grapheme-to-phoneme (G2P) conversion is the process of converting characters (graphemes) into pronunciations (phonemes). This can greatly ease speech synthesis. For example, the word "tiếng" is converted into "t ie5 ngz". A manually collected grapheme-to-phoneme lexicon is usually used for conversion. However, for alphabetic languages like Vietnamese, the lexicon cannot cover the pronunciations of all words. Therefore, G2P conversion for Vietnamese is mainly responsible for generating the pronunciations of out-of-vocabulary words.
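The sketch below illustrates the two steps just described; it is illustrative only (the digit and month tables, the single lexicon entry, and the fallback token are hypothetical examples, not the thesis’s actual front end).

import re

# Toy normalization rules (not a full Vietnamese front end).
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]
MONTHS = {"04": "tư"}  # a few month names differ from the plain digit reading

def normalize(text):
    # "22/04" -> "hai hai tháng tư"
    def date(m):
        day = " ".join(DIGITS[int(d)] for d in m.group(1))
        return f"{day} tháng {MONTHS.get(m.group(2), m.group(2))}"
    text = re.sub(r"\b(\d{1,2})/(\d{2})\b", date, text)
    # "2023" -> "hai không hai ba"
    text = re.sub(r"\b\d{4}\b", lambda m: " ".join(DIGITS[int(d)] for d in m.group(0)), text)
    return text

# Toy grapheme-to-phoneme lexicon; unknown syllables would need a trained G2P model.
LEXICON = {"tiếng": "t ie5 ngz"}

def g2p(sentence):
    return [LEXICON.get(syl, "<oov>") for syl in normalize(sentence).lower().split()]

print(normalize("22/04 2023"))  # "hai hai tháng tư hai không hai ba"
print(g2p("tiếng 2023"))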

After the text has been analyzed as above, linguistic features will be used as input in later stages of the TTS pipeline, such as the acoustic model. Generally, linguistic features are constructed by combining the results of text analysis at many levels, including the phoneme, syllable, word, phrase, and sentence levels.

b, Audio Processing

• Audio normalization: Audio normalization is the process of adjusting the volume levels of an audio file to a consistent level. During the recording of audio, volume levels may vary due to factors such as microphone placement, the recording environment, and distance from the source. This variation in volume levels can be distracting and affect the overall quality of the recording. Techniques like peak, RMS, and LUFS normalization can be used to achieve this. Normalizing audio levels can improve the sound quality, making the recording more pleasant to listen to and easier to edit.

• Noise reduction: Noise reduction removes unwanted background noise from audio recordings, improving clarity and intelligibility. Background noise can come from various sources like electrical interference, air conditioning, and outside sounds. There are two main techniques: spectral subtraction and adaptive filtering. Spectral subtraction subtracts the noise spectrum from the audio spectrum, while adaptive filtering analyzes the audio signal and creates a filter to remove unwanted noise.

• Audio segmentation: Audio segmentation is a crucial step in speech processing. It involves dividing an audio signal into meaningful segments or regions based on specific criteria. The goal of audio segmentation is to identify different parts of the speech signal that correspond to different phonemes, words, or phrases. Audio segmentation is useful in many speech-processing applications. In speaker recognition, segmenting the speech signal into speaker-specific regions can help identify the speaker more accurately. In emotion recognition, segmenting the speech signal into emotional regions can help identify the emotional state of the speaker.

• Trim silence: During the audio recording process, there is often an initial period of silence before the actual speech or sound begins and a period of silence at the end of the recording after the speech or sound has ended. These periods of silence can negatively impact the overall quality of the audio files. To address this issue, it is common practice to trim the silence from the beginning and end of the audio files. By trimming the silence, the resulting audio files have a cleaner sound and better overall quality (a short sketch of silence trimming and peak normalization follows this list).
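The sketch below (illustrative, not the thesis’s exact post-processing scripts) uses librosa to trim leading and trailing silence with an amplitude threshold and then peak-normalizes the result; the 30 dB threshold and the target peak level are arbitrary choices.

import librosa
import numpy as np
import soundfile as sf

def trim_and_normalize(in_path, out_path, top_db=30.0, peak_db=-3.0):
    wav, sr = librosa.load(in_path, sr=22050)

    # Remove leading and trailing silence below `top_db` relative to the peak.
    trimmed, _ = librosa.effects.trim(wav, top_db=top_db)

    # Peak normalization: scale so the loudest sample sits at `peak_db` dBFS.
    peak = np.max(np.abs(trimmed))
    if peak > 0:
        trimmed = trimmed * (10 ** (peak_db / 20) / peak)

    sf.write(out_path, trimmed, sr)

trim_and_normalize("raw_clip.wav", "clean_clip.wav")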

2.2 Pipeline For Building Emotional Speech Dataset

Figure 2.1: Pipeline for building an emotional speech dataset (data selection → target speech segmentation → ASR and emotional labeling → post-processing, producing transcripts, emotion labels, and clean audio)

For low-resource languages like Vietnamese, as far as I know, there has not been a publicly available emotional speech dataset for research purposes in the speech domain. This is because collecting expressive speech data is challenging. It is harder to find and of lower quality compared to regular TTS data. Additionally, the speech contains various acoustic variations, making it difficult to have sufficient training data for all emotions and tones, and not all emotions can be effectively synthesized.

Therefore, I propose a method to construct a Vietnamese emotional speech dataset for the research community and also publish two emotional speech datasets (TTH and LMH) with some differences in their data processing pipelines. The TTH dataset is a hybrid combination of automatic and manual labeling, while the LMH dataset is constructed automatically, with minimal human editing.

The procedures for doing so are outlined below.


2.2.1 Data Selection

The first step in creating a corpus is to collect audio data. This can be obtained from various sources, such as recorded speeches, phone conversations, or other audio files. The quality of the audio data is an important factor that affects the accuracy of the TTS model.

For this research project, the natural speech approach to building Vietnamese emotional speech datasets has been chosen. This approach has a number of advantages. First, it allows for the retention of natural emotions. Second, the necessary data is easily accessible and immediately available, saving time and money.

To design a speech corpus for Vietnamese, certain criteria must be defined. The corpus will be used to study prosody and pronunciation and to build expressive speech synthesis models. The following requirements have been identified:

• Large quantity of data from one speaker

• Good sound quality and consistent speech

• Varied discourse styles and literary genres

• Emotion conveyed in speech

• Consistent timbre

After a number of experiments and trials, it was decided that the main focus should be on the recordings of Tang Thanh Ha (TTH), a Vietnamese actress known for her roles in acclaimed films shown on national TV. The majority of her films are psychological dramas produced during the period 2008-2013, when the technique of directly recording the actor's speech, without dubbing or mixed-in music, was predominantly used.

To ensure that my research is comprehensive and thorough, all speech data of TTH from all 48 of her movies, interviews, and talk shows from ages 20-24 were collected. This ensures consistent data without too many variations in speech that could affect analysis and interpretation.

The main focus of the research was on TTH, but data for another actor, Luong Manh Hai (LMH), whose profile is very similar to that of TTH, was also obtained. LMH's data came from a large number of films he acted in when he was 27 to 29 years old.


Data crawling collects the highest-quality video clips available. The audio is then extracted to a WAV file using the "SoX" command. Due to the diversity of the sources from which the data is gathered, there are numerous variations in acoustic conditions. Therefore, all audio is converted to a sampling rate of 22,050 Hz and 16 bits. The next step is to preprocess the acquired data.
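As an illustration of this conversion step (file and directory names are placeholders, and the mono down-mix is an assumption not stated in the thesis), the SoX command line can be driven from Python:

import subprocess
from pathlib import Path

def extract_to_wav(in_path, out_path):
    """Convert an audio file to 16-bit mono WAV at 22,050 Hz using SoX."""
    cmd = [
        "sox", in_path,   # input audio file
        "-b", "16",       # 16-bit samples
        "-c", "1",        # mono
        out_path,         # output WAV file
        "rate", "22050",  # resample to 22,050 Hz
    ]
    subprocess.run(cmd, check=True)

out_dir = Path("processed")
out_dir.mkdir(exist_ok=True)
for audio_file in Path("crawled_audio").glob("*.wav"):
    extract_to_wav(str(audio_file), str(out_dir / f"{audio_file.stem}_22k.wav"))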

2.2.2 Target Speech Segmentation

Once audio with multiple speakers, silence, and background noise has been collected, the audio must be extracted into segments containing only the speech of the target speaker. This process requires Target Speech Segmentation.

In data processing, Target Speaker Segmentation involves separating a specific speaker's speech from background noise and other speakers. Speaker Diarization divides a conversation with multiple speakers into segments spoken by the same speaker. Speaker Verification verifies a speaker's identity based on their speech characteristics. For this thesis, audio segments of the target speaker are obtained using Speaker Diarization and Speaker Verification.

The first step in the pipeline involves segmenting the 5-45 minute-long audio files into smaller, more manageable 1-10 second-long audio clips. This is achieved through the use of a pre-trained Speaker Diarization model [42], which is capable of isolating human speech from other background noises. The resulting audio clips contain only the human speech that is of interest for further processing.

In the second step, these audio clips are passed through a pre-trained Speaker Verification model [43], which is used to identify the target speaker's audio segments. This is achieved by manually extracting 10- to 120-second audio segments containing only the target speaker's speech, which are then passed through the Speaker Verification model. This model is trained to recognize the unique characteristics of the target speaker's speech and is able to identify their speech segments with a high degree of accuracy.

The audio segments containing the target speaker's speech are then obtained. To ensure data quality, files containing the speech of another speaker are manually found and excluded from further processing.
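The sketch below shows one possible way to wire these two steps together using off-the-shelf tools (pyannote.audio for diarization, SpeechBrain for verification, pydub for slicing). These libraries, model names, file paths, and the decision rule are assumptions for illustration; the thesis's actual pre-trained models [42], [43] are not identified here.

import os
from pyannote.audio import Pipeline
from speechbrain.pretrained import SpeakerRecognition
from pydub import AudioSegment

# Example pre-trained models (may require authentication and downloads).
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization")
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained/ecapa"
)

ENROLLMENT = "target_speaker_sample.wav"  # manually extracted clip of the target speaker
audio = AudioSegment.from_wav("episode.wav")
os.makedirs("clips", exist_ok=True)

kept = []
# Step 1: diarize the long recording into speaker-homogeneous turns.
for i, (turn, _, _) in enumerate(diarizer("episode.wav").itertracks(yield_label=True)):
    clip_path = f"clips/turn_{i:04d}.wav"
    audio[int(turn.start * 1000):int(turn.end * 1000)].export(clip_path, format="wav")

    # Step 2: keep only turns verified as the target speaker.
    score, is_target = verifier.verify_files(ENROLLMENT, clip_path)
    if bool(is_target):
        kept.append(clip_path)

print(f"{len(kept)} segments attributed to the target speaker")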


2.2.3 Text Scripting

In a speech synthesis system, the corresponding text is essential for generating the speaker's artificial speech. In this step, an ASR model is used to convert spoken language into text scripts.

Automatic Speech Recognition (ASR) is a technology that enables computers to recognize and transcribe spoken language into text. By using ASR, companies can quickly process and analyze large amounts of audio data, which would be impossible to do manually. Popular ASR services include Google Speech-to-Text, Amazon Transcribe, and Microsoft Azure Speech Services.

Several articles on Vietnamese automatic speech recognition (ASR) models with high accuracy have been published, including [44]–[46]. At this stage of the thesis, a state-of-the-art pre-trained Vietnamese ASR model [47] was chosen. The model was trained on a dataset that includes 5,000 YouTube speakers with 2,000 hours of speech data in an environment similar to that in my thesis. The diversity of the training dataset ensures that the model can handle a wide range of speech inputs, making it suitable for various real-world applications. Additionally, this model achieved a Syllable Error Rate (SyER) of 4.17% and ranked first in the 8th International Workshop on Vietnamese Language and Speech Processing (VLSP 2021).

After completing this step, a text file containing the transcript of each audio file is created. The text file may have recognition mistakes due to noise, distortion, and accents.

2.2.4 Emotional Labeling

An emotional speech dataset is incomplete without emotion labels.

The first thing to determine is which emotion labels will be included in the dataset. In this thesis, the emotions chosen are neutral, happy, sad, and angry. These four emotions were selected because they are prevalent in the majority of the world's datasets [48]. By choosing these emotions, the classification process can be simplified, and the dataset can be representative of the most common emotions.

In order to accurately label each audio file, a thorough analysis of the speech data is necessary. This involves listening to and transcribing each audio file, as well as identifying the emotional content. Once the emotion labels have been assigned, the data can be used for further analysis and research in the field of emotional speech recognition.

Emotional labeling is a complex process, and finding a consensual definition is a challenging task. This stage will be discussed in further detail in Section 2.3.

2.2.5 Post-Processing

Figure 2.2: Audio post-processing (raw audio → noise removal → amplification → trimming → cleaned audio)

After the aforementioned four steps, the dataset contains corresponding audio files, text scripts, and emotion labels. However, to be used effectively in speech synthesis systems, the data must undergo further processing to achieve optimal quality.

The text in the dataset, extracted from the ASR model, does not require segmentation or normalization, only occasional correction of typos or inconsistencies caused by ASR errors. My experiments showed many long periods of silence in the audio without any corresponding prosodic punctuation, negatively impacting TTS performance. To address this issue, punctuation was inserted into the transcript at the positions of the corresponding silences. Using the aligned time stamps provided by the ASR system, the duration of each internal silence is calculated, and prosodic punctuation is inserted into the transcript at the corresponding silence positions. The silence duration threshold for inserting punctuation is set to 0.35 seconds.
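A minimal sketch of this punctuation-insertion rule is shown below; the word-level timestamp format is a simplifying assumption (the exact ASR output format used in the thesis is not specified here).

def insert_prosodic_punctuation(words, threshold=0.35):
    """words: list of (word, start_sec, end_sec) tuples from ASR alignment (assumed format).
    Insert a comma wherever the silence between two consecutive words exceeds `threshold` seconds."""
    pieces = []
    for i, (word, start, end) in enumerate(words):
        pieces.append(word)
        if i + 1 < len(words) and words[i + 1][1] - end >= threshold:
            pieces[-1] = word + ","
    return " ".join(pieces)

example = [("hôm", 0.00, 0.20), ("nay", 0.22, 0.45), ("trời", 1.10, 1.35), ("đẹp", 1.37, 1.60)]
print(insert_prosodic_punctuation(example))  # "hôm nay, trời đẹp"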

Because the two speakers use the Southern dialect, the text script was adjusted so that the spelling of words closely matches their actual pronunciation. For example, the sound 'v' is pronounced as 'dz' ('và' → 'dzà', 'vâng' → 'dzâng'), and some words are tone-masked ('ông' → 'ổng'). In order to help the artificial speech retain the regional characteristics, as well as some of the unique characteristics of the speaker, these pronunciations were reflected in the corrected text script.

After this step, all the texts consist of the 29 letters of the Vietnamese alphabet and punctuation marks, such as periods and commas.

To process the audio data, silence trimming was performed initially. For this, a simple script was created to trim the audio based on a sound amplitude threshold. To ensure the accuracy of the trims, the audio data were manually double-checked using the Audacity audio software. This step ensured that the silences were eliminated correctly.

The dataset consists of spontaneous data captured in many environments and at varying volumes. As a result, the data contains a lot of noise and unwanted signals. Using a pre-trained Speech Enhancement model [49], the signal was enhanced and the noise was removed. This process involved eliminating background music and enhancing the clarity of the audio waveform.

To handle volume differences between distinct utterances, the peak amplitude of all audio clips was normalized to a level of 1-3 dB. Noise in silent parts was also removed to achieve rhythm uniformity. Finally, the silence portions at the start and end of utterances were adjusted to 0.2 seconds.

2.2.6 Data Augmentation

In the expressive speech domain, it is common for audio data to have many examples for certain classes and fewer samples for other labels. An imbalance between the neutral label and the rest of the labels was found in both of the designed datasets. Additionally, the datasets lacked audio hours. To address these issues and improve performance, data augmentation techniques were implemented.

Oversampling creates more training data by copying existing data. This technique is useful in low-resource languages. It improves model performance and reduces overfitting, where the model memorizes training examples rather than generalizing to new data.

In speech synthesis, oversampling can be used to create more training examples by modifying the original audio using techniques such as pitch shifting, time stretching, and noise injection. These random transformations expose the model to a wider range of variations in the data, improving its accuracy and robustness. However, in my case, altering speech properties can affect emotional features and make the artificial speech sound unnatural. Additionally, cross-language transfer learning is not suitable for my Vietnamese data because Vietnamese is a tonal language, unlike English, which uses stress and rhythm for expression [50].

Due to the limited amount of speech data available for the happy, sad, and angry emotions, the data for these emotions was multiplied by three. This was done in order to reduce the imbalance between the different emotion labels and to prevent overfitting when training the ESS model on these emotions. By doing this, the model can be better trained to achieve higher accuracy in its predictions.
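The replication step can be illustrated as follows (a sketch only; the replication factor and the minority emotions come from the description above, while the manifest file names and column layout are assumptions):

import csv
from collections import Counter

REPLICATION_FACTOR = 3
MINORITY_EMOTIONS = {"happy", "sad", "angry"}

# Assumed manifest format: one (audio_path, transcript, emotion) row per utterance.
with open("lmh_train.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

augmented = []
for path, text, emotion in rows:
    copies = REPLICATION_FACTOR if emotion in MINORITY_EMOTIONS else 1
    augmented.extend([[path, text, emotion]] * copies)

print(Counter(r[2] for r in augmented))  # label counts after oversampling

with open("lmh_train_oversampled.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(augmented)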

2.3 Label Processing

The following two subsections will discuss the two methods used to label emotional data in the two datasets: manual annotation for the TTH dataset, and automatic annotation for the LMH dataset.

2.3.1 Manual Annotation

Manual annotation in emotional speech labeling refers to the process of manually labeling emotions in speech data. This involves listening to audio recordings of speech and identifying the emotions being conveyed by the speaker.

Manual annotation is the most reliable and accurate method of labeling data, since humans are better at recognizing and interpreting emotions than machines. However, it is a time-consuming and labor-intensive task. Moreover, emotions are subjective and can be interpreted differently by different people, which can lead to inconsistencies in labeling. Additionally, emotions are complex and multi-dimensional, making it difficult to assign a single label to a speech sample. Speech samples may convey multiple emotions simultaneously or switch between emotions rapidly, making accurate annotation challenging. A lack of standardized labeling schemes can also lead to inconsistencies in labeling and make it difficult to compare results across studies.

Usually, to ensure objectivity in emotional labeling, emotional audio is heard and voted on by a group of people. The chosen emotion must be approved by a majority of the group's members. If there are not enough listeners, it is suggested that labels be interpreted in context, particularly in conjunction with the discourse mode.

However, it is important to note that this method can sometimes be problematic, as emotions can be subjective and vary greatly among individuals. Additionally, cultural and linguistic differences can also play a role in how emotions are perceived and labeled.

During the building of the TTH dataset, since there was only one person labeling the video, the audio containing speech was listened to and placed in the context of a sentence based on the original video clip as well as the script text. This approach aimed to ensure that the emotional labeling was as accurate as possible despite the limited number of listeners. Moreover, the context of the sentence helped to provide a more complete understanding of the emotional content, allowing for a richer and more nuanced emotional labeling process.

2.3.2 Automatic Annotation

In this subsection, the method for automatic labeling of emotional speech data will be discussed.

a, Speech Emotion Recognition for Vietnamese

Speech Emotion Recognition (SER) is a field that aims to recognize human emotions through speech signals. Techniques used in SER include acoustic feature extraction, machine learning, and deep learning. Acoustic features, such as pitch and intensity, are extracted from speech signals and used to classify emotions. One of the main challenges in SER is the variability in speech patterns due to factors such as language, culture, and accent. Different languages and cultures may express emotions differently, making it difficult to develop universal SER models that can be applied across different populations.

Most of the research on SER has been conducted in English, with state-of-the-art (SOTA) results reaching 75-80% [51]–[54]. However, this limits the applicability of SER in low-resource languages such as Vietnamese. While some studies have been done in Vietnamese, such as [55], these publications use their own datasets, making them unsuitable for creating SER models for this thesis.

For emotion labeling, the TTH dataset was run through SER models that were available to the public. However, only 35% of the predicted labels matched the manual labels.
