HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER'S THESIS IN DATA SCIENCE AND ARTIFICIAL INTELLIGENCE

Efficient Neural Speech Synthesis

LAM XUAN THU
thu.lx202712m@sis.hust.edu.vn

Department: Computer Science

HANOI, 12/2021
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS

Author's full name: Lâm Xuân Thư
Thesis title: Improving speech synthesis using deep learning
Major: Data Science and Artificial Intelligence

The author, the scientific supervisor, and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting of 24/12/2021, with the following changes:
- Added a problem statement and objectives to Chapter 1.
- Separated Related work from Chapter 2 (Background) into its own chapter.
- Added further analysis of the related studies (Chapter 3).
- Analyzed the rationale for the components of the proposed model (Chapter 4).
- Gave more detail in the experiments chapter (Chapter 5).
- Re-presented some of the referenced theoretical content.

Date: ................, 2022
CHAIR OF THE COMMITTEE
Declaration of Authorship and Topic Sentences
Personal Information
Information and Communication Technology
Goals of the Thesis
• Proposing a novel neural network for Speech Synthesis
• Conducting experiments and evaluating the proposed model
Main Tasks of the Thesis
• Introduce the Speech Synthesis problem and review traditional approaches for this problem
• Present the machine learning and deep learning background for the Speech Synthesis problem
• Propose a novel neural network-based system for Speech Synthesis
• Implement experiments and evaluation
• Conclude and outline future developments
Declaration of Authorship
I, Lam Xuan Thu, hereby warrant that the work and presentation in this thesis were performed by myself under the supervision of Dr. Dinh Viet Sang.
All results presented in this thesis are truthful and are not copied from any other works.
Hanoi, 24th Nov 2021
Author
Lam Xuan Thu
Attestation of the Supervisor on the Fulfillment of the Requirements for the Thesis:
Hanoi, 24th Nov 2021
Supervisor

Dr. Dinh Viet Sang
Acknowledgements

I am extremely grateful to my supervisor, Dr. Dinh Viet Sang, who gave me the golden opportunity to work on this wonderful project on the topic of speech synthesis, which required a lot of research and through which I came to know many new things. It was a great privilege and honor to work and study under his guidance. I would also like to express my gratitude to my parents for their love, caring, and sacrifices in educating and preparing me for my future. Finally, I would like to thank my friends for their immense support and help during this project. Without their help, completing this project would have been very difficult.
Abstract

Speech synthesis technology is an essential component of current human-computer interaction systems; it helps users receive the output of intelligent machines more naturally and intuitively, and has thus received increasing interest in recent years. The primary use of speech synthesis, or text-to-speech technology, is to translate a text into spoken speech automatically. The current research focus is deep learning-based end-to-end speech synthesis, which has a more powerful modeling ability. This thesis proposes a novel deep learning-based speech synthesis model called FastTacotron, which resolves some issues of previous models. Experiments on the LJSpeech dataset show that FastTacotron can converge in just a couple of hours of training on a single GPU. Moreover, the proposed model accelerates the inference process while achieving high speech quality. Furthermore, our model also allows controlling the prosody of the synthesized speech, and thus can create expressive speech.

Keywords: Deep learning, text-to-speech
Table of Contents

1 Introduction
2 Background
2.1 Machine learning
2.2 Deep learning
2.3 1D Convolutional neural networks
2.4 Recurrent neural networks
2.5 Attention
2.6 Transformer
3 Related work
3.1 Autoregressive models
3.1.1 Tacotron
3.1.2 Tacotron 2
3.1.3 ForwardTacotron
3.2 Non-autoregressive models
3.2.1 FastSpeech
3.2.2 FastPitch
3.2.3 FastSpeech 2
4 Proposed method
4.1 Pre-net and Post-net
4.2 LSTM
4.3 Variation predictor
4.3.1 Duration predictor
4.3.2 Pitch predictor
4.3.3 Energy predictor
4.3.4 Length regulator
4.4 Vocoder
5 Experiments and Evaluations
5.1 Training Setup
5.1.1 Dataset
5.1.2 Grapheme-to-phoneme converter
5.1.3 Model configuration
5.2 Evaluations
5.2.1 Teacher model
5.2.2 Metrics
5.2.3 Results
List of Figures

1.1 Speech synthesis or Text-to-speech: the artificial production of human speech
1.2 Speech synthesis is used in a wide range of applications, such as assistive technology and multimedia
2.1 Machine learning: a new programming paradigm
2.2 A neural network is parameterized by its weights
2.3 A loss function measures the quality of the network's output
2.4 The loss score is used as a feedback signal to adjust the weights
2.5 1D Convolution
2.6 Recurrent neural network
2.7 The encoder-decoder model with additive attention mechanism [1]
2.8 An alignment graph
2.9 Transformer model architecture [2]
2.10 Scaled Dot-Product Attention (left) and Multi-Head Attention (right) [2]
3.1 Text-to-speech process
3.2 Tacotron model architecture [3]
3.3 Tacotron 2 model architecture [4]
3.4 ForwardTacotron model architecture [5]
3.5 FastSpeech model architecture [6]
3.6 FastPitch model architecture [7]
3.7 FastSpeech 2 model architecture [8]
4.1 Model architecture of FastTacotron
4.2 The CBHG module (1-D convolution bank + highway network + bidirectional GRU) [3]. The CBHG is a powerful module for extracting representations from sequences
4.3 The LSTM cell
4.4 Duration/Pitch/Energy Predictor. The duration, pitch, and energy predictors all have a similar model structure (but different model parameters)
4.5 Length Regulator [6]. This module is used to expand the length of the phoneme sequence to match the length of the mel-spectrogram sequence, as well as to control the voice speed and part of the prosody
4.6 MelGAN model architecture [9]
5.1 Mel loss and Pitch loss
5.2 Duration loss and Energy loss
5.3 TransformerTTS model [10]
5.4 Standard mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right)
5.5 Two times slower speed: mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right)
5.6 Two times faster speed: mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right)
5.7 MOS rating service interface
5.8 Amplified pitch: mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right)
5.9 Amplified energy: mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right)
List of Tables

5.1 LJSpeech dataset statistics [11]
5.2 Audio quality and inference latency comparison. MelGAN [9] was used as the vocoder. MOS is a numerical measure of the human-judged overall quality of synthesized speech, on a scale of 1 (bad) to 5 (excellent). RTF denotes the time (in seconds) required for the system to synthesize a one-second waveform
5.3 Training time comparison
Chapter 1
Introduction
Speech synthesis, or Text-to-speech (TTS), is defined as the artificial production of human speech. Its primary use is to translate a text into spoken speech automatically (figure 1.1).

Figure 1.1: Speech synthesis or Text-to-speech: the artificial production of human speech.
Speech synthesis is used in a wide range of applications. However, it's crucial to remember that this technology was created to assist persons with impairments (particularly the visually impaired) in their daily lives. For example, the famous Stephen Hawking, because of his severe impairment, relied on a TTS system to communicate with those around him. Since then, many use cases have been developed. In airports, this technology is used to generate voices that transmit messages to passengers. Another example can be found in language translation engines like Google Translate.

Figure 1.2: Speech synthesis is used in a wide range of applications, such as assistive technology and multimedia.

Due to its many practical applications, speech synthesis has attracted a lot of attention for years. In the past decades, concatenative synthesis with unit selection, the process of stitching small units of pre-recorded waveforms together [12, 13], was the mainstream technique. Concatenative synthesis, as the name suggests, is based on the concatenation of pre-recorded speech segments. The segments can be whole sentences, words, syllables, diphones, or even individual phones. They are usually stored in the form of waveforms or spectrograms. The segments are acquired with the help of a speech recognition system and then labeled based on their acoustic properties (e.g., their fundamental frequency). At run time, the desired sequence is created by determining the best chain of candidate units from the database (unit selection).
Statistical parametric speech synthesis [14–17], which directly generates smooth trajectories of speech features to be synthesized by a vocoder, followed, solving many of the issues that concatenative synthesis had with boundary artifacts. Parametric synthesis utilizes recorded human voices as well; the difference is that this method uses a function and a set of parameters to modify the voice. Statistical parametric synthesis generally has two parts: training and synthesis. During training, a set of parameters that characterize the audio sample, such as the frequency spectrum (vocal tract), fundamental frequency (voice source), and duration (prosody) of speech, is extracted. Those parameters are then estimated using a statistical model; the one proven to provide the best results historically is the Hidden Markov Model (HMM). During synthesis, the HMMs generate a set of parameters from the target text sequence, and those parameters are used to synthesize the final speech waveforms.
However, both concatenative synthesis and statistical parametric synthesis have complex pipelines, and defining good linguistic features is often time-consuming and language-specific, which requires a lot of resources and manpower. Besides, synthesized audio often sounds muffled and unnatural compared to human speech.

Recently, deep neural network-based systems have become more and more popular for TTS. Neural network-based TTS [3, 4, 6, 10, 18, 19] has a more robust modeling ability and a simpler pipeline. Although deep neural network-based TTS has achieved many good results so far, some issues remain. Some TTS models produce synthetic speech that does not sound truly natural compared to human voices, while others have very slow synthesis speed, making them unusable in real-time applications. Moreover, most models require a lot of hardware resources and time for training.

In this thesis, I have studied the most advanced TTS methods using artificial neural networks. Based on this study, I propose a new TTS method that takes advantage of the previous methods while overcoming some of their disadvantages. The goal of the proposed TTS model is to achieve both good speech quality and fast processing speed, as well as to reduce training time while using only modest hardware resources.
Chapter 2

Background

2.1 Machine learning

Figure 2.1: Machine learning: a new programming paradigm.
Rather than being explicitly programmed, a machine learning system is trained. It is presented with a large number of instances related to a task, and it detects statistical structure in these examples, allowing the system to develop rules for automating the work. For example, if you wanted to automate the process of tagging your vacation photos, you could provide a machine-learning system with a large number of examples of photos that have already been tagged by humans, and the system would learn statistical rules for associating specific photos with specific tags.
Three things are required to perform machine learning:

• Input data points. These data points could be sound files of people speaking, for example, if the task is speech recognition. They could be photos if the task is image tagging.

• Examples of the expected output. These could be human-generated transcripts of sound files in a speech-recognition task. In an image task, tags like "dog," "cat," and so on could be expected outputs.

• A way to measure whether the algorithm is doing a good job. This is required to determine the distance between the algorithm's current output and its expected output. The measurement is utilized as a feedback signal to change the algorithm's behavior. This process of adjustment is referred to as learning.
A machine-learning model "learns" how to transform its input data into meaningful outputs by being exposed to known examples of inputs and outputs. As a result, the key goal of machine learning and deep learning is to change data in a meaningful way: to learn useful representations of the input data at hand, representations that help us move closer to the desired output.
2.2 Deep learning

Deep learning is a subfield of machine learning that emphasizes learning successive layers of increasingly meaningful representations from data. The deep in deep learning refers to the idea of multiple layers of representations rather than any form of deeper knowledge produced by the method. The depth of the model refers to how many layers contribute to a model of the data. Other appropriate names for the field could have been layered representations learning and hierarchical representations learning.

Tens or even hundreds of consecutive layers of representations are typically used in modern deep learning, and they are all learned automatically from training data. Other approaches to machine learning, on the other hand, tend to focus on learning only one or two layers of data representations, which is why they are frequently referred to as shallow learning. In deep learning, these layered representations are (nearly always) learned using neural network models, which are arranged in literal layers stacked on top of each other.
How deep learning works. The weights of a layer, which are essentially a set of numbers, store the definition of what a layer performs with its incoming data. In technical terms, we can state that a layer's transformation is parameterized by its weights (see figure 2.2). (A layer's weights are also known as its parameters.) Learning in this instance refers to determining a set of weights for all layers in a network that allows the network to accurately map sample inputs to their corresponding targets. The problem is that a deep neural network can include tens of millions of parameters. Finding the correct value for each of them can be a difficult effort, especially when changing the value of one parameter changes the behavior of all the others.
Figure 2.2: A neural network is parameterized by its weights.
To be able to control something, you must first be able to observe it. To regulate a neural network's output, you must be able to measure how far it differs from what you intended. The network's loss function, also known as the objective function, is responsible for this. The loss function computes a distance score based on the network's predictions and the true target (what you wanted the network to output) to determine how well the network performed on this specific example (see figure 2.3).

The essential idea in deep learning is to utilize this score as a feedback signal to alter the weights' values slightly, in a direction that lowers the current example's loss score (see figure 2.4). This adjustment is the job of the optimizer, which implements the central algorithm in deep learning: the Backpropagation algorithm.
Figure 2.3: A loss function measures the quality of the network's output.

Figure 2.4: The loss score is used as a feedback signal to adjust the weights.

The network's weights are initially given random values, so it just performs a series of random transformations. Naturally, its output falls well short of what it should be, and the loss score reflects this. However, as the network analyzes more examples, the weights are modified slightly in the right direction, and the loss score decreases. This is the training loop, which, when repeated a sufficient number of times (usually tens of iterations over thousands of samples), produces weight values that minimize the loss function. A network with a minimal loss has outputs that are as close to the targets as possible: a trained network.
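To make the training loop concrete, here is a minimal NumPy sketch of this weights-loss-feedback cycle on a toy one-layer linear model. The data, learning rate, and number of steps are illustrative assumptions, not the configuration used elsewhere in this thesis.

```python
import numpy as np

# Toy regression data: 100 samples with 4 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

W = rng.normal(size=4) * 0.01              # weights start at random values
lr = 0.1                                   # learning rate (step size)

for step in range(200):                    # the training loop
    y_pred = X @ W                         # forward pass: the layer's transformation
    loss = np.mean((y_pred - y) ** 2)      # loss function: distance score
    grad = 2 * X.T @ (y_pred - y) / len(y) # feedback signal (gradient of the loss)
    W -= lr * grad                         # optimizer step: adjust the weights
```

After enough iterations the loss score shrinks and W approaches the weights that generated the data, which is exactly the behavior described above.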
The following sections introduce 1D convolutional neural networks, recurrent neural networks, and the Transformer, three types of deep-learning models almost universally used in speech synthesis applications.
2.3 1D Convolutional neural networks
The 1D convolution layers extract local 1D (one-dimensional) patches from sequence tensors (e.g., audio, sentences) and apply an identical transformation to every patch (see figure 2.5). Local patterns in a sequence can be recognized using such 1D convolution layers. Because every patch goes through the same input transformation, a pattern learned at one position in a sequence can be recognized at another, making 1D convolutional neural networks (1D CNNs) translation invariant (for temporal translations).

For example, a 1D CNN processing character sequences with convolution windows of size 5 should be able to learn and recognize words or word fragments of length 5 or fewer in any context in an input sequence. Word morphology can thus be learned by a character-level 1D convnet.
Mathematically, a convolution is an integration that expresses the amount of overlap of one function g as it is shifted over another function f; it acts as a blender that mixes one function with another, reducing the data space while preserving the information:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

Figure 2.5: 1D Convolution.

In terms of deep learning and neural networks, convolutions are learnable filters (matrices/vectors) that extract low-dimensional features from input data. The geometric or positional relationships between input data points are preserved by them. By establishing a local connection pattern between neurons of adjacent layers, convolutional neural networks take advantage of spatially local correlation. Convolution is the process of applying the sliding-window concept (a filter with learnable weights) to the input and creating a weighted sum (of the weights and the input) as the output. The weighted sum is the feature space, which is utilized as the input for the successive layers.
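As a sketch of the sliding-window weighted sum described above, the following NumPy snippet applies a single 1D filter to a toy sequence. The filter values and input are made up for illustration; in a real network the filter weights would be learned.

```python
import numpy as np

def conv1d(x, w):
    """Slide filter w over sequence x; each output is one weighted sum ("valid" mode)."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0])  # a toy 1D sequence
w = np.array([0.25, 0.5, 0.25])                     # one filter (fixed here, learned in practice)
print(conv1d(x, w))  # every patch goes through the identical transformation
```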
2.4 Recurrent neural networks
A recurrent neural network, also known as an RNN, is a type of artificial neural network that works with time series or sequential data. This deep learning algorithm is commonly used for ordinal or temporal problems, such as language translation, natural language processing (NLP), speech recognition, image captioning, and speech synthesis.
A recurrent neural network (RNN) processes sequences by iterating through the elements in the sequence and keeping a state that contains information about what it has seen so far. An RNN is essentially a type of neural network with an internal loop (see figure 2.6). Between processing two different, independent sequences, the state of the RNN is reset, so one sequence is still considered a single data point: a single input to the network. This data point is no longer processed in a single step; instead, the network loops through the sequence elements internally, combining each element with the state accumulated from the previous timesteps.

Figure 2.6: Recurrent neural network.

Recurrent neural networks, like feedforward and convolutional neural networks, learn from training input. They are distinguished by their "memory", which allows knowledge from previous inputs to influence the current input and output. While typical deep neural networks presume that inputs and outputs are independent of each other, the output of a recurrent neural network is reliant on the sequence's prior elements. Recurrent networks are further distinguished by the fact that their parameters are shared: while each node in a feedforward network has its own weight, a recurrent layer reuses the same weight parameters at every timestep.
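The internal loop and the weight sharing across timesteps can be sketched in a few lines of NumPy. The dimensions, the tanh activation, and the random weights are illustrative assumptions, not any particular trained RNN.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Process a sequence step by step, carrying a state between steps."""
    state = np.zeros(W_h.shape[0])      # state is reset for each new sequence
    for x_t in inputs:                  # the internal loop over timesteps
        # The same weights W_x, W_h are reused at every timestep (parameter sharing).
        state = np.tanh(W_x @ x_t + W_h @ state + b)
    return state                        # a summary of everything seen so far

rng = np.random.default_rng(0)
seq = [rng.normal(size=8) for _ in range(5)]  # 5 timesteps, 8 features each
W_x = rng.normal(size=(16, 8)) * 0.1
W_h = rng.normal(size=(16, 16)) * 0.1
print(rnn_forward(seq, W_x, W_h, np.zeros(16)).shape)  # (16,)
```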
2.5 Attention
A critical disadvantage of the encoder-decoder RNN/LSTM model is its incapability of remembering long sentences: often it has forgotten the first part by the time it completes processing the whole input. Another problem is that there is no way to give more importance to some of the input tokens compared to others while processing the input sequence. The attention mechanism [1] was born to resolve these problems (see figure 2.7).
Figure 2.7: The encoder-decoder model with additive attention mechanism [1]
In this model, the context vector is a sum of the hidden states of the input sequence, weighted by alignment scores:

$$c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i$$

$$\alpha_{t,i} = \frac{\exp\!\left(\mathrm{score}(s_{t-1}, h_i)\right)}{\sum_{i'=1}^{n} \exp\!\left(\mathrm{score}(s_{t-1}, h_{i'})\right)}$$

The alignment scores are weights defining how much of each source's hidden state should be considered for each output. In [1], the score is parameterized by a feed-forward network with a single hidden layer, and this network is jointly trained with the other parts of the model. The score function is therefore in the following form, given that tanh is used as the non-linear activation function:

$$\mathrm{score}(s_t, h_i) = v_a^\top \tanh\!\left(W_a\, [s_t; h_i]\right)$$

The matrix of alignment scores is a nice byproduct that explicitly shows the correlation between source and target tokens (figure 2.8).

Figure 2.8: An alignment graph.
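A minimal sketch of this additive attention step for one decoder timestep follows, assuming toy dimensions; W_a and v_a stand for the learned parameters in the score function above, randomly initialized here rather than trained.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_attention(s_prev, H, W_a, v_a):
    """Return the context vector c_t and alignment weights for one decoder step."""
    # score(s_{t-1}, h_i) = v_a^T tanh(W_a [s_{t-1}; h_i]) for every source state h_i
    scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h])) for h in H])
    alpha = softmax(scores)      # alignment weights over the source positions
    return alpha @ H, alpha      # context vector: weighted sum of hidden states

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 32))             # 6 encoder hidden states
s_prev = rng.normal(size=32)             # previous decoder state
W_a = rng.normal(size=(32, 64)) * 0.1    # "trained" parameters (random here)
v_a = rng.normal(size=32) * 0.1
context, alpha = additive_attention(s_prev, H, W_a, v_a)
print(alpha.round(2))                    # one row of the alignment matrix
```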
2.6 Transformer
Transformer [2] is one of the best ways to capture temporal dependencies in sequences. Like an RNN, the Transformer is an architecture for converting one sequence into another using two parts (encoder and decoder); however, it differs from the previously existing sequence-to-sequence models in that it does not employ recurrent networks. The Transformer model is based on self-attention: the ability to attend to distinct positions in an input sequence to compute a representation of that sequence. Its scaled dot-product attention is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Figure 2.9: Transformer model architecture [2].

Q is a matrix that contains the queries (vector representations of single words in the sequence), K contains all the keys (vector representations of all the words in the sequence), and V contains the values, which are again the vector representations of all the words in the sequence. For the encoder and decoder multi-head attention modules, V consists of the same word sequence as Q. However, for the attention module that takes into account both the encoder and the decoder sequences, V is different from the sequence represented by Q.
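A minimal NumPy sketch of scaled dot-product attention as defined above; the matrix sizes are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed with a row-wise softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))   # 4 query vectors (e.g., 4 target positions)
K = rng.normal(size=(10, 64))  # 10 key vectors (e.g., 10 source positions)
V = rng.normal(size=(10, 64))  # one value vector per key
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
```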
Attention layers perceive their input as a collection of vectors in no particular order. There are no recurrent or convolutional layers in this model. As a result, a "positional encoding" is used to provide the model with information about the relative positions of the tokens in the sequence (see figure 2.9). The following is the formula for generating the positional encoding:

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right) \quad (2.8)$$

$$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right) \quad (2.9)$$

where pos is the position and i is the dimension.
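The positional-encoding matrix from equations (2.8) and (2.9) can be computed directly; this is a small NumPy sketch with illustrative sizes.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Build the sin/cos positional-encoding matrix from equations (2.8)-(2.9)."""
    pos = np.arange(max_len)[:, None]            # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]         # index over dimension pairs
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

print(positional_encoding(max_len=50, d_model=512).shape)  # (50, 512)
```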
Figure 2.10: Scaled Dot-Product Attention (left) and Multi-Head Attention (right) [2].
Chapter 3

Related work

Figure 3.1: Text-to-speech process.
3.1 Autoregressive models
Autoregressive models [1, 2, 27] are usually built on the encoder-decoder framework. In an encoder-decoder TTS model, the encoder receives the source sequence and maps it to hidden representations; the decoder then generates each output element conditioned on those representations and the previously generated elements. The encoder and the decoder are usually based on an RNN or a Transformer model. The attention mechanism [1] is also inserted to determine which source representations to focus on when predicting the current element, and it is a crucial component for sequence-to-sequence learning.
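The autoregressive generation pattern can be summarized in a short Python sketch. Here encode, attend, and decode_step are hypothetical placeholders standing in for whatever networks a concrete model (such as Tacotron, below) uses; they are not functions from any specific implementation.

```python
def autoregressive_tts(text, encode, attend, decode_step, max_frames=1000):
    """Generate mel frames one at a time, each conditioned on previous outputs."""
    memory = encode(text)                # encoder: source sequence -> representations
    frame, state, frames = None, None, []
    for _ in range(max_frames):
        context = attend(state, memory)  # attention: which source states to focus on
        frame, state, stop = decode_step(frame, context, state)
        frames.append(frame)             # each new frame depends on all previous ones
        if stop:                         # a stop flag/token ends generation
            break
    return frames
```

The loop makes the key trade-off visible: quality benefits from conditioning on past frames, but inference time grows with the number of output frames, which motivates the non-autoregressive models of Section 3.2.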
3.1.1 Tacotron
Tacotron (figure 3.2) is an end-to-end solution for TTS. It is essentially a sequence-to-sequence model with the well-known encoder-decoder architecture, and it also uses an attention mechanism. The model accepts characters as input and outputs the final speech's raw spectrogram, which is then transformed to a waveform by the Griffin-Lim [25] algorithm.

Figure 3.2: Tacotron model architecture [3].

The encoder's purpose is to extract stable text representations in a sequential order. It takes a character sequence expressed as one-hot encodings and produces the final representation via a stack of Pre-Nets and CBHG modules. The Pre-Net describes the non-linear transformations applied to each embedding.

The representation is passed to the decoder through content-based attention, with a recurrent layer producing the attention query at each time step. The query is combined with the context vector and delivered to a stack of GRU cells with residual connections. A second post-processing network, which includes a CBHG module, converts the decoder's output to the final waveform.
3.1.2 Tacotron 2
Tacotron 2 [4] is an improved and simplified version of Tacotron [3]. Instead of Pre-Nets and CBHG modules, the encoder now has three convolutional layers and a bidirectional LSTM. The basic additive attention mechanism was enhanced with location-sensitive attention. A Pre-Net, two uni-directional LSTMs, and a 5-layer convolutional Post-Net now make up the decoder. Following PixelCNN++ and Parallel WaveNet, a modified WaveNet is employed as the vocoder; mel-spectrograms are created and sent to the vocoder rather than linear-scale spectrograms. The Griffin-Lim algorithm of Tacotron was thus replaced with the WaveNet method. The model architecture of Tacotron 2 is shown in figure 3.3.

Figure 3.3: Tacotron 2 model architecture [4].
3.1.3 ForwardTacotron
ForwardTacotron [5] combines ideas from the FastSpeech model with the Tacotron architecture (figure 3.4). It re-implements the fundamental concept by removing the autoregressive attentive element of the recurrent Tacotron architecture, allowing it to estimate mel-spectrograms in a single forward pass.

The essential module is a length regulator, adopted from FastSpeech, which expands the phoneme embeddings according to the predicted durations. ForwardTacotron isolates the duration predictor module from the rest of the model, since this improved the mel quality. Because of the absence of attention, the memory need rises only linearly with sequence length, allowing the model to predict whole articles at once. Predictions are not only accurate, but also fast.
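Since the length regulator is the essential module here, the following NumPy sketch shows the core idea: each phoneme embedding is repeated according to its predicted duration, and a speed factor rescales the durations for voice-speed control. The function name, the rounding choice, and the speed convention are my own illustrative assumptions, not the exact ForwardTacotron or FastSpeech code.

```python
import numpy as np

def length_regulator(phoneme_embeddings, durations, speed=1.0):
    """Repeat each phoneme embedding `duration` times to match the mel length.

    speed < 1.0 stretches durations (slower speech); speed > 1.0 shortens them.
    """
    scaled = np.maximum(1, np.round(np.array(durations) / speed).astype(int))
    return np.repeat(phoneme_embeddings, scaled, axis=0)

embeddings = np.arange(12, dtype=float).reshape(3, 4)  # 3 phonemes, embedding dim 4
durations = [2, 3, 1]                                  # predicted mel frames per phoneme
print(length_regulator(embeddings, durations).shape)             # (6, 4)
print(length_regulator(embeddings, durations, speed=0.5).shape)  # (12, 4): 2x slower
```

Because the expansion is a plain repeat with no attention, the whole mel-length sequence is produced in one pass, which is what enables the single-forward-pass inference described above.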