HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER'S THESIS IN DATA SCIENCE AND ARTIFICIAL INTELLIGENCE

Efficient Neural Speech Synthesis

LAM XUAN THU
thu.lx202712m@sis.hust.edu.vn

Department: Computer Science

HANOI, 12/2021
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS

Author's full name: Lâm Xuân Thư
Thesis title: Improving speech synthesis using deep learning
Major: Data Science and Artificial Intelligence

The author, the scientific supervisor, and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting of 24/12/2021, with the following changes:
- Added a problem statement and objectives to Chapter 1.
- Separated Related work from Chapter 2 (Background) into its own chapter.
- Added further analysis of the related studies (Chapter 3).
- Analyzed the rationale for the components of the proposed model (Chapter 4).
- Gave more detail in the experiments chapter (Chapter 5).
- Re-presented some of the referenced theoretical content.

Date: ................, 2022
CHAIR OF THE COMMITTEE
Declaration of Authorship and Topic Sentences
Personal Information
Information and Communication Technology
Goals of the Thesis
• Proposing a novel neural network for Speech Synthesis
• Conducting experiments and evaluating the proposed model
Main Tasks of the Thesis
• Introduce the Speech Synthesis problem and review traditional approaches for this problem
• Present the machine learning and deep learning background for the Speech Synthesis problem
• Propose a novel neural network-based system for Speech Synthesis
• Implement experiments and evaluation
• Conclude and outline future developments
Declaration of Authorship
I, Lam Xuan Thu, hereby warrant that the work and presentation in this thesis were performed by myself under the supervision of Dr. Dinh Viet Sang.
All results presented in this thesis are truthful and are not copied from any other works.
Hanoi, 24th Nov 2021
Author
Lam Xuan Thu
Attestation of the Supervisor on the Fulfillment of the Requirements for the Thesis:
Hanoi, 24th Nov 2021
Supervisor

Dr. Dinh Viet Sang
Acknowledgements

I am extremely grateful to my supervisor, Dr. Dinh Viet Sang, who gave me the golden opportunity to work on this wonderful project on the topic of speech synthesis, which required a lot of research and through which I came to know many new things. It was a great privilege and honor to work and study under his guidance. I would also like to express my gratitude to my parents for their love, caring, and sacrifices in educating and preparing me for my future. Finally, I would like to thank my friends for their immense support and help during this project. Without their help, completing this project would have been very difficult.
Abstract

Speech synthesis technology is an essential component of current human-computer interaction systems; it helps users receive the output of intelligent machines more naturally and intuitively, and has thus received increasing interest in recent years. The primary use of speech synthesis, or text-to-speech technology, is to translate a text into spoken speech automatically. The current research focus is deep learning-based end-to-end speech synthesis, which has a more powerful modeling ability. This thesis proposes a novel deep learning-based speech synthesis model called FastTacotron, which resolves some issues of previous models. Experiments on the LJSpeech dataset show that FastTacotron can converge in just a couple of hours of training on a single GPU. Moreover, the proposed model accelerates the inference process while achieving high speech quality. Furthermore, our model also allows controlling the prosody of the synthesized speech, and thus can create expressive speech.

Keywords: Deep learning, text-to-speech
Table of Contents

1 Introduction
2 Background
2.1 Machine learning
2.2 Deep learning
2.3 1D Convolutional neural networks
2.4 Recurrent neural networks
2.5 Attention
2.6 Transformer
3 Related work
3.1 Autoregressive models
3.1.1 Tacotron
3.1.2 Tacotron 2
3.1.3 ForwardTacotron
3.2 Non-autoregressive models
3.2.1 FastSpeech
3.2.2 FastPitch
3.2.3 FastSpeech 2
4 Proposed method
4.1 Pre-net and Post-net
4.2 LSTM
4.3 Variation predictor
4.3.1 Duration predictor
4.3.2 Pitch predictor
4.3.3 Energy predictor
4.3.4 Length regulator
4.4 Vocoder
5 Experiments and Evaluations
5.1 Training Setup
5.1.1 Dataset
5.1.2 Grapheme-to-phoneme converter
5.1.3 Model configuration
5.2 Evaluations
5.2.1 Teacher model
5.2.2 Metrics
5.2.3 Results
List of Figures

1.1 Speech synthesis or Text-to-speech: the artificial production of human speech
1.2 Speech synthesis is used in a wide range of applications, such as assistive technology and multimedia
2.1 Machine learning: a new programming paradigm
2.2 A neural network is parameterized by its weights
2.3 A loss function measures the quality of the network's output
2.4 The loss score is used as a feedback signal to adjust the weights
2.5 1D Convolution
2.6 Recurrent neural network
2.7 The encoder-decoder model with additive attention mechanism [1]
2.8 An alignment graph
2.9 Transformer model architecture [2]
2.10 Scaled Dot-Product Attention (left) and Multi-Head Attention (right) [2]
3.1 Text-to-speech process
3.2 Tacotron model architecture [3]
3.3 Tacotron 2 model architecture [4]
3.4 ForwardTacotron model architecture [5]
3.5 FastSpeech model architecture [6]
3.6 FastPitch model architecture [7]
3.7 FastSpeech 2 model architecture [8]
4.1 Model architecture of FastTacotron
4.2 The CBHG module (1-D convolution bank + highway network + bidirectional GRU) [3]. The CBHG is a powerful module for extracting representations from sequences
4.3 The LSTM cell
4.4 Duration/Pitch/Energy Predictor. The duration, pitch, and energy predictors all have a similar model structure (but different model parameters)
4.5 Length Regulator [6]. This module is used to expand the length of the phoneme sequence to match the length of the mel-spectrogram sequence, as well as to control the voice speed and part of the prosody
4.6 MelGAN model architecture [9]
5.1 Mel loss and Pitch loss
5.2 Duration loss and Energy loss
5.3 TransformerTTS model [10]
5.4 Standard mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right)
5.5 Two times slower speed: mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right)
5.6 Two times faster speed: mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right)
5.7 MOS rating service interface
5.8 Amplified pitch: mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right)
5.9 Amplified energy: mel-spectrogram and corresponding phoneme alignment of the sentence "It's another hot day" (left) and the sentence "I love cats" (right)
List of Tables

5.1 LJSpeech dataset statistics [11]
5.2 Audio quality and inference latency comparison. MelGAN [9] was used as the vocoder. MOS is a numerical measure of the human-judged overall quality of synthesized speech, on a scale of 1 (bad) to 5 (excellent). RTF denotes the time (in seconds) required for the system to synthesize a one-second waveform
5.3 Training time comparison
Chapter 1
Introduction
Speech synthesis, or Text-to-speech (TTS), is defined as the artificial production of human speech. Its primary use is to translate a text into spoken speech automatically (figure 1.1).

Figure 1.1: Speech synthesis or Text-to-speech: the artificial production of human speech.
Speech synthesis is used in a wide range of applications. However, it's crucial to remember that this technology was created to assist persons with impairments (particularly the visually impaired) in their daily lives. For example, the famous Stephen Hawking, because of his severe impairment, relied on a TTS system to communicate with those around him. Since then, many use cases have been developed. In airports, this technology is used to generate voices that transmit messages to passengers. Another example can be found in language translation engines like Google Translate.

Figure 1.2: Speech synthesis is used in a wide range of applications, such as assistive technology and multimedia.

Due to its many practical applications, speech synthesis has attracted a lot of attention for years. In the past decades, concatenative synthesis with unit selection, the process of stitching small units of pre-recorded waveforms together [12, 13], was the mainstream technique. Concatenative synthesis, as the name suggests, is based on the concatenation of pre-recorded speech segments. The segments can be whole sentences, words, syllables, diphones, or even individual phones. They are usually stored in the form of waveforms or spectrograms. The segments are acquired with the help of a speech recognition system and then labeled based on their acoustic properties (e.g., their fundamental frequency). At run time, the desired sequence is created by determining the best chain of candidate units from the database (unit selection).
Statistical parametric speech synthesis [14–17], which directly generates smooth trajectories of speech features to be synthesized by a vocoder, followed, solving many of the issues that concatenative synthesis had with boundary artifacts. Parametric synthesis utilizes recorded human voices as well; the difference is that this method uses a function and a set of parameters to modify the voice. Statistical parametric synthesis generally has two parts: training and synthesis. During training, a set of parameters that characterize the audio sample, such as the frequency spectrum (vocal tract), fundamental frequency (voice source), and duration (prosody) of speech, is extracted. Those parameters are then estimated using a statistical model; the one proven to provide the best results historically is the Hidden Markov Model (HMM). During synthesis, the HMMs generate a set of parameters from the target text sequence, and those parameters are used to synthesize the final speech waveforms.
However, both concatenative synthesis and statistical parametric synthesis have complex pipelines, and defining good linguistic features is often time-consuming and language-specific, which requires a lot of resources and manpower. Besides, synthesized audio often sounds muffled and unnatural compared to human speech.

Recently, deep neural network-based systems have become more and more popular for TTS. Neural network-based TTS [3, 4, 6, 10, 18, 19] has a more robust modeling ability and a simpler pipeline. Although deep neural network-based TTS has achieved many good results so far, some issues remain. Some TTS models produce synthetic speech that does not sound truly natural compared to human voices, while others have very slow synthesis speed, making them unusable in real-time applications. Moreover, most models require a lot of hardware resources and time for training.

In this thesis, I have studied the most advanced TTS methods using artificial neural networks. Based on this study, I propose a new TTS method that takes advantage of the previous methods while overcoming some of their disadvantages. The goal of the proposed TTS model is to achieve both good speech quality and fast processing speed, as well as to reduce training time while using only modest hardware resources.
Chapter 2

Background

2.1 Machine learning

Figure 2.1: Machine learning: a new programming paradigm.
Rather than being explicitly programmed, a machine learning system is trained. It is presented with a large number of instances related to a task, and it detects statistical structure in these examples, allowing the system to develop rules for automating the work. For example, if you wanted to automate the process of tagging your vacation photos, you could provide a machine-learning system with a large number of examples of photos that have already been tagged by humans, and the system would learn statistical rules for associating specific photos with specific tags.
Three things are required to perform machine learning:

• Input data points. These data points could be sound files of people speaking, for example, if the task is speech recognition. They could be photos if the task is image tagging.

• Examples of the expected output. These could be human-generated transcripts of sound files in a speech-recognition task. In an image task, tags like "dog," "cat," and so on could be expected outputs.

• A way to measure whether the algorithm is doing a good job. This is required to determine the distance between the algorithm's current output and its expected output. The measurement is utilized as a feedback signal to change the algorithm's behavior. This process of adjustment is referred to as learning.
A machine-learning model "learns" how to transform its input data into meaningful outputs by being exposed to known examples of inputs and outputs. As a result, the key goal of machine learning and deep learning is to change data in a meaningful way: to learn useful representations of the input data at hand, representations that help us move closer to the desired output.
2.2 Deep learning

Deep learning is a subfield of machine learning that emphasizes learning successive layers of increasingly meaningful representations from data. The deep in deep learning refers to the idea of multiple layers of representations rather than any form of deeper knowledge produced by the method. The depth of the model refers to how many layers contribute to a model of the data. Other appropriate names for the field could have been layered representations learning and hierarchical representations learning.

Tens or even hundreds of consecutive layers of representations are typically used in modern deep learning, and they are all learned automatically from training data. Other approaches to machine learning, on the other hand, tend to focus on learning only one or two layers of data representations, which is why they are frequently referred to as shallow learning. In deep learning, these layered representations are (nearly always) learned using neural network models, which are arranged in literal layers stacked on top of each other.
How deep learning works. The weights of a layer, which are essentially a set of numbers, store the definition of what a layer performs with its incoming data. In technical terms, we can state that a layer's transformation is parameterized by its weights (see figure 2.2). (A layer's weights are also known as its parameters.) Learning in this instance refers to determining a set of weights for all layers in a network that allows the network to accurately map sample inputs to their corresponding targets. The problem is that a deep neural network can include tens of millions of parameters. Finding the correct value for each of them can be a difficult effort, especially when changing the value of one parameter changes the behavior of all the others.
Figure 2.2: A neural network is parameterized by its weights.
To be able to control something, you must first be able to observe it. To regulate a neural network's output, you must be able to measure how far it differs from what you intended. The network's loss function, also known as the objective function, is responsible for this. The loss function computes a distance score based on the network's predictions and the true target (what you wanted the network to output) to determine how well the network performed on this specific example (see figure 2.3).

The essential idea in deep learning is to utilize this score as a feedback signal to alter the weights' values slightly, in a direction that lowers the current example's loss score (see figure 2.4). This adjustment is the job of the optimizer, which implements the central algorithm in deep learning: the Backpropagation algorithm.
Figure 2.3: A loss function measures the quality of the network's output.

Figure 2.4: The loss score is used as a feedback signal to adjust the weights.

The network's weights are initially given random values, so it just performs a series of random transformations. Naturally, its output falls well short of what it should be, and the loss score reflects this. However, as the network analyzes more examples, the weights are modified slightly in the right direction, and the loss score decreases. This is the training loop, which, when repeated a sufficient number of times (usually tens of iterations over thousands of samples), produces weight values that minimize the loss function. A network with a minimal loss has outputs that are as close to the targets as possible: a trained network.
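To make the training loop concrete, here is a minimal NumPy sketch of this weights-loss-feedback cycle on a toy one-layer linear model. The data, learning rate, and number of steps are illustrative assumptions, not the configuration used elsewhere in this thesis.

```python
import numpy as np

# Toy regression data: 100 samples with 4 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

W = rng.normal(size=4) * 0.01              # weights start at random values
lr = 0.1                                   # learning rate (step size)

for step in range(200):                    # the training loop
    y_pred = X @ W                         # forward pass: the layer's transformation
    loss = np.mean((y_pred - y) ** 2)      # loss function: distance score
    grad = 2 * X.T @ (y_pred - y) / len(y) # feedback signal (gradient of the loss)
    W -= lr * grad                         # optimizer step: adjust the weights
```

After enough iterations the loss score shrinks and W approaches the weights that generated the data, which is exactly the behavior described above.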
The following sections introduce 1D convolutional neural networks, recurrent neural networks, and the Transformer, three types of deep-learning models almost universally used in speech synthesis applications.
2.3 1D Convolutional neural networks
The 1D convolution layers extract local 1D (one-dimensional) patches from sequence tensors (e.g., audio, sentences) and apply an identical transformation to every patch (see figure 2.5). Local patterns in a sequence can be recognized using such 1D convolution layers. Because every patch goes through the same input transformation, a pattern learned at one position in a sequence can be recognized at another, making 1D convolutional neural networks (1D CNNs) translation invariant (for temporal translations).

For example, a 1D CNN processing character sequences with convolution windows of size 5 should be able to learn and recognize words or word fragments of length 5 or fewer in any context in an input sequence. Word morphology can thus be learned by a character-level 1D convnet.
Mathematically, a convolution is an integration that expresses the amount of overlap of one function g as it is shifted over another function f; it acts as a blender that mixes one function with another, reducing the data space while preserving the information:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

Figure 2.5: 1D Convolution.

In terms of deep learning and neural networks, convolutions are learnable filters (matrices/vectors) that extract low-dimensional features from input data. The geometric or positional relationships between input data points are preserved by them. By establishing a local connection pattern between neurons of adjacent layers, convolutional neural networks take advantage of spatially local correlation. Convolution is the process of applying the sliding-window concept (a filter with learnable weights) to the input and creating a weighted sum (of the weights and the input) as the output. The weighted sum is the feature space, which is utilized as the input for the successive layers.
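As a sketch of the sliding-window weighted sum described above, the following NumPy snippet applies a single 1D filter to a toy sequence. The filter values and input are made up for illustration; in a real network the filter weights would be learned.

```python
import numpy as np

def conv1d(x, w):
    """Slide filter w over sequence x; each output is one weighted sum ("valid" mode)."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0])  # a toy 1D sequence
w = np.array([0.25, 0.5, 0.25])                     # one filter (fixed here, learned in practice)
print(conv1d(x, w))  # every patch goes through the identical transformation
```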
2.4 Recurrent neural networks
A recurrent neural network, also known as an RNN, is a type of artificial neural network that works with time series or sequential data. This deep learning algorithm is commonly used for ordinal or temporal problems, such as language translation, natural language processing (NLP), speech recognition, image captioning, and speech synthesis.
A recurrent neural network (RNN) processes sequences by iterating through the elements in the sequence and keeping a state that contains information about what it has seen so far. An RNN is essentially a type of neural network with an internal loop (see figure 2.6). Between processing two different, independent sequences, the state of the RNN is reset, so one sequence is still considered a single data point: a single input to the network. This data point is no longer processed in a single step; instead, the network loops through the sequence elements internally, combining each element with the state accumulated from the previous timesteps.

Figure 2.6: Recurrent neural network.

Recurrent neural networks, like feedforward and convolutional neural networks, learn from training input. They are distinguished by their "memory", which allows knowledge from previous inputs to influence the current input and output. While typical deep neural networks presume that inputs and outputs are independent of each other, the output of a recurrent neural network is reliant on the sequence's prior elements. Recurrent networks are further distinguished by the fact that their parameters are shared: while each node in a feedforward network has its own weight, a recurrent layer reuses the same weight parameters at every timestep.
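The internal loop and the weight sharing across timesteps can be sketched in a few lines of NumPy. The dimensions, the tanh activation, and the random weights are illustrative assumptions, not any particular trained RNN.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Process a sequence step by step, carrying a state between steps."""
    state = np.zeros(W_h.shape[0])      # state is reset for each new sequence
    for x_t in inputs:                  # the internal loop over timesteps
        # The same weights W_x, W_h are reused at every timestep (parameter sharing).
        state = np.tanh(W_x @ x_t + W_h @ state + b)
    return state                        # a summary of everything seen so far

rng = np.random.default_rng(0)
seq = [rng.normal(size=8) for _ in range(5)]  # 5 timesteps, 8 features each
W_x = rng.normal(size=(16, 8)) * 0.1
W_h = rng.normal(size=(16, 16)) * 0.1
print(rnn_forward(seq, W_x, W_h, np.zeros(16)).shape)  # (16,)
```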
2.5 Attention
A critical disadvantage of the encoder-decoder RNN/LSTM model is its incapability of remembering long sentences: often it has forgotten the first part by the time it completes processing the whole input. Another problem is that there is no way to give more importance to some of the input tokens compared to others while processing the input sequence. The attention mechanism [1] was born to resolve these problems (see figure 2.7).
Figure 2.7: The encoder-decoder model with additive attention mechanism [1]
In this model, the context vector is a sum of the hidden states of the input sequence, weighted by alignment scores:

$$c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i$$

$$\alpha_{t,i} = \frac{\exp\!\left(\mathrm{score}(s_{t-1}, h_i)\right)}{\sum_{i'=1}^{n} \exp\!\left(\mathrm{score}(s_{t-1}, h_{i'})\right)}$$

The alignment scores are weights defining how much of each source's hidden state should be considered for each output. In [1], the score is parameterized by a feed-forward network with a single hidden layer, and this network is jointly trained with the other parts of the model. The score function is therefore in the following form, given that tanh is used as the non-linear activation function:

$$\mathrm{score}(s_t, h_i) = v_a^\top \tanh\!\left(W_a\, [s_t; h_i]\right)$$

The matrix of alignment scores is a nice byproduct that explicitly shows the correlation between source and target tokens (figure 2.8).

Figure 2.8: An alignment graph.
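A minimal sketch of this additive attention step for one decoder timestep follows, assuming toy dimensions; W_a and v_a stand for the learned parameters in the score function above, randomly initialized here rather than trained.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_attention(s_prev, H, W_a, v_a):
    """Return the context vector c_t and alignment weights for one decoder step."""
    # score(s_{t-1}, h_i) = v_a^T tanh(W_a [s_{t-1}; h_i]) for every source state h_i
    scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h])) for h in H])
    alpha = softmax(scores)      # alignment weights over the source positions
    return alpha @ H, alpha      # context vector: weighted sum of hidden states

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 32))             # 6 encoder hidden states
s_prev = rng.normal(size=32)             # previous decoder state
W_a = rng.normal(size=(32, 64)) * 0.1    # "trained" parameters (random here)
v_a = rng.normal(size=32) * 0.1
context, alpha = additive_attention(s_prev, H, W_a, v_a)
print(alpha.round(2))                    # one row of the alignment matrix
```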
2.6 Transformer
Transformer [2] is one of the best ways to capture temporal dependencies in sequences. Like an RNN, the Transformer is an architecture for converting one sequence into another using two parts (encoder and decoder); however, it differs from the previously existing sequence-to-sequence models in that it does not employ recurrent networks. The Transformer model is based on self-attention: the ability to attend to distinct positions in an input sequence to compute a representation of that sequence. Its scaled dot-product attention is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Figure 2.9: Transformer model architecture [2].

Q is a matrix that contains the queries (vector representations of single words in the sequence), K contains all the keys (vector representations of all the words in the sequence), and V contains the values, which are again the vector representations of all the words in the sequence. For the encoder and decoder multi-head attention modules, V consists of the same word sequence as Q. However, for the attention module that takes into account both the encoder and the decoder sequences, V is different from the sequence represented by Q.
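A minimal NumPy sketch of scaled dot-product attention as defined above; the matrix sizes are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed with a row-wise softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))   # 4 query vectors (e.g., 4 target positions)
K = rng.normal(size=(10, 64))  # 10 key vectors (e.g., 10 source positions)
V = rng.normal(size=(10, 64))  # one value vector per key
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
```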
Attention layers perceive their input as a collection of vectors in no particular order. There are no recurrent or convolutional layers in this model. As a result, a "positional encoding" is used to provide the model with information about the relative positions of the tokens in the sequence (see figure 2.9). The following is the formula for generating the positional encoding:

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right) \quad (2.8)$$

$$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right) \quad (2.9)$$

where pos is the position and i is the dimension.
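The positional-encoding matrix from equations (2.8) and (2.9) can be computed directly; this is a small NumPy sketch with illustrative sizes.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Build the sin/cos positional-encoding matrix from equations (2.8)-(2.9)."""
    pos = np.arange(max_len)[:, None]            # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]         # index over dimension pairs
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

print(positional_encoding(max_len=50, d_model=512).shape)  # (50, 512)
```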
Figure 2.10: Scaled Dot-Product Attention (left) and Multi-Head Attention (right) [2].
Chapter 3

Related work

Figure 3.1: Text-to-speech process.
3.1 Autoregressive models
Autoregressive models [1, 2, 27] are usually built on the encoder-decoder framework. In an encoder-decoder TTS model, the encoder receives the source sequence and maps it to hidden representations; the decoder then generates each output element conditioned on those representations and the previously generated elements. The encoder and the decoder are usually based on an RNN or a Transformer model. The attention mechanism [1] is also inserted to determine which source representations to focus on when predicting the current element, and it is a crucial component for sequence-to-sequence learning.
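The autoregressive generation pattern can be summarized in a short Python sketch. Here encode, attend, and decode_step are hypothetical placeholders standing in for whatever networks a concrete model (such as Tacotron, below) uses; they are not functions from any specific implementation.

```python
def autoregressive_tts(text, encode, attend, decode_step, max_frames=1000):
    """Generate mel frames one at a time, each conditioned on previous outputs."""
    memory = encode(text)                # encoder: source sequence -> representations
    frame, state, frames = None, None, []
    for _ in range(max_frames):
        context = attend(state, memory)  # attention: which source states to focus on
        frame, state, stop = decode_step(frame, context, state)
        frames.append(frame)             # each new frame depends on all previous ones
        if stop:                         # a stop flag/token ends generation
            break
    return frames
```

The loop makes the key trade-off visible: quality benefits from conditioning on past frames, but inference time grows with the number of output frames, which motivates the non-autoregressive models of Section 3.2.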
3.1.1 Tacotron
Tacotron (figure 3.2) is an end-to-end solution for TTS. It is essentially a sequence-to-sequence model with the well-known encoder-decoder architecture, and it also uses an attention mechanism. The model accepts characters as input and outputs the final speech's raw spectrogram, which is then transformed to a waveform by the Griffin-Lim [25] algorithm.

Figure 3.2: Tacotron model architecture [3].

The encoder's purpose is to extract stable text representations in a sequential order. It takes a character sequence expressed as one-hot encodings and produces the final representation via a stack of Pre-Nets and CBHG modules. The Pre-Net describes the non-linear transformations applied to each embedding.

The representation is passed to the decoder through content-based attention, with a recurrent layer producing the attention query at each time step. The query is combined with the context vector and delivered to a stack of GRU cells with residual connections. A second post-processing network, which includes a CBHG module, converts the decoder's output to the final waveform.
3.1.2 Tacotron 2
Tacotron 2 [4] is an improved and simplified version of Tacotron [3]. Instead of Pre-Nets and CBHG modules, the encoder now has three convolutional layers and a bidirectional LSTM. The basic additive attention mechanism was enhanced with location-sensitive attention. A Pre-Net, two uni-directional LSTMs, and a 5-layer convolutional Post-Net now make up the decoder. Following PixelCNN++ and Parallel WaveNet, a modified WaveNet is employed as the vocoder; mel-spectrograms are created and sent to the vocoder rather than linear-scale spectrograms. The Griffin-Lim algorithm of Tacotron was thus replaced with the WaveNet method. The model architecture of Tacotron 2 is shown in figure 3.3.

Figure 3.3: Tacotron 2 model architecture [4].
3.1.3 ForwardTacotron
ForwardTacotron [5] combines ideas from the FastSpeech model with the Tacotron architecture (figure 3.4). It re-implements the fundamental concept by removing the autoregressive attentive element of the recurrent Tacotron architecture, allowing it to estimate mel-spectrograms in a single forward pass.

The essential module is a length regulator, adopted from FastSpeech, which expands the phoneme embeddings according to the predicted durations. ForwardTacotron isolates the duration predictor module from the rest of the model, since this improved the mel quality. Because of the absence of attention, the memory need rises only linearly with sequence length, allowing the model to predict whole articles at once. Predictions are not only accurate, but also fast.
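Since the length regulator is the essential module here, the following NumPy sketch shows the core idea: each phoneme embedding is repeated according to its predicted duration, and a speed factor rescales the durations for voice-speed control. The function name, the rounding choice, and the speed convention are my own illustrative assumptions, not the exact ForwardTacotron or FastSpeech code.

```python
import numpy as np

def length_regulator(phoneme_embeddings, durations, speed=1.0):
    """Repeat each phoneme embedding `duration` times to match the mel length.

    speed < 1.0 stretches durations (slower speech); speed > 1.0 shortens them.
    """
    scaled = np.maximum(1, np.round(np.array(durations) / speed).astype(int))
    return np.repeat(phoneme_embeddings, scaled, axis=0)

embeddings = np.arange(12, dtype=float).reshape(3, 4)  # 3 phonemes, embedding dim 4
durations = [2, 3, 1]                                  # predicted mel frames per phoneme
print(length_regulator(embeddings, durations).shape)             # (6, 4)
print(length_regulator(embeddings, durations, speed=0.5).shape)  # (12, 4): 2x slower
```

Because the expansion is a plain repeat with no attention, the whole mel-length sequence is produced in one pass, which is what enables the single-forward-pass inference described above.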