MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
- Nguyen Tien Thanh
VIETNAMESE SPEECH SYNTHESIS FOR SOME ASSISTANT SERVICES ON MOBILE DEVICES
Department: International Research Institute MICA
MASTER OF SCIENCE THESIS IN COMPUTER SCIENCE
SUPERVISOR:
Dr Mac Dang Khoa
Hanoi – 2016
COMMITMENT
I commit that I was responsible for conducting this study. All referenced figures were extracted with clear derivation. The presented results are truthful and have not been published in any other person's work.
Nguyễn Tiến Thành
ACKNOWLEDGEMENT
Special thanks to my supervisor Dr. Mạc Đăng Khoa and my colleagues of the Speech Communication Department, MICA Institute, for the advice and encouragement they gave me, especially Assoc. Prof. Trần Đỗ Đạt for his thorough review and invaluable suggestions.
I would like to thank Mr. Nguyễn Mạnh Hà and Ms. Nguyễn Hằng Phương for their guidance in recording the corpus. I would also like to thank the many MICA members who spent much of their time on tests for my research.
I am grateful to Prof. Eric Castelli, Dr. Nguyễn Việt Sơn and MICA's directorate for providing me with the best working conditions at the MICA International Research Institute.
Finally, I owe a great deal to my parents and my younger brother for their encouragement and support. They have given me strength and motivation in my work and in my life.
Nguyễn Tiến Thành
List of figures
Figure 1-1 Representation of sound (Huang et al. 2001)
Figure 1-2 A schematic diagram of the human speech production apparatus (Huang et al. 2001)
Figure 1-3 Glottal airflow and the resulting sound pressure at the mouth (Rabiner and Juang 1993)
Figure 1-4 Waveform plot of the beginning of the utterance "It's time" (Huang et al. 2001)
Figure 1-5 Signal of sound "my speech" and its spectrogram
Figure 1-6 Speech recognition and speech synthesis (Chandra and Akila 2012)
Figure 1-7 Schematic of text-to-speech synthesis
Figure 1-8 A schematic of the construction of an articulatory speech synthesizer and how such a synthesizer may be considered to contain a model of information encoding in the speech signal (Palo 2006)
Figure 1-9 Block diagram of a synthesis-by-rule system; pitch and formants are listed as the only parameters of the synthesizer for convenience, while in practice such a system has about 40 parameters (Huang et al. 2001)
Figure 1-10 Core architecture of HMM-based speech synthesis system (Yoshimura 2002)
Figure 1-11 General HMM-based synthesis scheme (Zen et al. 2009)
Figure 1-12 A diagram of the Hunt and Black algorithm, showing one particular sequence of units, how the target cost measures a distance between a unit and the specification, and how the join cost measures a distance between two adjacent units (Taylor 2009)
Figure 2-1 Schematic diagram of Hanoi Vietnamese tones (Michaud 2004)
Figure 2-2 Base system of Vu Hui Quan, consisting of two parts: a training part and a synthesis part (Quan and Nam 2009)
Figure 2-3 Vietnamese speech recognition system (Vu et al. 2006)
Figure 2-4 Non-uniform unit selection model (Van Do et al. 2011)
Figure 2-5 Parse tree to search (Van Do et al. 2011)
Figure 3-1 Target cost of target units and candidate units (Tran 2007)
Figure 3-2 Sentence split into phrases and syllables
Figure 3-3 Average length of syllables in different positions (Tran 2007)
Figure 3-4 Average length of syllables (Tran 2007)
Figure 3-5 Signal of the "giỏi" syllable in two different positions
Figure 3-6 Sub-cost based on the difference in position of phrase
Figure 3-7 Sub-cost based on the difference in context of preceding syllable and following syllable
Figure 3-8 Syllable "Quanh" is composed of four phonemes
Figure 3-9 Sub-cost based on the difference in context of preceding phoneme and following phoneme
Figure 3-10 Acoustic units network
Figure 3-11 The algorithm of separating a sentence into phrases as long as possible
Figure 3-12 Finding the longest phrase in the database
Figure 3-13 Search space before applying the acoustic units network
Figure 3-14 Search space after applying the acoustic units network
Figure 3-15 Finding candidates of the word "chúng tôi"
Figure 4-1 Interface of Adobe Audition 3.0
Figure 4-2 Interface of Praat
Figure 4-3 MOS test result by domain
Figure 4-4 Perception test
Figure 4-5 Result of the perception test
Figure 4-6 Speed of synthesis process of two systems
List of tables
Table 1-1 Types of some popular units
Table 2-1 The concluded structure of Vietnamese syllables (Tran 2003)
Table 2-2 Symbols of Vietnamese tones
Table 2-3 Advantages and disadvantages of the two synthesis systems of Quan and Thao
Table 3-1 Position difference and cost value (min is better); the target unit is at the beginning or end of the sentence
Table 3-2 Position difference and cost value (min is better); the target unit is both the beginning and end, or in the middle, of the sentence
Table 3-3 Phoneme types in Vietnamese (Tran 2007)
Table 3-4 Direction and complexity of Vietnamese tones
Table 4-1 Number of sentences and distinct syllables in each domain
Table 4-2 Tags and meanings of the XML file
Contents
COMMITMENT
ACKNOWLEDGEMENT
List of figures
List of tables
Introduction
Chapter 1 Overview of speech processing and text-to-speech
1.1 Speech and speech processing
1.1.1 Sound
1.1.2 Human vocal mechanism
1.1.3 Speech representation in the time and frequency domains
1.1.4 Speech processing
1.2 Text-To-Speech
1.2.1 Introduction
1.2.2 Speech synthesis techniques
1.2.3 Articulatory synthesis
1.2.4 Formant synthesis
1.2.5 Concatenative synthesis
1.2.6 Statistical Parametric synthesis
1.3 From concatenative synthesis to unit selection synthesis
1.3.1 Extending concatenative synthesis
1.3.2 The algorithm of Hunt and Black
1.3.3 Speech synthesis based on non-uniform units selection
1.4 Conclusion
Chapter 2 Text-to-speech for Vietnamese
2.1 Overview of Vietnamese language and phonology
2.1.1 Characteristics
2.1.2 Vietnamese syllable structure
2.2 Overview of text-to-speech in Vietnamese
2.3 Discussion and proposal
2.4 Conclusion
Chapter 3 Improvement of Non-uniform unit selection technique for Vietnamese Text-to-speech
3.1 Quality improvement: using target costs for unit selection
3.1.1 Target costs in Vietnamese synthesis
3.1.2 Separating sentence into phrases
3.1.3 Target cost computation
3.2 Performance improvement: using acoustic units network
3.2.1 Acoustic units network
3.2.2 Separating sentence into the longest phrases
3.2.3 Searching candidates
3.3 Conclusion
Chapter 4 Implementations and evaluation
4.1 System overview
4.2 Building database
4.2.1 Text database building
4.2.2 Speech corpus recording
4.2.3 Database processing
4.3 Evaluation
4.3.1 Quality of synthesized speech
4.3.2 Target cost improvement
4.3.3 Performance
4.4 Conclusion
Chapter 5 Conclusions and perspectives
References
Introduction
Context
Most people have heard some synthetic voices in their life. We have experienced them in a number of situations. For instance, some telephone information systems have automated speech response, or speech synthesis is often used as an aid to the disabled.
Text-to-speech (TTS) systems have been integrated into many applications. One of the useful applications is a reading application for blind people, which can read any text from a book and convert it into speech. Known as TalkBack, this kind of application has been developed and integrated by Google on Android OS. TalkBack can read text displayed on the screens of Android devices to help blind people use these devices easily.
The mainstream adoption of TTS has long been limited by its quality. In recent years, considerable advances in quality have made TTS systems more common. Probably the main use of TTS today is in call-centre automation, where a user calls to pay an electricity bill or book some travel and conducts the entire transaction through an automatic dialogue system. Beyond this, TTS systems have been used for reading news stories, weather reports, travel directions and a wide variety of other applications.
In recent times, smart devices such as smartphones and tablets are increasingly popular and play an important role in our life. They can be used in education, medicine, transport, communication, and so on. In Vietnam, some TTS systems have been studied and developed for mobile devices, such as vnSpeak, Viettel Speak, etc. At the MICA international research institute, researchers have also developed TTS systems integrated into a number of applications such as VIVA, VIVAVU and VIQ on Google Play. However, these systems still have some limitations, such as poor voice quality and long response time.
Our goal is to build a system capable of speaking from text that can be applied to smart devices and overcomes the mentioned weaknesses. We hope that this system will bring benefits to our lives.
Objective of this thesis
This thesis was realized at the MICA Institute, Speech Communication Department, and its main goal is to build a high-quality Vietnamese speech synthesis system that can be integrated into electronic devices running on Android OS.
Basic theory of speech synthesis is first studied. Then, new methods to improve the quality of the existing Vietnamese synthesis system, which is intended to run on smartphones and smart devices, will be proposed.
The first task is building a speech corpus for synthesizing Vietnamese utterances. With this corpus, we can synthesize almost all syllables of Vietnamese and can apply the text-to-speech system to any Vietnamese document.
After that, based on research on Vietnamese phonetics and Vietnamese synthesis, some new costs for calculating the optimal path in speech synthesis using the unit selection technique were proposed. The costs are expected to help us choose more suitable units to synthesize an utterance.
Moreover, we also suggest using a phonetic units network to optimize the searching and selection time of candidate units.
Finally, all of this research and these suggestions will be applied to a speech synthesis system that can be embedded in assistant applications on smartphones.
Thesis structure
Chapter 1 presents basic theories of speech, giving the background of the speech signal, speech signal processing and speech synthesis.
Chapter 2 focuses on theories of speech synthesis using the unit selection technique. It also introduces current Vietnamese speech synthesis and gives suggestions.
Chapter 3 is our research on the target cost used for selecting units in speech synthesis. We also describe an acoustic unit network which is used for improving the performance of the TTS system.
In Chapter 4, our work on building the Vietnamese speech corpus is presented. Experiments for evaluating the quality of the new TTS system are also presented. The final part completes the thesis with conclusions and suggestions for further work.
Chapter 1 Overview of speech processing and text-to-speech
1.1 Speech and speech processing
In this section, we briefly review speech sound and the human speech production system. We also show how the speech signal can be represented.
1.1.1 Sound
Sound is a longitudinal pressure wave formed of compressions and rarefactions of air molecules, in a direction parallel to that of the application of energy. Compressions are zones where air molecules have been forced by the application of energy into a tighter-than-usual configuration, and rarefactions are zones where air molecules are less tightly packed.
The alternating configurations of compression and rarefaction of air molecules along the path of an energy source are sometimes described by the graph of a sine wave, as shown in Figure 1-1.
Figure 1-1 Representation of sound (Huang et al. 2001)
In this representation, crests of the sine curve correspond to moments of maximal compression and troughs to moments of maximal rarefaction. There are two important parameters, amplitude and wavelength, that describe a sine wave. Frequency (cycles per second), measured in Hertz (Hz), is also used to characterize the waveform.
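The relationship between these parameters can be illustrated with a short sketch (a hypothetical numpy example, not part of the thesis): a pure tone is completely described by its amplitude and frequency, and its wavelength follows from the speed of sound.

```python
import numpy as np

# Minimal illustration of the sine-wave description of sound.
sample_rate = 16000          # samples per second
frequency = 440.0            # Hz (cycles per second)
amplitude = 0.5              # relative pressure amplitude
duration = 0.01              # seconds

t = np.arange(int(sample_rate * duration)) / sample_rate
pressure = amplitude * np.sin(2 * np.pi * frequency * t)

speed_of_sound = 343.0       # m/s in air at about 20 degrees C
wavelength = speed_of_sound / frequency
print(f"wavelength of a {frequency} Hz tone: {wavelength:.3f} m")
```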
1.1.2 Human vocal mechanism
A schematic diagram of the human vocal mechanism is shown in Figure 1-2. The gross components of the speech production apparatus are the lungs, trachea, larynx (organ of voice production), pharyngeal cavity (throat), and oral and nasal cavities. The pharyngeal and oral cavities are typically referred to as the vocal tract, and the nasal cavity as the nasal tract. As illustrated in Figure 1-2, the human speech production apparatus consists of:
- Lungs: source of air during speech.
- Vocal cords (larynx): when the vocal folds are held close together and oscillate against one another during a speech sound, the sound is said to be voiced. When the folds are too slack or too tense to vibrate periodically, the sound is said to be unvoiced. The place where the vocal folds come together is called the glottis.
- Velum (soft palate): operates as a valve, opening to allow passage of air (and thus resonance) through the nasal cavity. Sounds produced with the flap open include m and n.
- Hard palate: a long, relatively hard surface at the roof inside the mouth, which, when the tongue is placed against it, enables consonant articulation.
- Tongue: flexible articulator, shaped away from the palate for vowels, placed close to or on the palate or other hard surfaces for consonant articulation.
- Teeth: another place of articulation used to brace the tongue for certain consonants.
- Lips: can be rounded or spread to affect vowel quality, and closed completely to stop the oral air flow in certain consonants (p, b, m).
Figure 1-2 A schematic diagram of the human speech production apparatus (Huang et al. 2001)
Air enters the lungs via the normal breathing mechanism. As air is expelled from the lungs into the trachea (or windpipe), the tensed vocal cords within the larynx are caused to vibrate (in the mode of a relaxation oscillator) by the air flow. The air flow is chopped into quasi-periodic pulses which are then modulated in frequency in passing through the pharynx (the throat cavity), the mouth cavity, and possibly the nasal cavity. Depending on the positions of the various articulators (i.e. jaw, tongue, velum, lips, mouth), different sounds are produced.
The glottal air flow (volume velocity waveform) and the resulting sound pressure at the mouth for a typical vowel sound are shown in Figure 1-3. The glottal waveform shows a gradual build-up to a quasi-periodic pulse train of air, taking about 15 ms to reach steady state. This build-up is also reflected in the acoustic waveform shown at the bottom of the figure.
Figure 1-3 Glottal airflow and the resulting sound pressure at the mouth (Rabiner and Juang 1993)
Speech is produced as a sequence of sounds. Hence, the state of the vocal cords, as well as the positions, shapes, and sizes of the various articulators, changes over time to reflect the sound being produced.
1.1.3 Speech representation in the time and frequency domains
In general, we have three ways to represent a speech signal. Firstly, we know that the speech signal is a slowly time-varying signal; thus, when it is examined over a sufficiently short period of time, its characteristics are fairly stationary. However, over long periods of time the signal characteristics change to reflect the different speech sounds being spoken.
There are several ways of labeling events in speech. One of the simplest and most straightforward is via the state of the speech-production source, the vocal cords. We use a three-state representation, which includes:
- Silence (S), where no speech is produced
- Unvoiced (U), in which the vocal cords are not vibrating, so the resulting speech waveform is aperiodic or random in nature
- Voiced (V), in which the vocal cords are tensed and therefore vibrate periodically when air flows from the lungs, so the resulting speech waveform is quasi-periodic
It should be clear that the segmentation of the waveform into well-defined regions of silence, unvoiced, and voiced signal is not exact; it is often difficult to distinguish a weak, unvoiced sound from the silence, or a weak, voiced sound from unvoiced sounds or even silence
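As a rough illustration of this three-state labelling, the sketch below classifies fixed-size frames by short-time energy and zero-crossing rate. The thresholds and frame size are invented for the example and are not taken from the thesis; as noted above, a real segmentation is never exact.

```python
import numpy as np

def classify_frames(signal, sample_rate, frame_ms=25,
                    energy_floor=1e-4, zcr_threshold=0.15):
    """Naive silence/unvoiced/voiced labelling of fixed-size frames.

    Thresholds are illustrative only; real S/U/V segmentation of speech
    is never exact.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    labels = []
    for start in range(0, len(signal) - frame_len, frame_len):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
        if energy < energy_floor:
            labels.append("S")     # silence: negligible energy
        elif zcr > zcr_threshold:
            labels.append("U")     # unvoiced: noisy, many zero crossings
        else:
            labels.append("V")     # voiced: quasi-periodic, few crossings
    return labels
```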
Figure 1-4 Waveform plot of the beginning of the utterance "It's time" (Huang et al. 2001)
An alternative way of characterizing the speech signal and representing the information associated with the sounds is via a spectral representation. Perhaps the most popular representation of this type is the sound spectrogram, in which a three-dimensional representation of the speech intensity, in different frequency bands, over time is portrayed. Figure 1-5 shows an example of the speech representation by spectrogram. In this figure, the spectral intensity at each point in time is indicated by the intensity (darkness) of the plot at a particular analysis frequency.
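For illustration, a spectrogram of this kind can be computed with a short-time Fourier transform. The sketch below uses scipy with an artificial two-tone signal rather than real speech; it returns the time-frequency intensity matrix whose darkness would be plotted.

```python
import numpy as np
from scipy.signal import spectrogram

# Intensity over time in each frequency band, as described above.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 2300 * t)

freqs, times, power = spectrogram(signal, fs=sample_rate,
                                  nperseg=400, noverlap=240)
log_power = 10 * np.log10(power + 1e-12)   # darkness in the plot ~ dB level
print(log_power.shape)                     # (frequency bins, time frames)
```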
A third way of representing the time-varying signal characteristics of speech is via a parameterization of the spectral activity based on the model of speech production. Because the human vocal tract is essentially a tube, or a concatenation of tubes, of varying cross-sectional area that is excited either at one end (by the vocal cord puffs of air) or at a point along the tube (corresponding to turbulent air at a constriction), acoustic theory tells us that the transfer function of energy from the excitation source to the output can be described in terms of the natural frequencies or resonances of the tube.
Such resonances are called formants for speech, and they represent the frequencies that pass the most acoustic energy from source to output. Typically, there are about three resonances of significance, for a human vocal tract, below about 3500 Hz. There is a good correspondence between the estimated formant frequencies and the points of high spectral energy in a spectrogram. The formant frequency representation is a highly efficient, compact representation of the time-varying characteristics of speech. The major problem, however, is the difficulty of reliably estimating the formant frequencies for low-level voiced sounds, and the difficulty of defining the formants for unvoiced or silent regions. As such, this representation is more of theoretical than practical interest.
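One classical way to estimate formants, sketched below under the usual source-filter assumptions, is to fit linear-prediction (LPC) coefficients to a voiced frame and take the angles of the complex roots of the predictor polynomial. The LPC order and frequency limits are illustrative choices, and, as noted above, such estimates are unreliable for weak or unvoiced regions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def estimate_formants(frame, sample_rate, order=10):
    """Rough formant estimation for one voiced frame via LPC root-finding."""
    frame = frame * np.hamming(len(frame))
    autocorr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Solve the normal equations R a = r for the LPC predictor coefficients.
    a = solve_toeplitz(autocorr[:order], autocorr[1:order + 1])
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]        # keep one of each conjugate pair
    freqs = np.angle(roots) * sample_rate / (2 * np.pi)
    # Keep candidates inside a plausible formant range.
    return sorted(f for f in freqs if 90 < f < sample_rate / 2)
```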
Figure 1-5 Signal of sound "my speech" and its spectrogram
1.1.4 Speech processing
Speech processing brings a growing number of language processing applications. We already saw examples in the form of real-time dialogue between a user and a machine: voice-activated telephone servers, embedded conversational agents to control devices such as jukeboxes, VCRs, and so on. In such systems, a speech recognition module transcribes the user's speech into a word stream. The character flow is then processed by a language engine dealing with syntax and semantics, and finally by the back-end application program.
Figure 1-6 Speech recognition and speech synthesis (Chandra and Akila 2012)
A speech synthesizer converts the resulting answers (strings of characters) into speech for the user. Figure 1-6 shows how speech processing is located within a language processing architecture, here a natural language interface to a database.
Speech recognition is also an application in itself, as with speech dictation systems. Such systems enable users to transcribe speech into written reports or documents, without the help of a keyboard. Most speech dictation systems have no other module than the speech engine and a statistical language model. They do not include further syntactic or semantic layers.
Within the scope of this thesis, we focus on studying and researching speech synthesis.
1.2 Text-To-Speech
1.2.1 Introduction
This field of study is known both as speech synthesis, that is, the "synthetic" (computer) generation of speech, and as text-to-speech or TTS, the process of converting written text into speech.
Text-to-speech systems have an enormous range of applications. Their first real application was in reading systems for the blind, where a system would read some text from a book and convert it into speech. Today, quite sophisticated systems exist that facilitate human-computer interaction for the blind, in which the TTS can help the user navigate around a windows system (Taylor 2009).
Figure 1-7 Schematic of text-to-speech synthesis
As seen in the picture the synthesis starts from text input Nowadays this may
be plain text or marked-up text e.g HTML or something similar If the text uses some sort of mark-up it may already contain some or all of the information made available by the text and linguistic analysis stage Regardless of the quality of the input text, after this stage we will have a description of the text on the phonetic level
The first stage in the synthesis phase is to take the words we have just found and encode them as phonemes We do this, because this provides a more compact representation for further synthesis processes to work on The words, phonemes and phrasing form an input specification to the unit selection module Actual synthesis
is performed by accessing a database of pre-recorded speech so as to find units contained there that match the input specification as closely as possible.
The second stage is prosody and the third stage is speech signal generation. During the prosody stage, linguistic information is used to generate F0 contours, timing information for the phones, etc. Finally, the synthesized speech itself is generated from these specifications. If we are dealing with normal TTS, the generated speech will take the form of an audio signal.
The pre-recorded speech can take the form of a database of waveform fragments, and when a particular sequence of these is chosen, signal processing is used to stitch them together to form a single continuous output speech waveform.
This is essentially how (one type of) modern TTS works. One may well ask why it takes so many pages to explain this, but as we shall see, each stage of this process can be quite complicated, and so we give extensive background and justification for the approaches taken. Additionally, while it is certainly possible to produce a system that speaks something with the above recipe, it is considerably more difficult to create a system that consistently produces high-quality speech no matter what the input is.
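The flow just described can be made concrete with a toy sketch. Everything in it (the lexicon entry, the prosody rule, the field names) is invented for illustration and is not the front end used in this thesis; it only shows how text analysis, phonemization and prosody produce the specification that unit selection consumes.

```python
from dataclasses import dataclass

@dataclass
class TargetUnit:
    phoneme: str
    duration_ms: float
    f0_hz: float

TOY_LEXICON = {"time": ["t", "ay", "m"]}          # hypothetical pronunciation

def text_analysis(text):
    """Very crude tokenization standing in for full text analysis."""
    return [w.strip(".,!?'").lower() for w in text.split()]

def to_phonemes(words):
    """Look words up in the toy lexicon, falling back to letters."""
    phones = []
    for w in words:
        phones.extend(TOY_LEXICON.get(w, list(w)))
    return phones

def add_prosody(phones):
    """Attach a declining F0 contour and fixed durations (a stand-in rule)."""
    n = max(len(phones) - 1, 1)
    return [TargetUnit(p, duration_ms=80.0, f0_hz=130.0 - 20.0 * i / n)
            for i, p in enumerate(phones)]

specification = add_prosody(to_phonemes(text_analysis("time")))
print(specification)   # the input specification handed to unit selection
```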
1.2.2 Speech synthesis techniques
According to the development history, speech synthesis techniques can be divided into generations. The techniques researched and developed in the first generation are formant synthesis and articulatory synthesis.
Formant synthesis was the first genuine synthesis technique to be developed and was the dominant technique until the early 1980s. Formant synthesis is often called synthesis by rule.
Synthesis systems by concatenation are often collectively called second-generation synthesis systems.
However, in concatenative synthesis we can never collect enough data to cover all the effects we wish to synthesize, and often the coverage we have in the database is very uneven. Furthermore, the concatenative approach always limits us to recreating what we have recorded; in a sense, all we are doing is reordering the original data.
An alternative is to use statistical, machine-learning techniques to infer the specification-to-parameter mapping from data. While this and the concatenative approach can both be described as data-driven, in the concatenative approach we are effectively memorizing the data, whereas in the statistical approach we are attempting to learn the general properties of the data.
While many approaches to statistical synthesis are possible, most work has focused on using hidden Markov models (HMMs) or deep neural networks (DNNs). These and the unit-selection techniques are termed third-generation techniques.
This thesis focuses on a technique of the third generation, the unit selection technique.
1.2.3 Articulatory synthesis
Perhaps the most obvious way to synthesize speech is to try a direct simulation of human speech production. This approach is called articulatory synthesis and is actually the oldest, in the sense that the famous talking machine of von Kempelen can be seen as an articulatory synthesizer (Von 1791).
Figure 1-8 A schematic of the construction of an articulatory speech synthesizer and how such a synthesizer may be considered to contain a model of information encoding in the speech signal (Palo 2006)
In practice, acquiring data to determine rules and models is very difficult. Mimicking the human system closely can be very complex and computationally intractable.
Because of these difficulties, there is little engineering work in articulatory synthesis, but it is central in other areas such as speech production research, articulator physiology and audio-visual or talking-head synthesis.
1.2.4 Formant synthesis
Formant synthesis was the first genuine synthesis technique to be developed and was the dominant technique until the early 1980s. Formant synthesis is often called synthesis-by-rule. As we shall see, most formant synthesis techniques do in fact use rules of the traditional form, but data-driven techniques have also been used.
Figure 1-9 Block diagram of a synthesis-by-rule system. Pitch and formants are listed as the only parameters of the synthesizer for convenience; in practice, such a system has about 40 parameters (Huang et al. 2001)
Summary
- Formant synthesis works by using individually controllable formant filters, which can be set to produce accurate estimations of the vocal-tract transfer function. An impulse train is used to generate voiced sounds and a noise source to generate obstruent sounds. These are then passed through the filters to produce speech (a minimal sketch of this source-filter idea is given below).
- The parameters of the formant synthesizer are determined by a set of rules concerning the phone characteristics and phone context.
- In general, formant synthesis produces intelligible but not natural-sounding speech.
- It can be shown that very natural speech can be generated so long as the parameters are set very accurately. Unfortunately, it is extremely hard to do this automatically.
- The inherent difficulty and complexity in designing formant rules by hand has led to this technique largely being abandoned for engineering purposes.
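The sketch below illustrates the source-filter idea from the summary: an impulse train (the voiced source) is passed through a cascade of second-order resonators acting as formant filters. The formant frequencies and bandwidths are illustrative values, not rules from any particular synthesizer.

```python
import numpy as np
from scipy.signal import lfilter

def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Second-order IIR 'formant' filter: one resonance per call."""
    r = np.exp(-np.pi * bandwidth_hz / sample_rate)
    theta = 2 * np.pi * center_hz / sample_rate
    a = [1.0, -2 * r * np.cos(theta), r * r]      # poles set the resonance
    b = [sum(a)]                                   # unity gain at DC
    return lfilter(b, a, signal)

sample_rate = 16000
f0 = 120.0                                         # pitch of the impulse train
source = np.zeros(int(0.3 * sample_rate))
source[::int(sample_rate / f0)] = 1.0              # voiced source: impulse train

# Illustrative formant targets roughly in the range of an /a/-like vowel.
speech = source
for formant_hz, bandwidth_hz in [(700, 110), (1200, 120), (2600, 160)]:
    speech = resonator(speech, formant_hz, bandwidth_hz, sample_rate)
speech /= np.max(np.abs(speech))                   # normalized output waveform
```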
1.2.5 Concatenative synthesis
Concatenative synthesis produces speech by joining segments of pre-recorded natural speech, so each individual segment can sound very natural. Unfortunately, this is equivalent to assembling an automobile with parts of different colors: each part is very good, yet there is a color discontinuity from part to part that makes the whole automobile unacceptable. Speech segments are greatly affected by coarticulation (Olive et al. 1993), so if we concatenate two speech segments that were not adjacent to each other, there can be spectral or prosodic discontinuities. Spectral discontinuities occur when the formants at the concatenation point do not match. Prosodic discontinuities occur when the pitch at the concatenation point does not match. A listener rates as poor any synthetic speech that contains large discontinuities, even if each segment is very natural. Thus, when designing a concatenative speech synthesis system we need to address the following issues:
1. What type of speech segment to use? We can use diphones, syllables, phonemes, words, phrases, etc.
2. How to design the acoustic inventory, or set of speech segments, from a set of recordings? This includes excising the speech segments from the set of recordings as well as deciding how many are necessary. This is similar to the training problem in speech recognition.
3. How to select the best string of speech segments from a given library of segments, given a phonetic string and its prosody? There may be several strings of speech segments that produce the same phonetic string and prosody. This is similar to the search problem in speech recognition.
4. How to alter the prosody of a speech segment to best match the desired output prosody?
Generally, these concatenative systems suffer from great variability in quality: often they can offer excellent quality in one sentence and terrible quality in the next one. If enough good units are available, a given test utterance can sound almost as good as a recorded utterance. However, if several discontinuities occur, the synthesized utterance can have very poor quality. While synthesizing arbitrary text is still a challenge with these techniques, for restrictive domains this approach can yield excellent quality. We examine all these issues in the following sections.
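A minimal concatenation routine, assuming the segments have already been chosen, might join waveform fragments with a short linear cross-fade, as sketched below; a cross-fade hides only small mismatches, and the spectral and prosodic discontinuities discussed above remain the real problem.

```python
import numpy as np

def concatenate(segments, sample_rate, fade_ms=5.0):
    """Join pre-recorded waveform segments with a short linear cross-fade.

    Segments are assumed to be longer than the cross-fade region.
    """
    fade = int(sample_rate * fade_ms / 1000)
    out = np.asarray(segments[0], dtype=float)
    ramp = np.linspace(0.0, 1.0, fade)
    for seg in segments[1:]:
        seg = np.asarray(seg, dtype=float)
        overlap = out[-fade:] * (1 - ramp) + seg[:fade] * ramp
        out = np.concatenate([out[:-fade], overlap, seg[fade:]])
    return out
```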
We define a unit as an abstract representation of a speech segment, such as its phonetic label, whereas we use instance for a speech segment from an utterance that belongs to the same unit. Thus, a system can keep several instances of a given unit and select among them to better reduce the discontinuities at the boundaries. This abstract representation consists of the unit's phonetic transcription at the minimum, in such a way that the concatenation of a set of units matches the target phonetic string. In addition to the phonetic string, this representation can often include prosodic information.
1.2.6 Statistical Parametric synthesis
In the training process, state duration densities are typically modeled by the Gaussian distribution and the Gamma distribution. They are estimated from statistical variables obtained at the last iteration of the forward-backward algorithm.
In the synthesis process, an inverse operation of speech recognition is performed. First, a given word sequence is converted into a context-dependent label sequence, and then the utterance HMM is constructed by concatenating the context-dependent HMMs according to the label sequence. Second, the speech parameter generation algorithm generates the sequences of spectral and excitation parameters from the utterance HMM. Finally, a speech waveform is synthesized from the generated spectral and excitation parameters using excitation generation and a speech synthesis filter (Zen et al. 2009), that is, a vocoder with a source-excitation/filter model.
Figure 1-11 General HMM-based synthesis scheme (Zen et al 2009)
Figure 1-11 illustrates the general scheme of HMM-based synthesis (Zen et al. 2009). In an HMM-based TTS system, a feature system is defined and a separate model is trained for each unique feature combination. Spectrum, excitation, and duration are modeled simultaneously in a unified framework of HMMs because they have their own context dependency. Their parameter distributions are clustered independently and contextually by using phonetic decision trees, due to the combinatorial explosion of contextual features. The speech parameter generation is actually the concatenation of the models corresponding to the full context label sequence, which itself has been predicted from text. Before generating parameters, a state sequence is chosen using the duration model. "This determines how many frames will be generated from each state in the model. This would clearly be a poor fit to real speech, where the variations in speech parameters are much smoother."
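The role of the duration model in generation can be shown with a toy example (the numbers are invented and the parameters are one-dimensional): each state contributes as many frames as the duration model assigns, and a naive generator that simply copies the state means produces exactly the stepwise, unsmooth trajectories mentioned in the quotation, which real systems avoid by also modeling dynamic features.

```python
import numpy as np

state_means = np.array([1.2, 0.4, -0.8])      # one spectral parameter per state
state_durations = np.array([4, 6, 3])         # frames assigned by the duration model

trajectory = np.repeat(state_means, state_durations)
print(trajectory)   # piecewise-constant parameter track over 13 frames
```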
Summary of HMM synthesis technique
- HMM-based speech synthesis currently produces speech with remarkable fluidity (smoothness), but rather poor voice quality.
- It has, however, several important potential advantages over unit selection. First, its use of context clustering is far more flexible than that of unit selection, since it allows for the creation of separate trees for spectral parameters, F0 and duration. Second, its coverage of the acoustic space is better, given the generative capability of the HMM/COC/Gaussian models. Even more importantly, it embodies a complete model of natural speech with a very limited footprint (1 MB). Last but not least, it provides a natural framework for voice modification and conversion.
- Recently, statistical parametric synthesis has been adapted to the HMM-based synthesis of streams of articulatory parameters, instead of vocal-tract spectrum parameters. In this case, a mapping function is required from the articulatory to acoustic domains. This attempt to unify statistical and articulatory approaches is very promising.
1.3 From concatenative synthesis to unit selection synthesis
1.3.1 Extending concatenative synthesis
The observations about the weaknesses of second-generation synthesis led to the development of a range of techniques known collectively as unit selection. These use a richer variety of speech, with the aim of capturing more natural variation and relying less on signal processing. The idea is that for each basic linguistic type we have a number of units, which vary in terms of prosody and other characteristics. During synthesis, an algorithm selects one unit from the possible choices, in an attempt to find the best overall sequence of units which matches the specification.
A progression can be identified from the second-generation techniques to full-blown unit selection. With the realization that having exactly one example (i.e. one unit) of each diphone was limiting the quality of the synthesis, the natural course of action was to store more than one unit. Again, the natural way to do this is to consider features beyond pitch and timing (e.g. stress or phrasing) and to have one unit for each of the extra features.
So, for example, for each diphone we could have a stressed and unstressed version and a phrase-final and non-phrase-final version. So, instead of the type of specification used for second-generation systems, we include additional linguistic features relating to stress, phrasing and so on.
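As a purely hypothetical illustration (the thesis does not define this format), a second-generation specification is little more than a diphone string, whereas the extended specification attaches linguistic features such as stress and phrase position to each item:

```python
# Hypothetical example only, to contrast the two kinds of specification.
plain_spec = ["sil-t", "t-ay", "ay-m", "m-sil"]          # diphone names alone

extended_spec = [
    {"diphone": "t-ay", "stress": True, "phrase_position": "final"},
    {"diphone": "ay-m", "stress": True, "phrase_position": "final"},
]
print(plain_spec, extended_spec)
```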
One way of realizing this is as a direct extension of the original diphone principle. Instead of recording and analyzing one version of each diphone, we now record and analyze one version for each combination of specified features. In principle, we can keep on expanding this methodology, so that, for example, if we wish to have phrase-initial, -medial and -final units of each diphone, or a unit for every type or variation of pitch accent, we simply design and record the data we require.
As we use more features, we see that in practical terms the approach becomes increasingly difficult. This is because we now have to collect significantly more data and do so in just such a way as to collect exactly one of each feature value. Speakers cannot of course utter specific diphones in isolation; rather, they must do so in carrier words or phrases. This has the consequence that the speaker is uttering speech in the carrier phrases that is not part of the required list of effects. If we adhere strictly to this paradigm, we should throw this extra speech away, but this seems wasteful. The unit-selection approach offers a solution to both these problems, which enables us to use the carrier speech and also lessens the problems arising from designing and recording a database that creates a unit for every feature value.
In unit selection, the idea is that we obtain a database and perform the analysis such that potentially the entire database can be used as units in synthesis. Systems vary regarding the degree to which the content of the database is designed. In some systems, an approach similar to that just outlined above is taken, in which the words, phrases or sentences to be spoken are carefully designed so as to elicit a specific range of required feature values. Any extra speech is also added to the database as a beneficial side effect. At the other extreme, we can take any database (designed or not), analyze it and take all the units we find within it as our final unit database. The difference is really one of degree, since in both cases we will end up with an arbitrary number of each of the features we want; and, depending on how rich the feature set we use is, we may end up with many cases of missing units, that is, feature combinations that we may require at synthesis time but for which there are no examples in the database. This means that we need some technique for choosing amongst those units which match the specification and for dealing with cases in which an exact matching of features is not possible.
A further issue concerns how we concatenate units in unit selection. Recall that in second-generation synthesis the diphones were specifically designed to join together well, in that they were all taken from relatively neutral phonetic contexts such that, when two diphones were joined, the left side of the first diphone could be relied upon to join well with the right side of the second diphone. The whole point of extending the range of units on offer is to increase variability, but this has the side effect of increasing the variability at the unit edges. This results in a situation in which we cannot rely on the units always joining well, so steps must be taken to ensure that only unit combinations that will result in good joins are used.
Unit selection is made possible by the provision of a significantly larger database than with second-generation techniques, and in fact it is clearly pointless having a sophisticated selection system if the choice of units is very limited. With a large database we often find that long contiguous sections of speech are chosen, and this is one of the main factors responsible for the very high quality of the best utterances. Often in unit selection no signal-processing modification is performed, and we refer to this approach as pure unit selection. In fact, an alternative view of unit selection is that it is a resequencing algorithm, which simply cuts up speech and rearranges it. Thinking of unit selection in this way can be helpful because it leads us to the principle of least modification.
This states that the naturalness of the original database is of course perfect, and that any modification we perform, whether cutting, joining or using signal processing, runs the risk of making the original data sound worse. Hence our aim should be to meet the specification by rearranging the original data in as few ways as possible, so as to try to preserve the "perfect" quality of the original.
1.3.2 The algorithm of Hunt and Black
Various proposals were put forth in answer to these problems of managing larger databases of units, enabling selection within a class of units, coping with missing units and ensuring good joins (Iwahashi et al. 1993), (Nakajima and Hamada 1988), (Sagisaka 1988). However, in 1996 Andrew Hunt and Alan Black (Hunt and Black 1996) proposed a general solution to the problem, which was the culmination of many years of unit-selection work at ATR labs. In this (now-classic) paper, Hunt and Black put forward both a general framework for unit selection and specific algorithms for calculating the various components required by the framework.
Figure 1-12 A diagram of the Hunt and Black algorithm, showing one particular sequence of units, how the target cost measures a distance between a unit and the specification, and how the join cost measures a distance between two adjacent units (Taylor 2009)
For easy comparison with second-generation techniques, we will assume that we also use diphones in unit selection, but a wide variety of other types is possible.
As before, the specification is a list of diphone items $S = \langle s_1, s_2, \ldots, s_T \rangle$, each described by a feature structure. The database is a set of diphone units $U = \{u_1, u_2, \ldots, u_M\}$, each of which is also described by a feature structure. The feature system is the set of features and values used to describe both the specification and the units, and this is chosen by the system builder in such a way as to satisfy a number of requirements. The purpose of the unit-selection algorithm is to find the best sequence of units $\hat{U}$ from the database $U$ that satisfies the specification $S$.
In the Hunt and Black framework, unit selection is defined as a search through every possible sequence of units to find the best possible sequence of units. There are several options as to how we define "best", but in the original Hunt and Black formulation this is defined as the lowest cost, as calculated from two local components. First, we have the target cost, $T(s_t, u_t)$, which is a cost or distance between the specification and a unit in the database. This cost is calculated from specified values in the feature structure of each unit. Second, we have the join cost, $J(u_t, u_{t+1})$, which is a measure of how well two units join (low values mean good joins). This is calculated for a pair of units in the database, again from specified values in the units' feature structures. The total combined cost for a candidate unit sequence $U = \langle u_1, \ldots, u_T \rangle$ for a sentence is given by

$$C(U, S) = \sum_{t=1}^{T} T(s_t, u_t) + \sum_{t=2}^{T} J(u_{t-1}, u_t)$$

and the goal of the search is to find the single sequence of units $\hat{U}$ which minimizes this cost:

$$\hat{U} = \arg\min_{U} \left\{ \sum_{t=1}^{T} T(s_t, u_t) + \sum_{t=2}^{T} J(u_{t-1}, u_t) \right\}$$
This search can be performed as a Viterbi-style search (Viterbi 1967)
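A generic sketch of this search is given below. It is not the implementation used later in the thesis; the target_cost and join_cost arguments stand for whatever cost functions a particular system defines, and the code simply finds the candidate sequence minimizing the total cost of the two equations above with dynamic programming.

```python
import numpy as np

def unit_selection(spec, candidates, target_cost, join_cost):
    """Viterbi-style search for the unit sequence minimising the sum of
    target costs plus join costs (Hunt and Black 1996).

    spec        : list of specification items s_1..s_T
    candidates  : candidates[t] is the list of database units for item t
    target_cost : function T(s_t, u) -> float
    join_cost   : function J(u_prev, u) -> float
    """
    T = len(spec)
    best = [[target_cost(spec[0], u) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for t in range(1, T):
        row, ptr = [], []
        for u in candidates[t]:
            costs = [best[t - 1][k] + join_cost(prev, u)
                     for k, prev in enumerate(candidates[t - 1])]
            k_min = int(np.argmin(costs))
            row.append(costs[k_min] + target_cost(spec[t], u))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Trace back the lowest-cost path.
    k = int(np.argmin(best[-1]))
    path = [candidates[-1][k]]
    for t in range(T - 1, 0, -1):
        k = back[t][k]
        path.append(candidates[t - 1][k])
    return list(reversed(path))
```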
The power of the Hunt and Black formulation as given in the second equation is that it is a fully general technique for unit selection. We can generalize the idea of target and join costs in terms of target and join functions, which don't necessarily have to calculate a cost. The target function is so called because it gives a measure of how well a unit in the database matches the "target" given by the specification. The join function again can accommodate a wide variety of formulations, all of which can encompass the notion of how well two units join. Finally, the formulation of the algorithm as a search through the whole space of units allows us to ensure that the algorithm has found the optimal set of units for the definitions of target and join functions that we have given.
Of course, the very general nature of this algorithm means that there is enormous scope in how we specify the details. In the next sections, we will discuss the issues of what features to use, how to formulate the target function and the join function, the issue of the choice of base type and, finally, search issues.
1.3.3 Speech synthesis based on non-uniform units selection
1.3.3.1 Basic unit types
The basic unit type chosen in second-generation TTS systems was often the diphone, since diphones often produced good joins. In unit selection, the greater variability in the units means that we can't always rely on diphones joining well, so the reasons for using diphones are somewhat less convincing. Indeed, from a survey of the literature, we see that almost every possible kind of base type has been used.
In the following list we describe each type by its most common name, cite some systems that use this base type, and give some indication of the number of each type, where we assume that we have N unique phones and M unique syllables in our pronunciation system (a small worked example follows the list).
- frames: Individual frames of speech, which can be combined in any order (Toshio Hirai 2004).
- states: Parts of phones, often determined by the alignment of HMM states (Donovan and Eide 1998), (Donovan and Woodland 1995).
- half-phones: These are units that are "half" the size of a phone. Thus, they are either units that extend from the phone boundary to a mid-point (which can be defined in a number of ways), or units that extend from this mid-point to the end of the phone. There are 2N different half-phone types (Möhler and Conkie 1998).
- diphones: These units extend from the mid-point of one phone to the mid-point of the next phone. There are just fewer than N² diphones, since not all combinations occur in practice (e.g. /h-ng/) (Clark et al. 2004), (Coorman et al. 2000), (Tanya Lambert 2004), (Peter Rutten 2002).
- phones: Phones or phonemes as normally defined. There are N of these (Taylor and Black 1998), (Hunt and Black 1996), (T Saito 1996).
- demi-syllables: The syllable equivalent of half-phones, that is, units that either extend from a syllable boundary to the mid-point of a syllable (the middle of the vowel) or extend from this mid-point to the end of the syllable. There are 2M demi-syllables (Steve Pearson 1998).
- di-syllables: Units that extend from the middle of one syllable to the middle of the next. There are M² di-syllables (Chen and others 2003).
- syllables: Syllables as normally defined (Jindrich Matousek 2005), (T Saito 1996), (Zhenli Yu 2004).
- words: Words as normally defined (Portele et al. 1996), (Stöber et al. 1999), (Christos Vosnidis 2001).
- phrases: Phrases as normally defined (Donovan et al. 1999).
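For a rough sense of scale (the figures are hypothetical and are not the Vietnamese inventory counts used later in this thesis), a language with N = 40 phones and M = 3,000 syllables would have 40 phone types, 2 × 40 = 80 half-phones, just under 40² = 1,600 diphones, 2 × 3,000 = 6,000 demi-syllables and up to 3,000² = 9,000,000 di-syllable types; full coverage of the larger unit types in a recorded database is therefore rarely achievable.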
The reasons why a developer chooses one base type over another are varied, and the choice is often simply down to personal preference. Many systems use the "family" of units that have joins in the middle of phones (half-phones, diphones, demi-syllables), because these are thought to produce better joins. Sometimes the phonology of the language is the main consideration. European languages are often considered phoneme-based, therefore phones, diphones and half-phones are normal choices. Chinese, by contrast, is often considered syllable-based, and many systems in that language use the syllable or a variant as their main unit (Chu et al. 2003), (Meng et al. 2002), (Xu et al. 2003).
In addition to homogeneous systems, which use a single type of unit, we have heterogeneous systems, which use a mixture of two or more types of unit. One reason for using different types of units is for dealing with cases when we have a primary unit of one type that requires some units of another type for joining purposes; a good example of this is the phrase-splicing system of Donovan (Donovan et al. 1999), which concatenates whole canned phrases with smaller units for names. The term non-uniform unit synthesis was popular in the early development of unit selection, since it was seen that the explicit use of long sequences of contiguous speech was the key to improving naturalness.
1.3.3.2 Non-uniform units selection
For a concatenative speech synthesizer, there are several possible choices for basic synthesis units, such as phonemes, diphones, demi-syllables, syllables, words or phrases. Both smaller units and larger units have advantages and disadvantages.
Table 1-1 Types of some popular units (columns: type of unit, length, number of concatenation points, probability of being found in the database)
Table 1-1 shows the advantages and disadvantages of each type of unit. Non-uniform unit selection will optimize their advantages: it reduces the number of concatenation points by using phrases and syllables, and ensures the ability to synthesize almost any syllable in Vietnamese by using half-syllables. But the complexity of this approach is that the use of three types of units requires a flexible process to switch between the types of units.
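The exchange between unit types can be pictured as a simple fallback strategy, sketched below. The database interface and unit labels are hypothetical, and the real selection in Chapter 3 uses costs rather than this greedy rule; the sketch only shows the preference order: a whole stored phrase, then syllables, then half-syllables so that any syllable can still be produced.

```python
from collections import namedtuple

ToyDatabase = namedtuple("ToyDatabase", ["phrases", "syllables"])

def select_units(phrase_words, database):
    """Greedy sketch of non-uniform unit selection over three unit types."""
    phrase = " ".join(phrase_words)
    if phrase in database.phrases:
        return [("phrase", phrase)]
    units = []
    for syllable in phrase_words:
        if syllable in database.syllables:
            units.append(("syllable", syllable))
        else:
            units.append(("initial_half", syllable))
            units.append(("final_half", syllable))
    return units

db = ToyDatabase(phrases={"xin chào"}, syllables={"xin", "chào", "các"})
print(select_units(["xin", "chào"], db))     # whole phrase found
print(select_units(["chào", "bạn"], db))     # syllable plus half-syllables
```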
1.4 Conclusion
In Chapter 1, basic knowledge of the speech signal and of the speech synthesis techniques used in the world has been presented.
Each speech synthesis technique has its own advantages and disadvantages. While articulatory synthesis and formant synthesis can generate an unlimited number of sentences, the quality of the synthetic voice is not really good (the voice is unnatural and similar to the voice of a robot). Concatenative synthesis and statistical synthesis (hidden Markov model synthesis) are the new-generation techniques. These two techniques provide a more natural voice and better voice quality. Because the voice created by concatenative synthesis combines voice clips recorded from a human, the generated sounds are more natural. However, at concatenation points, discontinuities still exist and the voice is not continuous. Hidden Markov model synthesis can solve this problem, but the voice quality is not very similar to a human's voice.
Aiming to build a speech synthesis system that can be applied on smartphones, we decided to choose the concatenative synthesis technique to research and develop our system, after considering the strengths and weaknesses of all techniques.