constituted of the oral cavity, the nasal cavity, etc., and is then turned into voice. Depending on the shape of the vocal tract, the airflow becomes different speech sounds. Thus the vocal tract parameters determine a particular sound.
For example, the syllable “一” is read as “yi1”. The vocal tract keeps still while the phoneme “i” is pronounced on its own, but in the word “一个”, spelled “yi1ge4”, the vocal tract is already preparing for the phoneme “g” at the end of “i”. So the “i” that is followed by another syllable in a word is pronounced differently from the isolated “i”.
The new speech synthesis method presented in this paper makes the synthesized speech reflect the effect of this coarticulation within a word.
3.3 The Sentence Prosody
First we look at the prosody of words in a sentence. Here we introduce a concept called “pitch resetting”, as opposed to “pitch continuity”, which means that within a word the start pitch of a syllable equals the end pitch of the preceding syllable.
A sentence can be divided into several words. The start pitch of the first syllable of each word is reset to a certain value, whereas inside the word the pitch varies continuously from syllable to syllable. We call this “pitch resetting”. Pitch resetting often happens when we take a breath during reading. We usually take a word, a phrase or a short sentence as a breathing unit. As shown in Figure 2, the sentence “系列报道感受二零零四今天播出” is segmented into the spelling sequence “xi4lie4bao4dao4/ gan3shou4er4ling2ling2si4/ jin1tian1bo1chu1/”. The sentence is composed of three phrases, and Figure 2 shows that the pitch is reset to a certain value at the first syllable of each phrase. The pitch-resetting prosody boundaries can be determined during Chinese word segmentation in text processing.
Fig 2 Chinese sentence prosody
Another feature of sentence prosody is the pitch trend of the whole sentence. In declarative sentences the pitch trend is declining. This trend is superimposed on every syllable in the sentence. So when pitch resetting occurs in a declarative sentence, the “definite pitch” (the value the pitch is reset to) is a little lower than it was at the previous reset.
Considering how Chinese poetry is read, we assume that pitch resetting happens at the end of a single syllable or of a single word.
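The interplay of pitch resetting and the declining trend can be sketched as follows. This is a minimal illustration and not the prosody model of this paper: the function name, the first reset pitch (220 Hz) and the drop per phrase (10 Hz) are hypothetical values chosen only to show the idea.

```python
def reset_pitches(phrases, first_reset_hz=220.0, drop_hz=10.0):
    """Return (phrase, reset pitch) pairs for a declarative sentence.

    Every phrase restarts at a reset pitch, and each reset is slightly lower
    than the previous one to model the declining trend of a statement.
    first_reset_hz and drop_hz are illustrative values, not from the paper.
    """
    return [(p, first_reset_hz - i * drop_hz) for i, p in enumerate(phrases)]


# The example sentence from the text, split at its pitch-resetting boundaries.
sentence = "xi4lie4bao4dao4/gan3shou4er4ling2ling2si4/jin1tian1bo1chu1/"
phrases = [p for p in sentence.split("/") if p]
plan = reset_pitches(phrases)
```

Within each phrase the pitch would then vary continuously from the reset value, which this sketch does not model.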
4 The Speech Synthesis Method
The TTS system mainly consists of three parts: a text processing module, a prosody module and a speech synthesis module. In the speech synthesis module, the choice of speech synthesis algorithm is the most important decision. Because it is such an important part of the TTS system, we take a close look at it.
4.1 Speech Synthesis Algorithm
This paper presents a new speech synthesis method that takes the time-domain waveform editing algorithm as the basic synthesis algorithm and, for adjacent syllables in a word, overlaps the vocal tract cepstrum parameters obtained by homomorphic analysis in order to smooth the transitions between them. The method combines waveform editing synthesis, whose advantage is fast processing, with vocal tract parametric synthesis, whose advantage is flexible adjustment because it models the essence of the sounds.
4.1.1 The Voice Database
Because waveform editing is our basic algorithm, a voice database is needed to store all the elementary waveforms, that is, the synthesis elements.
The choice of the basic synthesis element not only determines the quality of the final speech but is also constrained by the storage capacity of the hardware. Many Chinese TTS systems choose syllables, words, phrases or even sentences as the basic synthesis element, which leads to a large voice database. Our approach, following reference [3], takes initial consonants and simple/compound vowels as the basic elements. The voice database is thereby reduced to several hundred KB while maintaining a comparable level of voice quality.
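To illustrate what looking up such basic elements could involve, the sketch below splits a tone-numbered pinyin syllable into initial consonant, vowel part and tone. It is an assumption-laden example and not the database layout of this paper: the function name is hypothetical, syllables are assumed to end in a tone digit, and “y” and “w” are treated as initials for simplicity.

```python
# Standard pinyin initials; two-letter initials come first so "zh" wins over "z".
# Treating "y" and "w" as initials is a simplification made for this sketch.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split a syllable such as 'bao4' into (initial, final, tone).

    Assumes the last character is the tone digit. Syllables without an
    initial consonant return '' as the initial.
    """
    tone = int(syllable[-1])
    body = syllable[:-1]
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    return "", body, tone


elements = [split_syllable(s) for s in ("xi4", "lie4", "bao4", "dao4")]
```

Each (initial, final) pair would then index two stored waveforms instead of one waveform per whole syllable, which is where the storage saving comes from.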
4.1.2 PSOLA Algorithm
E. Moulines and F. Charpentier proposed a speech synthesis algorithm based on time-domain waveform modification called PSOLA (Pitch Synchronous Overlap Add) [4]. It is widely used today. For more detail on the PSOLA algorithm, see references [5] and [6].
The PSOLA algorithm keeps the waveform and the spectrum smooth and continuous while the speech signal is modified. It works in three steps, as shown in Figure 3. First, a transform is applied to a small segment of the original time-domain waveform whose duration is about twice the pitch period; we call the transformed segment a short-time temporary signal. Second, the temporary signal is modified. Finally, the time-domain waveform is rebuilt from the modified temporary signals. The modifications made in step 2 are what let us synthesize the speech we require.
[Figure 3: original time-domain waveform → temporary signal → modified temporary signal → modified time-domain waveform]
Fig 3 Main steps in PSOLA algorithm
For example, if we want to synthesize a syllable of 400 ms but the corresponding syllable in the voice database lasts 200 ms, we can process it as shown in Figure 4.
[Figure 4: the original waveform (duration 200 ms) is transformed into temporary signals 1, 2, 3; according to the pitch information these are arranged as modified temporary signals 1, 1, 2, 2, 3, 3, 3 and overlapped to give the synthesized waveform (duration 400 ms)]
Fig 4 PSOLA synthesis procedures
First, calculate how many temporary signals are needed for the 400 ms duration and compute all the temporary signals of the original syllable's waveform. Then, according to the pitch information, determine which original temporary signal each temporary signal to be synthesized should equal, and arrange them along the time axis. Finally, overlap and add them to produce the synthesized speech.
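The duration change described above can be sketched in a few lines. This is a simplified illustration rather than the implementation of this paper: it assumes a constant pitch period with pitch marks at every multiple of the period, uses a Hann window as the transform, and the sampling rate and pitch are hypothetical. Its frame mapping repeats each frame evenly, so the exact repetition pattern can differ from Figure 4.

```python
import math


def hann(n):
    """Periodic Hann window of length n; copies shifted by n/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def psola_stretch(signal, period, factor):
    """Change the duration of `signal` by `factor` without changing its pitch.

    Assumes a constant pitch period (in samples) and a signal at least two
    periods long; a real system would use detected pitch marks instead.
    """
    win = hann(2 * period)
    # Step 1: cut temporary signals two pitch periods long, one per pitch mark.
    n_src = len(signal) // period - 1
    frames = [[signal[m * period + i] * win[i] for i in range(2 * period)]
              for m in range(n_src)]
    # Step 2: decide which original frame each synthesized frame should equal.
    n_dst = max(1, round(n_src * factor))
    mapping = [min(n_src - 1, int(j / factor)) for j in range(n_dst)]
    # Step 3: overlap-add the chosen frames one pitch period apart.
    out = [0.0] * ((n_dst + 1) * period)
    for j, src in enumerate(mapping):
        for i, value in enumerate(frames[src]):
            out[j * period + i] += value
    return out, mapping


fs, period = 8000, 80   # hypothetical: 8 kHz sampling, 100 Hz pitch
original = [math.sin(2 * math.pi * n / period) for n in range(fs // 5)]  # 200 ms
stretched, mapping = psola_stretch(original, period, 2.0)                # about 400 ms
```

Because the frames are placed one pitch period apart in both analysis and synthesis, the pitch is preserved while the number of frames, and hence the duration, roughly doubles.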
4.1.3 Concept of Cepstrum
We call the time-domain sequence $\hat{x}(n)$ the complex cepstrum of the signal sequence $x(n)$. It is calculated by formula 1:

$$\hat{x}(n) = Z^{-1}\left[\ln\left[Z\left[x(n)\right]\right]\right] \qquad (1)$$

Keeping only the real part of the logarithm, that is, the log magnitude, gives $c(n)$, which we call the cepstrum. It is calculated by formula 2:

$$c(n) = Z^{-1}\left[\ln\left|Z\left[x(n)\right]\right|\right] \qquad (2)$$
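As a numerical illustration of the cepstrum c(n), the sketch below evaluates the Z transform on the unit circle with a naive discrete Fourier transform. It is a minimal example under stated assumptions: the helper names are ours, the input frame is a made-up toy sequence, and no spectral sample may be exactly zero because its logarithm would be undefined.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform, i.e. the Z transform sampled
    on the unit circle; adequate for short illustrative frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def real_cepstrum(x):
    """c(n) = inverse DFT of ln|DFT(x)|, the cepstrum on the unit circle.
    Assumes no spectral sample is exactly zero."""
    log_magnitude = [math.log(abs(v)) for v in dft(x)]
    return [v.real for v in dft(log_magnitude, inverse=True)]


frame = [1.0, 0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0]   # toy signal
c = real_cepstrum(frame)
c_louder = real_cepstrum([3.0 * v for v in frame])
```

Scaling the signal by a gain g only adds ln(g) to c(0) and leaves every other cepstral sample unchanged, which is one reason the cepstrum is convenient for separating signal components.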
4.1.4 Homomorphic Analysis to Get the Vocal Tract Cepstrum Parameters
In a simple digital speech model, the time-domain speech signal $x(n)$ is the convolution of the speech source signal $e(n)$ and the vocal tract signal $v(n)$. Since the vocal tract carries the most important information of the speech, we want to separate the vocal tract signal and modify it to produce the speech we need.
There is no good way to separate $v(n)$ from $x(n)$ in the time domain, but homomorphic analysis helps. In homomorphic analysis, we apply the Z transform to both sides of equation 3:

$$x(n) = e(n) * v(n) \qquad (3)$$

The convolution becomes a product, giving equation 4:

$$X(k) = E(k)\,V(k) \qquad (4)$$

Taking the logarithm of both sides turns the product into a sum, giving equation 5:

$$\ln(X(k)) = \ln(E(k)) + \ln(V(k)) \qquad (5)$$

We write this as equation 6:

$$\hat{X}(k) = \hat{E}(k) + \hat{V}(k) \qquad (6)$$

Applying the $Z^{-1}$ transform gives equation 7:

$$\hat{x}(n) = \hat{e}(n) + \hat{v}(n) \qquad (7)$$
Now the vocal tract cepstrum parameter $\hat{v}(n)$ can be obtained with a linear filter. After modifying the vocal tract cepstrum parameter, the inverse operations convert the cepstrum-domain signal $\hat{v}(n)$ back into the time-domain signal $v(n)$.
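The additivity in equation 7 and the separating filter can be checked on a toy example. The sketch below is illustrative only and uses the real cepstrum on the unit circle. It assumes, as is common in cepstral analysis but not stated in this paper, that the linear filter is a low-time window that keeps the cepstral samples below a cutoff; the signals, the cutoff and all function names are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform for short frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def real_cepstrum(x):
    """Inverse DFT of ln|DFT(x)|; assumes no spectral sample is zero."""
    return [v.real for v in dft([math.log(abs(v)) for v in dft(x)], inverse=True)]


def circular_convolve(a, b):
    """Circular convolution, for which the DFT product rule holds exactly."""
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]


def low_time_window(c, cutoff):
    """Keep cepstral samples below `cutoff` and their mirror images, zero the
    rest. `cutoff` is a hypothetical tuning parameter."""
    n = len(c)
    return [c[i] if i < cutoff or i > n - cutoff else 0.0 for i in range(n)]


N = 16
e = [0.0] * N
e[0], e[8] = 1.0, 0.6                                 # toy source: two pulses 8 samples apart
v = [0.5 ** t if t < 4 else 0.0 for t in range(N)]    # toy vocal tract response
x = circular_convolve(e, v)                           # equation 3

c_x, c_e, c_v = real_cepstrum(x), real_cepstrum(e), real_cepstrum(v)
v_hat = low_time_window(c_x, cutoff=8)                # estimate of the vocal tract cepstrum
```

In this example the source contributes only at samples 0 and 8 of the cepstrum, so removing sample 8 leaves the vocal tract cepstrum at samples 1 to 7 untouched; the source's gain term still remains in sample 0.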
4.1.5 Vocal Tract Cepstrum Parameter Speech Synthesis
When dealing with adjacent syllables in one word during speech synthesis, we can synthesize the speech by adding the latter syllable's vocal tract cepstrum parameters into the former syllable.
In step 2 of the PSOLA algorithm, after the temporary signals to be synthesized have been calculated, we combine the vocal tract cepstrum parameters of the last k temporary signals of the first syllable with those of the first k temporary signals of the second syllable by a linear operation, and take the result as the new vocal tract cepstrum parameters of the last k temporary signals of the first syllable. Finally, we transform the cepstrum parameters and temporary signals back to the time domain to obtain the synthesized speech. The linear operation is shown in formula 8. The linear coefficient is determined according to reference [7].
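$$\begin{bmatrix}\tilde{v}_{f1}\\ \tilde{v}_{f2}\\ \vdots\\ \tilde{v}_{f25}\end{bmatrix} = \begin{bmatrix}v_{f1}\\ v_{f2}\\ \vdots\\ v_{f25}\end{bmatrix}\times\left[1-\sin\left(\frac{k\pi}{2K}\right)\right] + \begin{bmatrix}v_{b1}\\ v_{b2}\\ \vdots\\ v_{b25}\end{bmatrix}\times\sin\left(\frac{k\pi}{2K}\right) \qquad (8)$$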
In formula 8, $\tilde{v}_{fi}$ is the first syllable's modified vocal tract cepstrum, $v_{fi}$ is the first syllable's original vocal tract cepstrum, and $v_{bi}$ is the second syllable's original vocal tract cepstrum.
In this way we account for the influence that adjacent syllables have on each other.
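A sketch of the linear operation follows. It rests on two assumptions that should be checked against formula 8 and reference [7]: that the blend has the form (1 - w) times the first syllable's cepstrum plus w times the second syllable's cepstrum with w = sin(kπ / 2K), and that k indexes the temporary signal inside a transition region of K signals. The function name and the test vectors are hypothetical.

```python
import math


def blend_cepstra(v_f, v_b, k, K):
    """Blend one temporary signal's vocal tract cepstrum of the first syllable
    (v_f) toward that of the second syllable (v_b).

    The weight w = sin(pi * k / (2 * K)) and the meaning of k and K are
    assumptions of this sketch, not confirmed details of the method.
    """
    w = math.sin(math.pi * k / (2 * K))
    return [(1 - w) * f + w * b for f, b in zip(v_f, v_b)]


v_f = [1.0] * 25   # hypothetical 25 cepstrum coefficients of the first syllable
v_b = [0.0] * 25   # hypothetical 25 cepstrum coefficients of the second syllable
transition = [blend_cepstra(v_f, v_b, k, 3) for k in range(4)]
```

Under these assumptions the weight rises smoothly from 0 to 1 across the transition, so the end of the first syllable gradually takes on the vocal tract shape of the syllable that follows.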
4.2 The Programming Implementation
The method presented in this paper is implemented in C++ under the VC.net framework.
Figure 5 shows the logical procedure of the Chinese TTS system. When Chinese poetry text is input into the TTS system, we first predict the basic duration and pitch of each syllable and the sentence mood, and then perform word segmentation to mark the pitch-resetting boundaries. The next step synthesizes the speech with the PSOLA algorithm from the consonants, vowels and tones that have already been analyzed, while adjusting the prosody of adjacent syllables in one word with the vocal tract cepstrum parameter synthesis algorithm. The result is the synthesized speech.
[Figure 5: flow diagram with the blocks: Chinese poetry text; words segmentation; Chinese words; Chinese syllables; consonant, vowel, tone; predicted duration; predicted pitch; pitch resetting boundary; changed tone type; modify vocal tract parameter; synthesized by PSOLA algorithm and cepstrum parameters algorithm; sound]
Fig 5 The logic flow of TTS system
The final user interface, including the waveform synthesized by the system, is shown in Figure 6.
Fig 6 User interface