Lecture Notes in Computer Science- P75 pdf


constituted of the oral cavity, the nasal cavity, etc., and then changes into voice. According to the different shapes of the vocal tract, the airflow becomes different speech sounds. Thus the vocal tract parameters determine a particular voice.

For example, the syllable “一” is read as “yi1”. The vocal tract keeps still while the phoneme “i” is pronounced in isolation, but in the word “一个”, whose spelling is “yi1ge4”, the vocal tract is already preparing for the phoneme “g” at the end of the phoneme “i”. So the pronunciation of “i” followed by another syllable in a word differs from that of the isolated “i”.

The new speech synthesis method presented in this paper makes the synthesized speech reflect the effect of coarticulation within a word.

3.3 The Sentence Prosody

First we look at the prosody of words in a sentence. Here we have to introduce a concept called “pitch resetting”, which contrasts with “pitch continuity”: within a word, each syllable’s start pitch equals the end pitch of the previous syllable.

A sentence can be divided into several words. The start pitch of the first syllable of each word is reset to a certain pitch, whereas inside the word the pitch varies continuously from syllable to syllable. We call this “pitch resetting”. Pitch resetting often happens when we take a breath during reading. We often take a word, a phrase or a short sentence as a breath unit. As shown in Figure 2, the sentence “系列报道感受二零零四今天播出” is segmented into the spelling sequence “xi4lie4bao4dao4/ gan3shou4er4ling2ling2si4/ jin1tian1bo1chu1/ ”. The sentence is composed of three phrases, and Figure 2 shows that the pitch is reset to a certain value at the first syllable of each phrase. The prosodic boundaries for pitch resetting can be decided during Chinese word segmentation in text processing.
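The segmented spelling sequence above can be turned into breath units and reset positions with a few lines of code. This is only an illustrative sketch: the function names are hypothetical, and it assumes that every syllable is written as lowercase letters followed by a single tone digit and that "/" marks a pitch resetting boundary.

```python
import re

def parse_prosody_units(spelling):
    """Split a '/'-delimited pinyin string into breath units of (syllable, tone) pairs."""
    units = []
    for chunk in spelling.split("/"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the trailing '/'
        # Assumed syllable format: a run of letters followed by one tone digit.
        units.append([(s, int(t)) for s, t in re.findall(r"([a-z]+)([1-5])", chunk)])
    return units

def pitch_reset_positions(units):
    """Index, in the flat syllable sequence, of the first syllable of each unit."""
    positions, index = [], 0
    for unit in units:
        positions.append(index)
        index += len(unit)
    return positions

units = parse_prosody_units(
    "xi4lie4bao4dao4/ gan3shou4er4ling2ling2si4/ jin1tian1bo1chu1/ ")
resets = pitch_reset_positions(units)
```

For the example sentence this yields three units, and `resets` marks the syllables at which the pitch would be reset.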

Fig 2 Chinese sentence prosody

Another feature of sentence prosody is the pitch trend of the whole sentence. In declarative sentences the pitch trend is declining, and this trend is superimposed on every syllable in the sentence. So when pitch resetting occurs in a declarative sentence, the pitch it resets to is a little lower than at the previous reset.
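The interaction of pitch resetting, pitch continuity and the declining trend can be sketched as follows. The numbers are purely hypothetical (a 220 Hz first reset pitch and a 10 Hz drop per unit), and the linear drop is an assumption made for illustration; the paper does not state how much lower each reset is.

```python
def syllable_pitch_targets(unit_deltas, first_reset_hz=220.0, drop_per_unit_hz=10.0):
    """Compute (start, end) pitch pairs for every syllable.

    unit_deltas: one list per breath unit, holding the pitch change in Hz
    across each syllable of that unit (hypothetical input format).
    """
    contour = []
    for i, deltas in enumerate(unit_deltas):
        # Pitch resetting: each unit starts from a reset pitch, lowered a
        # little at every reset to follow the declining sentence trend.
        start = first_reset_hz - i * drop_per_unit_hz
        unit = []
        for delta in deltas:
            end = start + delta
            unit.append((start, end))
            start = end  # pitch continuity inside the unit
        contour.append(unit)
    return contour

contour = syllable_pitch_targets([[-20.0, 10.0], [-30.0]])
```

Inside a unit each syllable starts where the previous one ended, while the second unit starts from its own, slightly lower, reset pitch rather than from the end pitch of the first unit.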

Considering the way Chinese poetry is read, we assume that pitch resetting happens at the end of a single syllable or of a single word.


4 The Speech Synthesis Method

The TTS system mainly includes three parts: a text processing module, a prosody module and a speech synthesis module. In the speech synthesis module, the choice of speech synthesis algorithm is the most important decision. As it is an important part of the TTS system, we take a close look at it.

4.1 Speech Synthesis Algorithm

This paper presents a new speech synthesis method that takes the time-domain waveform editing algorithm as the basic speech synthesis algorithm and overlaps the vocal tract cepstrum parameters, obtained by homomorphic analysis, of adjacent syllables in a word to smooth the speech transitions. The method combines waveform editing synthesis, whose advantage is fast processing, with vocal tract parametric synthesis, whose advantage is flexible adjustment because it models the essence of the sounds.

4.1.1 The Voice Database

Because the waveform editing algorithm is our basic algorithm, a voice database is needed to store all the elementary waveforms, that is, the synthesis elements.

The choice of the base synthesis element not only decides the quality of the final speech but is also constrained by the hardware storage capacity. Many Chinese TTS systems choose syllables, words, phrases or even sentences as the base synthesis element, which leads to a big voice database. Our approach takes initial consonants and simple/compound vowels as the basic elements, following reference [3]. Thus the voice database is cut down to several hundred KB while maintaining a roughly equal level of voice quality.
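To illustrate what choosing initials and vowels as base elements means for database lookup, the sketch below splits a toned pinyin syllable into an initial, a final and a tone. The initial list is an assumption (a common pinyin inventory that also treats "y" and "w" as initials); the inventory actually used in reference [3] may differ.

```python
# Assumed inventory of initials; two-letter initials come first so that
# "zh", "ch" and "sh" are matched before "z", "c" and "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_syllable(pinyin):
    """Split e.g. 'shou4' into (initial, final, tone)."""
    tone = int(pinyin[-1])
    body = pinyin[:-1]
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    return "", body, tone  # syllable without an initial, e.g. 'er4'

elements = [split_syllable(s) for s in ("yi1", "ge4", "shou4", "er4")]
```

Each syllable then needs only two stored waveforms (one initial, one final) instead of a waveform per syllable, which is why the element inventory, and hence the database, stays small.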

4.1.2 PSOLA Algorithm

E. Moulines and F. Charpentier proposed a speech synthesis algorithm based on time-domain waveform modification called PSOLA (Pitch Synchronous Overlap Add) [4]. It is widely used nowadays. For more detail about the PSOLA algorithm, see references [5] and [6].

The PSOLA algorithm keeps the waveform and the spectrum smooth and continuous while the speech signal is being modified. It works in three steps, as shown in Figure 3. First, apply a transform to a small segment of the original time-domain waveform whose duration is about twice the pitch period; we call the transformed speech signal a short-time temporary signal. Then modify the temporary signal. Finally, rebuild the time-domain waveform from the modified temporary signals. The modifications in step 2 are where we shape the speech we require.

[Figure 3: original time-domain waveform → temporary signal → modified temporary signal → modified time-domain waveform]

Fig 3 Main steps in PSOLA algorithm
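Steps 1 and 3 of the procedure can be sketched in a few lines. The sketch assumes a constant pitch period, pitch marks exactly one period apart, and a Hann window as the "transform"; these are simplifying assumptions for illustration, not details given in the paper, and step 2 is left empty.

```python
import math

def hann(length):
    # Periodic Hann window: two copies shifted by length/2 sum to exactly 1.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / length) for i in range(length)]

def analyse(signal, period):
    """Step 1: cut windowed temporary signals, each two pitch periods long,
    centred on pitch marks assumed to lie every `period` samples."""
    window = hann(2 * period)
    frames = []
    for mark in range(period, len(signal) - period + 1, period):
        segment = signal[mark - period: mark + period]
        frames.append([w * s for w, s in zip(window, segment)])
    return frames

def overlap_add(frames, period):
    """Step 3: place the temporary signals one period apart and sum them."""
    out = [0.0] * (period * (len(frames) + 1))
    for i, frame in enumerate(frames):
        for j, value in enumerate(frame):
            out[i * period + j] += value
    return out

period = 20
signal = [math.sin(2.0 * math.pi * n / period) for n in range(200)]
frames = analyse(signal, period)          # step 1
rebuilt = overlap_add(frames, period)     # step 3, with no step-2 modification
```

With no modification in step 2, every sample covered by two overlapping windows is rebuilt exactly, which is the property that lets the modified signal stay smooth and continuous.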


For example, if we want to synthesize a syllable of 400 ms, but the corresponding syllable in the voice database lasts 200 ms, we can process it as shown in Figure 4.

[Figure 4: the original waveform (duration 200 ms) is transformed into temporary signals 1, 2 and 3; according to the pitch information these are modified, repeated and overlapped to form the synthesized waveform (duration 400 ms). Legend: i / modified i denote temporary signals.]

Fig 4 PSOLA synthesis procedures

First calculate how many temporary signals there should be in the 400 ms duration, and calculate all the temporary signals of the original syllable’s waveform. Then, according to the pitch information, decide which original temporary signal each temporary signal to be synthesized should equal, and arrange them on the time axis. Finally, overlap them to produce the synthesized speech.
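The counting and matching step can be sketched as below. It assumes a constant pitch period (10 ms here, a hypothetical value) and a simple proportional mapping between synthesized and original temporary signals; a real system would follow the actual pitch marks instead.

```python
def frames_needed(duration_ms, pitch_period_ms):
    """Number of temporary signals that fit in the given duration,
    assuming one temporary signal per (constant) pitch period."""
    return round(duration_ms / pitch_period_ms)

def map_frames(num_original, num_target):
    """For each temporary signal to be synthesized, the index of the
    original temporary signal it reuses (proportional mapping)."""
    return [min(num_original - 1, (j * num_original) // num_target)
            for j in range(num_target)]

n_original = frames_needed(200, 10)   # syllable stored in the voice database
n_target = frames_needed(400, 10)     # syllable to be synthesized
mapping = map_frames(n_original, n_target)
```

Doubling the duration makes each original temporary signal appear twice in `mapping`; a shorter target would skip some of them instead.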

4.1.3 Concept of Cepstrum

We call the time-domain sequence $\hat{x}(n)$ the complex cepstrum of the signal sequence $x(n)$. It is calculated by formula 1:

$$\hat{x}(n) = Z^{-1}\big[\ln\big[Z[x(n)]\big]\big] \tag{1}$$

If only the real part of the logarithm, that is the log magnitude, is kept, the result is $c(n)$, which we call the cepstrum. It is calculated by formula 2:

$$c(n) = Z^{-1}\big[\ln\big|Z[x(n)]\big|\big] \tag{2}$$
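In practice formula 2 is usually evaluated on the unit circle, where the Z transform of a finite sequence becomes the discrete Fourier transform. The sketch below follows that assumption and uses a naive DFT to stay dependency-free; it is meant as an illustration, not as the implementation used in the paper.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(x):
    """c(n): inverse transform of the log magnitude spectrum of x(n).
    Assumes that no DFT bin of x is exactly zero."""
    log_magnitude = [math.log(abs(v)) for v in dft(x)]
    return [c.real for c in dft(log_magnitude, inverse=True)]

c_impulse = real_cepstrum([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
c_decay = real_cepstrum([1.0, 0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0])
```

A scaled impulse has a flat spectrum, so its cepstrum is concentrated at n = 0, and the cepstrum of any real signal computed this way is symmetric in n.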

4.1.4 Homomorphic Analysis to Get the Vocal Tract Cepstrum Parameters

In a simple digital speech model, the time-domain speech signal $x(n)$ is the convolution of the speech source signal $e(n)$ and the vocal tract signal $v(n)$. The vocal tract carries the most important information of the speech, so we want to separate the vocal tract signal and modify it in order to produce the speech we need.


There is no good way to separate $v(n)$ from $x(n)$ in the time domain, but homomorphic analysis helps. In homomorphic analysis, we apply the Z transform to both sides of equation 3:

$$x(n) = e(n) * v(n) \tag{3}$$

The convolution becomes a product, and we get equation 4:

$$X(k) = E(k)\,V(k) \tag{4}$$

Taking the logarithm of both sides turns the product into a sum, and we get equation 5:

$$\ln(X(k)) = \ln(E(k)) + \ln(V(k)) \tag{5}$$

Write it as equation 6:

$$\hat{X}(k) = \hat{E}(k) + \hat{V}(k) \tag{6}$$

Applying the $Z^{-1}$ transform gives equation 7:

$$\hat{x}(n) = \hat{e}(n) + \hat{v}(n) \tag{7}$$

Now we can obtain the vocal tract cepstrum parameter $\hat{v}(n)$ with a linear filter. After modifying the vocal tract cepstrum parameter, the inverse operations can be used to bring the cepstrum-domain signal $\hat{v}(n)$ back to the time-domain signal $v(n)$.
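The key property of the derivation, that convolution in the time domain becomes addition in the cepstrum domain, can be checked numerically. The sketch below makes several assumptions: it uses the real cepstrum and a naive DFT, circular convolution so that the product property holds exactly for finite sequences, toy signals for the source and the vocal tract, and a low-time window as one common choice for the "linear filter" (the paper does not say which filter it uses).

```python
import cmath
import math

def dft(x, inverse=False):
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(x):
    log_magnitude = [math.log(abs(v)) for v in dft(x)]
    return [c.real for c in dft(log_magnitude, inverse=True)]

def circular_convolve(a, b):
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]

def low_time_window(c, cutoff):
    """Keep only the cepstrum coefficients near n = 0 (symmetric window)."""
    n = len(c)
    return [c[i] if (i < cutoff or i > n - cutoff) else 0.0 for i in range(n)]

e = [1.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0]   # toy source: two pulses
v = [0.8 ** t for t in range(8)]                # toy vocal tract: decaying response
x = circular_convolve(e, v)

c_x, c_e, c_v = real_cepstrum(x), real_cepstrum(e), real_cepstrum(v)
vocal_tract_estimate = low_time_window(c_x, cutoff=3)
```

Here `c_x` equals `c_e + c_v` coefficient by coefficient, so a filter that keeps some coefficients and zeroes others acts on the two components separately.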

4.1.5 Vocal Tract Cepstrum Parameter Speech Synthesis

When dealing with adjacent syllables in one word during speech synthesis, we can synthesize the speech by adding the latter syllable’s vocal tract cepstrum parameters into the former syllable.

In step 2 of the PSOLA algorithm, after the temporary signals to be synthesized are calculated, we combine the vocal tract cepstrum parameters of the last k temporary signals of the first syllable with those of the first k temporary signals of the second syllable in a linear operation, and take the result as the new vocal tract cepstrum parameters of the last k temporary signals of the first syllable. Finally, we transform the cepstrum parameters and temporary signals back to the time domain to get the synthesized speech. The linear operation is shown in formula 8. The linear coefficient is determined according to reference [7].

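[Formula 8: the modified vector $(\tilde{v}_{f1}, \tilde{v}_{f2}, \ldots, \tilde{v}_{f25})^T$ is a linear combination of $(v_{f1}, v_{f2}, \ldots, v_{f25})^T$ and $(v_{b1}, v_{b2}, \ldots, v_{b25})^T$, with weights that depend on $k/K$ and a sine term in $\pi$; the exact coefficients are illegible in this copy.] (8)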


In formula 8, $\tilde{v}_{fi}$ is the first syllable’s modified vocal tract cepstrum, $v_{fi}$ is the first syllable’s original vocal tract cepstrum, and $v_{bi}$ is the second syllable’s original vocal tract cepstrum.
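The kind of linear operation described for formula 8 can be sketched as a frame-by-frame blend of cepstrum vectors. The weight used below, a sine ramp scaled by a hypothetical `max_weight`, is an assumption made only for illustration and should not be read as the coefficient of formula 8 or of reference [7]. The frame pairing (the i-th of the last K frames of the first syllable with the i-th of the first K frames of the second) is likewise assumed.

```python
import math

def blend_tail(first_frames, second_frames, K, max_weight=0.5):
    """Return a copy of first_frames whose last K cepstrum vectors are mixed
    with the first K cepstrum vectors of second_frames."""
    blended = [list(frame) for frame in first_frames]
    offset = len(first_frames) - K
    for k in range(1, K + 1):
        # ASSUMED weight: grows smoothly towards the syllable boundary.
        w = max_weight * math.sin(math.pi * k / (2 * K))
        v_f = first_frames[offset + k - 1]   # original first-syllable cepstrum
        v_b = second_frames[k - 1]           # original second-syllable cepstrum
        blended[offset + k - 1] = [(1.0 - w) * a + w * b for a, b in zip(v_f, v_b)]
    return blended

# Toy data: 4 frames per syllable, 25 cepstrum coefficients per frame.
first = [[1.0] * 25 for _ in range(4)]
second = [[3.0] * 25 for _ in range(4)]
blended = blend_tail(first, second, K=2)
```

Only the last K frames of the first syllable change, and they move progressively towards the second syllable's vocal tract shape as the boundary approaches.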

In this way we account for the interaction between adjacent syllables.

4.2 The Programming Implementation

The method described in this paper is implemented in C++ under the VC.net framework.

Figure 5 shows the logic flow of the Chinese TTS system. When Chinese poetry text is input into the TTS system, we first predict the basic duration and pitch of each syllable and the sentence mood, and then perform word segmentation to mark the boundaries of pitch resetting. The next step synthesizes the speech with the PSOLA algorithm from the consonants, vowels and tones that have already been analyzed, while adjusting the prosody of adjacent syllables in one word with the vocal tract cepstrum parameter synthesis algorithm. Finally we get the synthesized speech.

[Figure 5 boxes: Chinese poetry text; words segmentation; Chinese words; Chinese syllables; predicted duration; predicted pitch; pitch resetting boundary; consonant, vowel, tone; modify vocal tract parameter; changed tone type; synthesized by PSOLA algorithm and cepstrum parameters algorithm; sound]

Fig 5 The logic flow of TTS system

The final user interface, including the waveform synthesized by the system, is shown in Figure 6.

Fig 6 User interface
