Vietnamese Speech Recognition and Synthesis in Embedded System Using T-Engine
Trinh Van Loan, La The Vinh
Department of Computer Engineering, Faculty of Information Technology, Hanoi University of Technology
1A DaiCoViet, Hanoi, Vietnam
Abstract – In Vietnam, research in speech recognition and synthesis has started in recent years. Together with the growing trend of human-computer interaction systems using speech, optimizing the speech recognition and synthesis modules in both speed and quality is an important problem, in order to combine the two modules in one interactive product. Based on well-known methods for the recognition and synthesis problems, we carry out experiments and enhancements to improve both the speed and quality of the speech engines. Finally, we demonstrate human-computer interaction software on a T-Engine embedded system.
I INTRODUCTION
In this paper, we are concerned with the combination of speech recognition and synthesis engines and their implementation in the T-Engine embedded system.
Based on previous research in Vietnamese speech recognition and synthesis, we propose some enhancements to improve the quality of the synthetic speech. Besides, the use of system resources (memory, CPU, storage) in the implementation is a considerable problem, especially for an embedded system such as the T-Engine in our case.
The paper is organized as follows. In Section II, a short introduction to T-Engine is given; the method proposed for speech recognition on T-Engine is presented in Section III, followed by the Vietnamese speech synthesis method in Section IV. In the last two sections, we describe a demonstration application and provide concluding remarks.
II T-ENGINE INTRODUCTION
The T-Engine is a project to develop a standardized, open, real-time computing system and development environment. T-Engine standardizes the hardware (the T-Engine board) as well as the real-time operating system (T-Kernel). The hardware includes:
- CPU with built-in MMU: SH7660, 16.6667 MHz clock, 200 MHz operating speed (x12)
- RAM: EDS2516APTA, 64 MB
- FLASH memory: MBM29DL640, 8 MB, 256-pin BGA
- Touch panel controller: ADS7843, 16-pin SSOP
- Real-time clock: RV5C348B, 10-pin SSOP
- Audio CODEC: UDA1342, minimum sampling frequency 44 kHz, microphone sensitivity -51 dB/Pa
Fig.1 T-Engine layout
III PROPOSED METHOD FOR SPEECH RECOGNITION IN T-ENGINE
Fig.2 Speech recognition in T-Engine
The UDA1342 audio codec in T-Engine provides a minimum sampling frequency (SF) of 44100 Hz. Such a high SF is not necessary for our recognition task, while it increases the computation considerably. So, in order to improve recognition speed, we need a downsampling module with a factor of four to pre-process the speech signal before the feature extraction phase.
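As a rough illustration of this step (the paper gives no code), the following C sketch decimates a 44100 Hz buffer by a factor of four; the crude moving-average low-pass and the function name are our own placeholder choices, not the authors' filter.

```c
#include <stddef.h>

/* Downsample 44100 Hz speech to 11025 Hz: low-pass filter to limit
 * aliasing, then keep every 4th sample. The short moving average is
 * only a stand-in for a proper anti-aliasing FIR filter. */
#define FACTOR 4
#define TAPS   8   /* illustrative filter length */

size_t downsample_by_4(const short *in, size_t n_in, short *out)
{
    size_t n_out = 0;
    for (size_t i = 0; i + TAPS <= n_in; i += FACTOR) {
        long acc = 0;
        for (size_t t = 0; t < TAPS; t++)   /* crude low-pass: average */
            acc += in[i + t];
        out[n_out++] = (short)(acc / TAPS);
    }
    return n_out;  /* number of 11025 Hz samples produced */
}
```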
In Figure 2, the feature extraction module and the recognition model are the most important components. In our system we use MFCC features of the speech signal, since these have been proved to be good features for speech recognition. To form a feature vector, we first divide the signal into frames; then, for each frame, a feature vector is calculated comprising 13 MFCC values together with their first and second derivatives. Assume that x[0..L-1] is the speech signal; the k-th frame of the speech is constructed as:

s[n] = x[k*N + n] for n = 0..K-1

where K is the frame length and N is the frame shift. From the 13 MFCC values m[0..12], the first and second derivatives are calculated as follows:

m1[0] = 0; m1[k] = m[k] - m[k-1] for k = 1..12
m2[0] = 0; m2[k] = m1[k] - m1[k-1] for k = 1..12

where m1 and m2 are the first and second derivatives respectively. In the training phase, the feature vectors are used to adjust the HMM parameters (the number of Gaussian mixtures, the transition probability matrix, and the observation probability matrix) for the best fit of the models to the input data. Table I illustrates the experimental results of our recognition system with MFCC features and hidden Markov models with six left-to-right states and twenty Gaussian mixtures.
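A minimal C sketch of assembling the 39-dimensional feature vector from the 13 MFCC values using the difference formulas above; the function name and float layout are illustrative assumptions (production systems often use regression-based deltas instead of simple differences).

```c
#define N_MFCC 13

/* Build a 39-dimensional feature vector from 13 MFCC values by
 * appending first and second differences, as defined in the text. */
void make_feature_vector(const float m[N_MFCC], float feat[3 * N_MFCC])
{
    float m1[N_MFCC], m2[N_MFCC];
    m1[0] = 0.0f;
    m2[0] = 0.0f;
    for (int k = 1; k < N_MFCC; k++)
        m1[k] = m[k] - m[k - 1];          /* first derivative  */
    for (int k = 1; k < N_MFCC; k++)
        m2[k] = m1[k] - m1[k - 1];        /* second derivative */
    for (int k = 0; k < N_MFCC; k++) {
        feat[k]              = m[k];
        feat[N_MFCC + k]     = m1[k];
        feat[2 * N_MFCC + k] = m2[k];
    }
}
```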
[Fig. 2 block diagram: input speech is downsampled to 11025 Hz, features are extracted, and the trained model (produced by the model training step) yields the recognition output.]
Table I Speech recognition results [columns: Speakers, Trained, Test Number, Accuracy; training data: 1100 files; HMM: 6 states, 20 Gaussian mixtures]
Note that, in our recognition engine, we separate the training phase from the recognition phase in order to reduce the system resources used on T-Engine. The training phase is implemented on a PC with speech data collected from T-Engine; only the recognition module is implemented on T-Engine.
IV VIETNAMESE SPEECH SYNTHESIS IN T-ENGINE
Previous research in Vietnamese speech synthesis has indicated that PSOLA is an effective method to synthesize speech, based on the concatenation of diphones with amplitude and tone balancing. PSOLA is not only a good-quality method but also a speed-optimal one. Hence, PSOLA is very suitable for implementation in embedded systems, which is why we use it in our speech-based human-machine interaction product on T-Engine. However, Vietnamese has some specific characteristics that make speech synthesis somewhat different from other languages. The following traits in particular must be considered in a Vietnamese TTS system: Vietnamese is a monosyllabic and tonal language with six tones, each corresponding to a different variation pattern of the fundamental frequency (f0) of speech. Because of these features, there are two common ways of synthesizing Vietnamese tones. The first method is to modify f0 to obtain the corresponding tone; this reduces the size of the diphone database and the complexity of f0 balancing considerably. However, the quality of the speech is not very good, especially for the "~" and "?" tones, as in "bão" and "bảo", because the amplitude changes in a very complicated way together with f0.
The second method is to concatenate diphones with already recorded tones. In this manner, the size of the data increases noticeably, but the tones are reproduced exactly as in natural speech, so the speech quality is quite good. However, f0 balancing when concatenating recorded-tone diphones is a little more difficult. To solve this problem, we cut the diphones into frames, each frame being one speech period.
Fig.3 Speech signal frames
Then, the frames are multiplied by a Hamming window.
Fig.4 Speech signal frames multiplied by a Hamming window
To keep the f0 contour smooth, the frames are overlapped with the desired period.
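The windowing step can be sketched as follows (an illustration only; the in-place convention and function name are assumed, not from the paper):

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Multiply one extracted frame (length N) by a Hamming window in
 * place, as illustrated in Fig. 4. */
static void hamming(float *frame, long N)
{
    for (long n = 0; n < N; n++)
        frame[n] *= 0.54f - 0.46f * cosf(2.0f * (float)M_PI * n / (N - 1));
}
```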
The two frames at the point of contact are used for power balancing between the two diphones. Assume that x(n) and y(n) are the contact frames, each of length N; we compute a power factor p (the original formula is lost, and is reconstructed here as the ratio of the two frames' RMS amplitudes, which matches its use below):

p = sqrt( (x(0)^2 + ... + x(N-1)^2) / (y(0)^2 + ... + y(N-1)^2) )
Then the overlapping is done with the second diphone's frames multiplied by p. Assume that x[0..L-1] is the current speech signal, y[0..N-1] is the next frame, and K is the period. The synthesized signal s[] is calculated by:

s[n] = x[n] for n = 0..L-N/2-1
s[n] = x[n] + p*y[n-L+N/2] for n = L-N/2..L-N/2+K-1
s[n] = p*y[n-L+N/2] for n = L-N/2+K..L+K
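The following C sketch implements this overlap-add step under the assumption that a frame spans two pitch periods (N = 2K), so the appended tail begins exactly at the old signal end; the power factor follows the RMS ratio above. Names and buffer conventions are ours, not from the paper.

```c
#include <math.h>

/* Power factor between contact frames x and y of length N, assumed
 * here to be the RMS ratio, so the joined diphones match in energy. */
static float power_factor(const float *x, const float *y, long N)
{
    double ex = 0.0, ey = 0.0;
    for (long n = 0; n < N; n++) {
        ex += (double)x[n] * x[n];
        ey += (double)y[n] * y[n];
    }
    return (ey > 0.0) ? (float)sqrt(ex / ey) : 1.0f;
}

/* Overlap-add the next (already Hamming-windowed) frame y[0..N-1]
 * onto the signal s[0..L-1], with pitch period K and power factor p,
 * following the formulas above. Assumes N = 2*K; the caller must
 * ensure s can hold L + K samples. Returns the new length L + K. */
static long overlap_add(float *s, long L, const float *y, long N,
                        long K, float p)
{
    long j = L - N / 2;                    /* frame start within s */
    for (long n = j; n < j + K; n++)       /* overlap region       */
        s[n] += p * y[n - j];
    for (long n = j + K; n < L + K; n++)   /* new tail samples     */
        s[n] = p * y[n - j];
    return L + K;
}
```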
To reduce the memory needed to implement the above algorithm on the T-Engine embedded system, we store each diphone in a separate data file together with an index file. In this way, only two diphones are loaded in memory at the same time, so memory usage is reduced considerably. Table II shows the structure of the database index file.
Table II Diphone index file structure
Length   Information
2 BYTE   End-point of the first frame, which is also the start-point of a period of the vowel. This field is available for the first diphone only.
2 BYTE   End-point of the n-th period (one such field per period).
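As an illustration of how such an index file might be read on the device (the byte order and the end-of-file convention are not specified in the paper and are assumptions here):

```c
#include <stdio.h>
#include <stdint.h>

/* Read the 2-byte period end-points of one diphone from its index
 * file, following the Table II layout. Little-endian byte order and
 * reading until end-of-file are assumed, not stated in the paper. */
static int read_index(const char *path, uint16_t *endpoints, int max)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    int n = 0;
    unsigned char b[2];
    while (n < max && fread(b, 1, 2, f) == 2)
        endpoints[n++] = (uint16_t)(b[0] | (b[1] << 8));  /* LE */
    fclose(f);
    return n;  /* number of end-points read */
}
```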
All the data in the table above is computed manually to ensure the quality of the synthesized speech. In order to speed up the creation of the database, we have built a database tool supporting automatic frame detection. This tool is very useful for creating the database with less effort: it can produce a pitch contour automatically from a wave data file with high accuracy, and then save the contour to a database index file as described in Table II.
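The paper does not describe the tool's detection algorithm; one common choice for pitch period estimation is autocorrelation, sketched below with assumed parameter ranges.

```c
/* Estimate the pitch period of a speech frame by autocorrelation:
 * the lag with the highest correlation inside the plausible pitch
 * range is taken as the period. Illustration only; for 11025 Hz
 * speech, min_lag/max_lag of roughly 28..220 samples covers about
 * 50..400 Hz. */
static long pitch_period(const float *x, long n, long min_lag, long max_lag)
{
    long best_lag = min_lag;
    double best = -1.0;
    for (long lag = min_lag; lag <= max_lag && lag < n; lag++) {
        double r = 0.0;
        for (long i = 0; i + lag < n; i++)
            r += (double)x[i] * x[i + lag];
        if (r > best) { best = r; best_lag = lag; }
    }
    return best_lag;  /* estimated period in samples */
}
```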
Fig.7 Screenshot: waiting for speech commands from the user
V APPLICATION OF VIETNAMESE RECOGNITION AND SYNTHESIS
Vietnamese speech recognition and synthesis have a wide range of applications, especially in human-computer interaction (HCI).
Fig.6 Screenshot of the Ho Chi Minh Museum introduction
To demonstrate the use of speech in HCI, we have combined speech recognition with speech synthesis in our software running on T-Engine. This software allows users to issue speech commands to query information about places in Ha Noi. Figure 7 illustrates the main screen of the software: the user sees a map of Ha Noi with some places shown with bold titles. When a user reads out a title, for example "Bao tang Ho Chi Minh", the software tells the user some information about that place. Figure 6 is a screenshot of the Ho Chi Minh Museum introduction.
VI CONCLUSIONS
This paper has presented an advanced method for Vietnamese speech synthesis from a database of diphones with already recorded tones. We have experimented with the implementation of a human-computer interaction system on the T-Engine embedded system; moreover, our enhancements and optimizations allow the system to be implemented on low-resource embedded systems. The complete experiment consists of two parts, recognition and synthesis; Table I illustrates some recognition results with different voices.