Vietnamese Speech Recognition and Synthesis in Embedded System Using T-Engine
Trinh Van Loan, La The Vinh
Department of Computer Engineering, Faculty of Information Technology, Hanoi University of Technology
1A DaiCoViet, Hanoi, Vietnam
Abstract – In Vietnam, research in speech recognition and synthesis has started in recent years. Together with the growing trend of human-computer interaction systems using speech, optimizing the speech recognition and synthesis modules in both speed and quality is an important problem, in order to combine the two modules in one interactive product. Based on well-known methods for the recognition and synthesis problems, we carry out experiments and enhancements to improve both the speed and quality of the speech engines. Finally, we demonstrate human-computer interaction software on a T-Engine embedded system.
I INTRODUCTION
In this paper, we are concerned with the combination of speech recognition and synthesis engines and their implementation in the T-Engine embedded system.
Based on previous research in Vietnamese speech recognition and synthesis, we propose some enhancements to improve the quality of the synthetic speech. Besides, the use of system resources (memory, CPU, storage) in the implementation is a considerable problem, especially for an embedded system such as the T-Engine in our case.
The paper is organized as follows. In Section II, a short introduction to T-Engine is given; the method proposed for speech recognition on T-Engine is presented in Section III, followed by the Vietnamese speech synthesis method in Section IV. In the last two sections, we describe a demonstration application and provide concluding remarks.
II T-ENGINE INTRODUCTION
The T-Engine is a project to develop a standardized, open, real-time computing system and development environment. T-Engine standardizes the hardware (the T-Engine board) as well as the real-time operating system (T-Kernel). The hardware includes:
- CPU with built-in MMU: SH7660, 16.6667 MHz clock, 200 MHz operating speed (x12)
- RAM: EDS2516APTA, 64 MB
- FLASH memory: MBM29DL640, 8 MB, 256-pin BGA
- Touch panel controller: ADS7843, 16-pin SSOP
- Real-time clock: RV5C348B, 10-pin SSOP
- Audio CODEC: UDA1342, minimum sampling frequency 44 kHz, microphone sensitivity -51 dB/Pa
Fig.1 T-Engine layout
III PROPOSED METHOD FOR SPEECH RECOGNITION IN T-ENGINE
Fig.2 Speech recognition in T-Engine
The UDA1342 audio codec in T-Engine provides a minimum sampling frequency (SF) of 44100 Hz. Such a high SF is not necessary for our recognition task, while it increases the computation considerably. So, in order to improve recognition speed, we need a downsampling module with a factor of four to pre-process the speech signal before the feature extraction phase.
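As a rough illustration of this step (the paper gives no code), the following C sketch decimates a 44100 Hz buffer by a factor of four; the crude moving-average low-pass and the function name are our own placeholder choices, not the authors' filter.

```c
#include <stddef.h>

/* Downsample 44100 Hz speech to 11025 Hz: low-pass filter to limit
 * aliasing, then keep every 4th sample. The short moving average is
 * only a stand-in for a proper anti-aliasing FIR filter. */
#define FACTOR 4
#define TAPS   8   /* illustrative filter length */

size_t downsample_by_4(const short *in, size_t n_in, short *out)
{
    size_t n_out = 0;
    for (size_t i = 0; i + TAPS <= n_in; i += FACTOR) {
        long acc = 0;
        for (size_t t = 0; t < TAPS; t++)   /* crude low-pass: average */
            acc += in[i + t];
        out[n_out++] = (short)(acc / TAPS);
    }
    return n_out;  /* number of 11025 Hz samples produced */
}
```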
In Figure 2, the feature extraction module and the recognition model are the most important components. In our system we use MFCC features of the speech signal, since these have been proved to be good features for speech recognition. To form a feature vector, we first divide the signal into frames; then, for each frame, a feature vector is calculated comprising 13 MFCC values together with their first and second derivatives. Assume that x[0..L-1] is the speech signal; the k-th frame of the speech is constructed as:

s[n] = x[k*N + n] for n = 0..K-1

where K is the frame length and N is the frame shift. From the 13 MFCC values m[0..12], the first and second derivatives are calculated as follows:

m1[0] = 0; m1[k] = m[k] - m[k-1] for k = 1..12
m2[0] = 0; m2[k] = m1[k] - m1[k-1] for k = 1..12

where m1 and m2 are the first and second derivatives respectively. In the training phase, the feature vectors are used to adjust the HMM parameters (the number of Gaussian mixtures, the transition probability matrix, and the observation probability matrix) for the best fit of the models to the input data. Table I illustrates the experimental results of our recognition system with MFCC features and hidden Markov models with six left-to-right states and twenty Gaussian mixtures.
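A minimal C sketch of assembling the 39-dimensional feature vector from the 13 MFCC values using the difference formulas above; the function name and float layout are illustrative assumptions (production systems often use regression-based deltas instead of simple differences).

```c
#define N_MFCC 13

/* Build a 39-dimensional feature vector from 13 MFCC values by
 * appending first and second differences, as defined in the text. */
void make_feature_vector(const float m[N_MFCC], float feat[3 * N_MFCC])
{
    float m1[N_MFCC], m2[N_MFCC];
    m1[0] = 0.0f;
    m2[0] = 0.0f;
    for (int k = 1; k < N_MFCC; k++)
        m1[k] = m[k] - m[k - 1];          /* first derivative  */
    for (int k = 1; k < N_MFCC; k++)
        m2[k] = m1[k] - m1[k - 1];        /* second derivative */
    for (int k = 0; k < N_MFCC; k++) {
        feat[k]              = m[k];
        feat[N_MFCC + k]     = m1[k];
        feat[2 * N_MFCC + k] = m2[k];
    }
}
```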
[Fig. 2 block diagram: input speech is downsampled to 11025 Hz, features are extracted, and the trained model (produced by the model training step) yields the recognition output.]
Table I Speech recognition results [columns: Speakers, Trained, Test Number, Accuracy; training data: 1100 files; HMM: 6 states, 20 Gaussian mixtures]
Note that, in our recognition engine, we separate the training phase from the recognition phase in order to reduce the system resources used on T-Engine. The training phase is implemented on a PC with speech data collected from T-Engine; only the recognition module is implemented on T-Engine.
IV VIETNAMESE SPEECH SYNTHESIS IN T-ENGINE
Previous research in Vietnamese speech synthesis has indicated that PSOLA is an effective method to synthesize speech, based on the concatenation of diphones with amplitude and tone balancing. PSOLA is not only a good-quality method but also a speed-optimal one. Hence, PSOLA is very suitable for implementation in embedded systems, which is why we use it in our speech-based human-machine interaction product on T-Engine. However, Vietnamese has some specific characteristics that make speech synthesis somewhat different from other languages. The following traits in particular must be considered in a Vietnamese TTS system: Vietnamese is a monosyllabic and tonal language with six tones, each corresponding to a different variation pattern of the fundamental frequency (f0) of speech. Because of these features, there are two common ways of synthesizing Vietnamese tones. The first method is to modify f0 to obtain the corresponding tone; this reduces the size of the diphone database and the complexity of f0 balancing considerably. However, the quality of the speech is not very good, especially for the "~" and "?" tones, as in "bão" and "bảo", because the amplitude changes in a very complicated way together with f0.
The second method is to concatenate diphones with already recorded tones. In this manner, the size of the data increases noticeably, but the tones are reproduced exactly as in natural speech, so the speech quality is quite good. However, f0 balancing when concatenating recorded-tone diphones is a little more difficult. To solve this problem, we cut the diphones into frames, each frame being one speech period.
Fig.3 Speech signal frames
Then, the frames are multiplied by a Hamming window.
Fig.4 Speech signal frames multiplied by a Hamming window
To keep the f0 contour smooth, the frames are overlapped with the desired period.
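The windowing step can be sketched as follows (an illustration only; the in-place convention and function name are assumed, not from the paper):

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Multiply one extracted frame (length N) by a Hamming window in
 * place, as illustrated in Fig. 4. */
static void hamming(float *frame, long N)
{
    for (long n = 0; n < N; n++)
        frame[n] *= 0.54f - 0.46f * cosf(2.0f * (float)M_PI * n / (N - 1));
}
```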
The two frames at the point of contact are used for power balancing between the two diphones. Assume that x(n) and y(n) are the contact frames, each of length N; we compute a power factor p (the original formula is lost, and is reconstructed here as the ratio of the two frames' RMS amplitudes, which matches its use below):

p = sqrt( (x(0)^2 + ... + x(N-1)^2) / (y(0)^2 + ... + y(N-1)^2) )
Then the overlapping is done with the second diphone's frames multiplied by p. Assume that x[0..L-1] is the current speech signal, y[0..N-1] is the next frame, and K is the period. The synthesized signal s[] is calculated by:

s[n] = x[n] for n = 0..L-N/2-1
s[n] = x[n] + p*y[n-L+N/2] for n = L-N/2..L-N/2+K-1
s[n] = p*y[n-L+N/2] for n = L-N/2+K..L+K
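The following C sketch implements this overlap-add step under the assumption that a frame spans two pitch periods (N = 2K), so the appended tail begins exactly at the old signal end; the power factor follows the RMS ratio above. Names and buffer conventions are ours, not from the paper.

```c
#include <math.h>

/* Power factor between contact frames x and y of length N, assumed
 * here to be the RMS ratio, so the joined diphones match in energy. */
static float power_factor(const float *x, const float *y, long N)
{
    double ex = 0.0, ey = 0.0;
    for (long n = 0; n < N; n++) {
        ex += (double)x[n] * x[n];
        ey += (double)y[n] * y[n];
    }
    return (ey > 0.0) ? (float)sqrt(ex / ey) : 1.0f;
}

/* Overlap-add the next (already Hamming-windowed) frame y[0..N-1]
 * onto the signal s[0..L-1], with pitch period K and power factor p,
 * following the formulas above. Assumes N = 2*K; the caller must
 * ensure s can hold L + K samples. Returns the new length L + K. */
static long overlap_add(float *s, long L, const float *y, long N,
                        long K, float p)
{
    long j = L - N / 2;                    /* frame start within s */
    for (long n = j; n < j + K; n++)       /* overlap region       */
        s[n] += p * y[n - j];
    for (long n = j + K; n < L + K; n++)   /* new tail samples     */
        s[n] = p * y[n - j];
    return L + K;
}
```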
To reduce the memory needed to implement the above algorithm on the T-Engine embedded system, we store each diphone in a separate data file together with an index file. In this way, only two diphones are loaded in memory at the same time, so memory usage is reduced considerably. Table II shows the structure of the database index file.
Table II Diphone index file structure
Length   Information
2 BYTE   End-point of the first frame, which is also the start-point of a period of the vowel. This field is available for the first diphone only.
2 BYTE   End-point of the n-th period (one such field per period).
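As an illustration of how such an index file might be read on the device (the byte order and the end-of-file convention are not specified in the paper and are assumptions here):

```c
#include <stdio.h>
#include <stdint.h>

/* Read the 2-byte period end-points of one diphone from its index
 * file, following the Table II layout. Little-endian byte order and
 * reading until end-of-file are assumed, not stated in the paper. */
static int read_index(const char *path, uint16_t *endpoints, int max)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    int n = 0;
    unsigned char b[2];
    while (n < max && fread(b, 1, 2, f) == 2)
        endpoints[n++] = (uint16_t)(b[0] | (b[1] << 8));  /* LE */
    fclose(f);
    return n;  /* number of end-points read */
}
```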
All the data in the table above is computed manually to ensure the quality of the synthesized speech. In order to speed up the creation of the database, we have built a database tool supporting automatic frame detection. This tool is very useful for creating the database with less effort: it can produce a pitch contour automatically from a wave data file with high accuracy, and then save the contour to a database index file as described in Table II.
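The paper does not describe the tool's detection algorithm; one common choice for pitch period estimation is autocorrelation, sketched below with assumed parameter ranges.

```c
/* Estimate the pitch period of a speech frame by autocorrelation:
 * the lag with the highest correlation inside the plausible pitch
 * range is taken as the period. Illustration only; for 11025 Hz
 * speech, min_lag/max_lag of roughly 28..220 samples covers about
 * 50..400 Hz. */
static long pitch_period(const float *x, long n, long min_lag, long max_lag)
{
    long best_lag = min_lag;
    double best = -1.0;
    for (long lag = min_lag; lag <= max_lag && lag < n; lag++) {
        double r = 0.0;
        for (long i = 0; i + lag < n; i++)
            r += (double)x[i] * x[i + lag];
        if (r > best) { best = r; best_lag = lag; }
    }
    return best_lag;  /* estimated period in samples */
}
```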
Fig.7 Screenshot: waiting for speech commands from the user
V APPLICATION OF VIETNAMESE RECOGNITION AND SYNTHESIS
Vietnamese speech recognition and synthesis have a wide range of applications, especially in human-computer interaction (HCI).
Fig.6 Screenshot of the Ho Chi Minh Museum introduction
To demonstrate the use of speech in HCI, we have combined speech recognition with speech synthesis in our software running on T-Engine. This software allows users to issue speech commands to query information about places in Ha Noi. Figure 7 illustrates the main screen of the software: the user sees a map of Ha Noi with some places shown with bold titles. When a user reads out a title, for example "Bao tang Ho Chi Minh", the software tells the user some information about that place. Figure 6 is a screenshot of the Ho Chi Minh Museum introduction.
VI CONCLUSIONS
This paper has presented an advanced method for Vietnamese speech synthesis from a database of diphones with already recorded tones. We have experimented with the implementation of a human-computer interaction system on the T-Engine embedded system; moreover, our enhancements and optimizations allow the system to be implemented on low-resource embedded systems. The complete experiment consists of two parts, recognition and synthesis; Table I illustrates some recognition results with different voices.