
MODELING OF NON-NATIVE SPEECH FOR AUTOMATIC SPEECH RECOGNITION

DEPARTMENT OF COMPUTER SCIENCE

NATIONAL UNIVERSITY OF SINGAPORE

2011


Acknowledgements

This thesis would not have been possible without the guidance and help of several individuals who, in one way or another, contributed and extended their valuable assistance throughout the research.

First and foremost, my utmost gratitude goes to Dr Sim Khe Chai, Assistant Professor at the School of Computing (SoC), National University of Singapore (NUS), whose sincerity and encouragement I will never forget. Dr Sim has been my inspiration as I hurdled all the obstacles in the completion of this research work.

Mr Li Bo, PhD candidate in the Computer Science Department of the National University of Singapore, offered unselfish and unfailing advice in implementing my research project.

Mr Wang Xuncong, PhD candidate in the Computer Science Department of the National University of Singapore, shared valuable suggestions on the fundamental knowledge relevant to automatic speech recognition systems.

Dr Mohan S Kankanhalli, Professor in the Department of Computer Science, NUS, showed kind concern and gave suggestions regarding my academic requirements.

Last but not least, I thank my husband and my friends for giving me the strength to plod on with study and research in the computer science department, which is a big challenge for me as my bachelor's degree is in Electrical and Electronic Engineering.


Contents

2.3.1 The Phoneme and State in the ASR
2.3.2 The Theory of Hidden Markov Model
2.3.3 HMM Methods in Speech Recognition
2.3.4 Artificial Neural Network
2.3.5 ANN Methods in Speech Recognition
2.4.1 SD and SI Acoustic Model
2.4.2 Model Adaptation using Linear Transformations (MLLR)
2.4.3 Model Adaptation using MAP


Chapter 4
4.2 Project 1: Mixture-Level Mapping from the Mandarin Acoustic Model to the English Acoustic Model
4.2.1 Overview of the Method in Project 1
4.2.2 Step Details in Project 1
4.3.1 Overview of the Method in Project 2
4.3.2 Step Details in Project 2
4.4.1 Overview of the Method in Project 3
4.4.2 Step Details in Project 3
4.5.1 The Achievement and Problems in the Acoustic Model
4.5.2 The Achievement and Problems in the Lexicon Model


Summary

Heavily accented non-native speech represents a significant challenge for automatic speech recognition (ASR). Globalization further emphasizes the urgency of research to address these challenges. An ASR system consists of three parts: acoustic modeling, lexical modeling and language modeling. In this thesis, the author first gives a brief introduction to the research topic and the work that has been done in Chapter 1. In Chapter 2, the author explains the fundamental knowledge of the ASR system; the concepts and techniques illustrated in that chapter are applied in the following chapters, especially Chapter 4. In Chapter 3, the author presents her literature review, which introduces the current concerns in the natural language processing field, the challenges, and the major approaches to address those challenges. In Chapter 4, the author presents the research she has done so far. Two projects are carried out to improve the acoustic model for recognizing Mandarin-accented non-native speech. Another project targets improving the lexicon model of word pronunciations for multi-national speakers. The project process flows and step details are all covered. In Chapter 5, the author discusses her achievements and the problems regarding the results from the three projects, and then gives her conclusions and recommendations for the work she has done for this thesis.


List of Figures

Figure 2.1 Automatic speech recognition
Figure 2.2 Feature extraction
Figure 2.3 Feature extraction by filter bank
Figure 2.4 Mel filter bank coefficient
Figure 2.5 Hidden Markov models
Figure 2.6 ASR using HMM
Figure 2.7 Train HMM from multiple examples
Figure 2.8 Illustration of a neural network
Figure 2.9 Sigmoidal function for the NN activation node
Figure 2.10 Feedforward neural network (left) vs recurrent neural network (right)
Figure 2.11 ANN applied in the acoustic model
Figure 2.12 Regression tree for MLLR
Figure 2.13 Dictionary format in HTK
Figure 2.14 The grammar-based language model
Figure 3.1 Summary of acoustic modeling results
Figure 3.2 Procedure for constructing the custom model
Figure 3.3 MAP and MLLR adaptation
Figure 3.4 Performance with various interpolation weights
Figure 3.5 Best results of various systems
Figure 3.6 Diagram of the Hispanic-English multi-pass recognition system
Figure 3.7 Example of the match between target and source states
Figure 4.1 The voice recording command inputs and outputs
Figure 4.2 Project 1 process flow
Figure 4.3 The inputs and output of MFCC feature extraction
Figure 4.4 The inputs and output of the speech recognition command
Figure 4.5 The inputs and outputs of regression tree generation
Figure 4.6 Inputs and outputs of MLLR adaptation
Figure 4.7 Modified model illustration
Figure 4.8 The results for Project 1
Figure 4.9 Project 2 process flow
Figure 4.10 NN model 1 training
Figure 4.11 NN model 2 training (PoE)
Figure 4.12 The inverse function of the Gaussian
Figure 4.13 Project 3 process flow
Figure 4.14 Project 3 step 3 process flow
Figure 4.15 Example of combine.list
Figure 4.16 Example of the result of find_general.perl
Figure 4.17 Example of the dictionary


List of Tables

Table 2.1 The 39 CMU phoneme set
Table 2.2 The probability of the pronunciations
Table 3.1 Word error rate for the CSLR Sonic recognizer
Table 3.2 Word error rate (%) by model
Table 4.1 Non-native English collection data
Table 4.2 Mandarin mixture modified empty English model
Table 4.3 Results from the NN PoE
Table 4.4 Results for Project 3


However, with globalization and the widespread emergence of speech applications, the need for more flexible ASR has never been greater. A flexible ASR system can serve multiple users, operate in noisy environments, and even be applicable to non-native speakers. In this research, the author focuses on improving ASR performance for non-native speakers.

There are many challenges in tackling the problems arising from non-native ASR. Firstly, there is a lack of non-native speech resources for model training. Some researchers have attempted to address this problem by adapting the native acoustic model with limited non-native speech data ([10], [13], [15], [18]). Secondly, non-native speakers have different nationalities and thus different accents; to address this, some researchers have tried MLLR, MAP, interpolation, and state-level mapping ([9], [10], [11], [12], [26]). Thirdly, even non-native speakers with the same hometown accent are at different levels of proficiency in the target language. These problems cause non-native speech recognition accuracy to drop significantly.

In the research covered in Chapter 4 of this thesis, the author focuses on two core parts of the ASR system to improve the accuracy of non-native speech recognition. Firstly, the author attempts to improve the acoustic model of the ASR system; non-native acoustic modeling is an essential component of many practical ASR systems, and two projects address this problem. Secondly, the author also explores an issue in the lexicon model. For non-native speakers, the pronunciations of some words differ from those of native speakers. To make matters worse, because of the immature accents of non-native speakers, discriminating between words with similar pronunciations becomes difficult. In this thesis, one project targets this problem.

This thesis uses many technical terms, concepts and techniques from the field of automatic speech recognition. To help the reader understand them clearly, the author has written all the key background knowledge in Chapter 2. In addition, many previous researchers have attempted to solve these problems with various approaches, which also give insight into the author's later projects. The development history of ASR and those researchers' approaches to issues similar to the ones in this thesis are included in Chapter 3. Chapter 5 gives the conclusions and recommendations for possible future work.


Chapter 2

Basic Knowledge in ASR System

In this chapter, the author describes the basic knowledge of the ASR system. First, the author gives an overview of the ASR system. After that, the author describes the acoustic feature format in frequent recent use and its corresponding extraction technique. Then, the author covers the acoustic model, lexicon model and language model one by one. Understanding the acoustic model is very important for understanding the ASR system, so the author goes into more depth in that part, including some advanced techniques used in acoustic modeling.

2.1 Overview of Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR) is a system that processes a speech waveform file into text; it converts audio captured by a microphone into text stored on the computer.

An ASR system generally consists of three models: the acoustic model, the lexical model and the language model. As illustrated in Figure 2.1, the audio waveform file is first converted into a feature file of reduced size, which is then matched against the acoustic model.


Figure 2.1 Automatic Speech Recognition

The acoustic model accepts the feature file as input and produces a phone sequence as output. The lexical model is the bridge between the acoustic model and the language model; it models each vocabulary item in the speech with one or more pronunciations, which is why the lexical model is sometimes called a dictionary in the field. The language model accepts the candidate word sequences produced by the lexical model as input and produces the more common and grammatically accurate sentence. If all the models are well trained, the ASR system is capable of generating the corresponding sentence in text format from natural human speech with a low error rate.

On the other hand, each of the models in the ASR system can be viewed as solving ambiguity at one of the language processing levels. For example, the acoustic model disambiguates one sequence of feature frames from another and categorizes them into a sequence of phones. The lexical model disambiguates one sequence of phones from another and categorizes them into a sequence of words. The language model makes the disambiguation done by the lexical model more accurate by assigning higher probability to a more frequently occurring word sequence, or by integrating knowledge of sentence grammar.

2.2 Feature Extraction

As previously mentioned, the acoustic model requires input in the form of a feature file instead of a waveform file. This is because the feature file is much smaller than the waveform file, while still retaining the important information and discarding redundant and disturbing data, such as noise.

To extract a feature file from a waveform file, many parameters have to be predefined. First, we need to define the sampling rate for the feature vectors; this sampling time interval is usually called the frame period. For every frame period, a parameter vector is generated from the waveform file and stored in the feature file. This parameter vector is based on the magnitude of the frequency spectrum for a segment of audio waveform around the frame sampling point. Normally, the duration of this segment is longer than the frame period, and we call it the window duration. Thus, nearby frames contain some overlapping information.

Figure 2.2 Feature Extraction


For example (Figure 2.2), if the waveform is sampled at 16 kHz (0.0625 ms per sample), the window duration is defined as 25 ms and the frame period as 10 ms. Thus, neighbouring frames overlap by 15 ms, and every window contains 400 samples of the speech waveform.
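As a quick check of this arithmetic, a minimal sketch in plain Python (no external libraries); the numbers simply mirror the example above:

```python
sample_rate_hz = 16000          # 16 kHz sampling rate
window_duration_s = 0.025       # 25 ms analysis window
frame_period_s = 0.010          # 10 ms frame period (hop between frames)

samples_per_window = int(sample_rate_hz * window_duration_s)   # 400 samples per window
overlap_ms = (window_duration_s - frame_period_s) * 1000        # 15 ms overlap
print(samples_per_window, overlap_ms)                           # 400, 15.0
```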

So far, three forms of extracted feature vectors are most widely used. Linear Prediction Coefficients (LPC) [1] were used very early in speech processing, but are seldom used now. Mel Frequency Cepstral Coefficients (MFCC) [2] are the default parameterization for many speech recognition applications, and ASR systems built on MFCC feature files show competitive performance. Perceptual Linear Prediction coefficients (PLP) [3] were developed more recently. In the following part, the author focuses only on the MFCC feature extraction technique.

To understand MFCC features well, we should first understand the concept of a filter bank. Filter banks are used to extract information from the spectral magnitude of each window period (Figure 2.3). A filter bank is a series of triangular filters on the frequency spectrum. To apply the filter bank, a window of speech data is transformed using a Fourier transform and the magnitude of the frequency spectrum is obtained. The spectral magnitude is then multiplied by the corresponding filter gains from the filter bank and the results are integrated. Therefore, each filter-bank channel outputs one feature parameter, which is the integral of that product along the channel. The length of the feature vector in Figure 2.2 depends on the number of feature parameters; in other words, it depends on the number of filter banks defined over the frequency spectrum.


Figure 2.3 Feature extraction by filter bank 1

In principle, the triangular filters could spread over the whole frequency range and be unlimited in number. In practice, lower and upper frequency cut-offs are defined; for example, information is sampled only from the frequency range 300 Hz to 3400 Hz, and a fixed number of triangular filters is spread over this range.

Mel Frequency Cepstral Coefficients (MFCCs) are coefficients calculated from the feature vector produced by the filter banks. However, the filter banks used for MFCC are not equally spaced on the frequency axis; we call this a Mel filter bank. Humans can distinguish sounds better at lower frequencies than at higher frequencies, and practical evidence also suggests that a non-linearly spaced filter bank outperforms an equally spaced one. The Mel filter bank is designed to model this non-linear listening characteristic of the human ear. The equation below defines the mel scale:

$$\mathrm{Mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$

The Mel filter banks are located according to the mel scale: all the triangular filters are equally spaced in mel-scale value rather than in frequency (Figure 2.4).

1 K. C. Sim, "Automatic Speech Recognition", Speech Processing lecture notes, p. 7, 2010
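A small NumPy sketch of how triangular filter centre frequencies can be placed uniformly on the mel scale between two cut-offs; the 300-3400 Hz range comes from the example above, while the number of filters is an arbitrary illustrative choice:

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale as defined above."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of the mel-scale mapping."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

low_hz, high_hz, n_filters = 300.0, 3400.0, 20
# Equally spaced points on the mel scale, mapped back to Hz (filter edges/centres)
mel_points = np.linspace(hz_to_mel(low_hz), hz_to_mel(high_hz), n_filters + 2)
centre_hz = mel_to_hz(mel_points)
print(np.round(centre_hz, 1))   # centres bunch together at low frequencies
```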


Figure 2.4 Mel filter bank coefficient 1

After obtaining the feature vector from the non-linear filter bank, the Mel-Frequency Cepstral Coefficients (MFCCs) can be calculated by the following formula:

$$c_i = \sqrt{\frac{2}{N}}\ \sum_{k=1}^{N} m_k \cos\!\left(\frac{\pi i\,(k - 0.5)}{N}\right), \qquad i = 1, \dots, M$$

where $m_k$ is the $k$th feature value from the Mel filter bank, and $N$ and $M$ are the number of filter banks and the number of MFCC coefficients, respectively.

MFCCs are the waveform compression format used by many speech recognition applications. They give good discrimination and can be transformed into other forms easily. Compressing the waveform into features still loses some information and may reduce the robustness of the acoustic model being trained. However, for a given utterance the waveform file is about 6 times larger than the feature file, so training on waveforms also takes much longer than training on features. That is why feature extraction is so widely used in the natural language processing field.
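For illustration, a minimal NumPy sketch of the final cosine-transform step above, assuming the Mel filter-bank outputs (here taken in the log domain) have already been computed; the filter-bank construction itself is omitted and the input values are random stand-ins:

```python
import numpy as np

def mfcc_from_fbank(fbank_values, num_ceps=12):
    """Apply the cosine-transform formula above to one frame of Mel filter-bank values."""
    N = len(fbank_values)                        # number of filter banks
    i = np.arange(1, num_ceps + 1)[:, None]      # cepstral indices 1..M
    k = np.arange(1, N + 1)[None, :]             # filter-bank indices 1..N
    basis = np.cos(np.pi * i * (k - 0.5) / N)    # cosine basis from the formula
    return np.sqrt(2.0 / N) * basis @ fbank_values

# Example: 26 log filter-bank energies for a single 25 ms frame (illustrative values)
frame_fbank = np.log(np.random.rand(26) + 1e-6)
print(mfcc_from_fbank(frame_fbank).shape)        # (12,)
```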

2.3 Acoustic Models

1 K. C. Sim, "Automatic Speech Recognition", Speech Processing lecture notes, p. 8, 2010.


2.3.1 The phoneme and state in the ASR

In a speech utterance, it is easy to understand what is referred to as a sentence and what is referred to as a word. The author will explain more about the phoneme and the state as used in the natural language processing field.

Speakers and listeners divide words into component sounds, which are phonemes. In this research, we use the Carnegie Mellon University pronouncing dictionary [71]; the phoneme set in this dictionary contains 39 phonemes (Table 2.1). The CMU dictionary is a machine-readable pronunciation dictionary for North American English. The format of this dictionary is very useful in speech recognition and synthesis applications. The vowels in the CMU phoneme set carry lexical stress.

Table 2.1 The 39 CMU Phoneme Set

Vowel Phonemes: Consonant Phonemes:


Some phonemes have similar spectra; for example, "sh" has a similar spectrum to "s", but "s" contains very high-frequency energy at around 4.5 kHz, while "sh" is lower in frequency because the tongue is further back.

From the previous section, we know that feature extraction gives us spectral feature information at a rate of one vector every 10 ms (one frame period). One phoneme can span several to tens of frames, depending on the speaker's rate and the context. In order to model this flexibility, we further divide a phoneme into several consecutive states. There is no standard definition of the number of states for a particular phoneme, and different researchers use slightly different definitions for different purposes. The author uses about 3 emitting states per phoneme. Each state captures the distinguishing features of that phoneme over a defined window period, and a state may repeat for multiple frames before the model jumps to the next state of the phoneme, because people differ in how long they take to pronounce a phoneme.

2.3.2 The Theory of Hidden Markov Model

The most widely adopted statistical acoustic models in language processing are Hidden Markov Models (HMMs). An HMM is a finite-state transducer and is defined by states and the transitions between states (Figure 2.5).

In a regular Markov model, the state is directly visible to the observer, so the model only has state transition probability parameters. In a Hidden Markov Model, the state is not directly visible; only the output, which depends on the state, is visible. Therefore, the HMM has both an output distribution for each state and the state transition probability parameters.


Figure 2.5 Hidden Markov Models

As we can see from Figure 2.5, an HMM consists of a number of states. Each state j has an associated observation probability distribution b_j(o_t), which gives the probability of generating observation o_t at a particular time t. Here o_t is the feature vector (MFCC) illustrated previously, sampled once per frame period over a window duration. Each pair of states i and j has a modeled transition probability a_ij. All these model parameters are obtained by statistical, data-driven training. In the experiments, the entry state 1 and the exit state N of an N-state HMM are non-emitting states.

Figure 2.5 shows an HMM with five states; such an HMM can be used to model a particular phoneme. The three emitting states (2-4) have output probability distributions associated with them. The probability distribution of each state is represented by a Gaussian mixture density. For example, for state j the probability b_j(o_t) is given by


$$b_j(o_t) = \sum_{m=1}^{M_j} c_{jm}\,\mathcal{N}(o_t;\ \mu_{jm},\ \Sigma_{jm})$$

where $M_j$ is the number of mixture components in state $j$, $c_{jm}$ is the weight of component $m$, and $\mathcal{N}(\cdot;\ \mu, \Sigma)$ is a multivariate Gaussian with mean vector $\mu$ and covariance matrix $\Sigma$.

The standard Hidden Markov Model makes two important assumptions.

1. Instantaneous first-order transition: the probability of making a transition to the next state depends only on the current state and is independent of the earlier state history.

2. Conditional independence assumption: the probability of observing a feature vector depends only on the current state and is independent of the earlier observations.

2.3.3 HMM Methods in Speech Recognition

In the ASR system, the role of the acoustic model is to decide the most likely phoneme sequence for a given series of observations. The Hidden Markov Model produces the phoneme sequence in the following way.

Firstly, each phoneme model is associated with a sequence of observations.

The following maximum a-posteriori (MAP) probability is maximized to find the best phoneme candidate w for the given series of observations O:

$$\hat{w} = \arg\max_{w} P(w \mid O)$$

The posterior probability cannot be modeled directly in the acoustic model, but it can be calculated indirectly from the likelihood according to Bayes' rule:

$$P(w \mid O) = \frac{P(O \mid w)\,P(w)}{P(O)}$$

In the above formula, the probability P(O) of the observations is the same for every candidate, and the prior probability P(w) of a particular phoneme sequence is accounted for by the lexical model or the language model, so it can be ignored in the acoustic model. Thus, the posterior probability can be further simplified as follows, where only the likelihood of the observations given the phoneme sequence matters:

$$P(w \mid O) \propto P(O \mid w)$$

The likelihood can be calculated directly from the HMM models. For example, consider a six-state model (Figure 2.6). For a given sequence of observations, there are many possible state sequences from the left-most non-emitting state to the right-most non-emitting state. Take one possible sequence as an example: the sequence X = 1, 2, 2, 3, 4, 4, 5, 6 is assumed for six frames of observations. In Figure 2.6, when o_1 is read into the HMM, the model jumps from state 1 to state 2 with transition probability a_12 = 1, and the Gaussian mixture in state 2 gives the likelihood of observation o_1 in state 2, which is b_2(o_1). According to sequence X, the model stays in state 2 for one more frame, with transition probability a_22 and likelihood b_2(o_2) for observation o_2. When o_3 arrives, sequence X jumps to the third state with transition probability a_23 and likelihood b_3(o_3). The outgoing transition probabilities of a state always sum to 1; in the case of state 2,

$$a_{22} + a_{23} = 1$$

Continuing in this way, the whole sequence X = 1, 2, 2, 3, 4, 4, 5, 6 is processed through the six-state HMM phoneme model, and the joint likelihood of the observation frames and the sequence X given this phoneme model is

$$P(O, X \mid M) = a_{12}\, b_2(o_1)\, a_{22}\, b_2(o_2)\, a_{23}\, b_3(o_3) \cdots$$

Because multiple state sequences exist in a single phoneme model, the likelihood that a segment of observations belongs to a phoneme model is the sum of the likelihoods of all possible state sequences for that phoneme model.

In practice, the HMM is trained by statistically adjusting the model parameters to fit the training data. This requires a huge amount of data to train a good model. Generally, an English acoustic HMM is trained with about 300 hours of speech; a Mandarin acoustic HMM requires even more, because the Mandarin phone set contains more phonemes.

1 S. Young, G. Evermann et al., "The HTK Book", version 3.4, Cambridge University Engineering Department, p. 13, 2006
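To make the path-likelihood product above concrete, a small Python sketch that scores one state path through a toy left-to-right HMM; the transition matrix, Gaussian parameters and observations are made-up illustrative values, not taken from the thesis:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy 4-state HMM: states 0 and 3 are non-emitting, states 1 and 2 emit,
# with a single 2-dimensional Gaussian per emitting state (a mixture would be a weighted sum).
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.6, 0.4, 0.0],
              [0.0, 0.0, 0.7, 0.3],
              [0.0, 0.0, 0.0, 1.0]])
means = {1: np.array([0.0, 0.0]), 2: np.array([2.0, 2.0])}
covs  = {1: np.eye(2),            2: np.eye(2)}

def b(state, obs):
    """Output probability b_j(o_t) for one emitting state."""
    return multivariate_normal.pdf(obs, means[state], covs[state])

def path_likelihood(obs_seq, state_path):
    """Joint likelihood P(O, X | M) for one state path X, as in the product above."""
    p, prev = 1.0, state_path[0]
    for obs, s in zip(obs_seq, state_path[1:]):
        p *= A[prev, s] * b(s, obs)
        prev = s
    return p * A[state_path[-1], 3]              # final jump to the exit state

O = [np.array([0.1, -0.2]), np.array([0.3, 0.1]), np.array([1.9, 2.2])]
X = [0, 1, 1, 2]                                  # entry, stay in state 1, move to state 2
print(path_likelihood(O, X))
```

Summing this quantity over all admissible paths X would give the total likelihood of the observations for this phoneme model, which is what the forward algorithm computes efficiently.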


Figure 2.7 Train HMM from Multiple Examples 1

2.3.4 Artificial Neural Network

A neural network usually refers to an artificial neural network, which is composed of artificial neurons or nodes. When inputs are fed into the model, a weight is assigned to every input node, and the weighted sum of the scaled inputs is passed through a transfer function at the activation node; mostly we use a log-sigmoid function, also known as a logistic function, whose curve is shown in Figure 2.9. The outputs of these functions then become the input nodes of the next layer, and so on; there can be several intermediate layers in different neural network systems. In our research, we use a feedforward three-layer neural network. Each layer is essentially a simple mathematical model defining a function, but the model parameters are also intimately associated with a particular learning algorithm or learning rule. The back-propagation algorithm is currently used in many neural network applications.

Figure 2.8 Illustration of a neural network

1 K. C. Sim, "Acoustic Modelling Hidden Markov Model", Speech Processing lecture notes, p. 8, 2010

Figure 2.9 Sigmoidal function for NN activation node

A feedforward neural network is an artificial neural network in which the connections between nodes do not form a directed cycle, unlike in recurrent neural networks (Figure 2.10).

Figure 2.10 feedforward neural network (left) vs recurrent neural network (right)

The back-propagation learning algorithm plays an important role in the neural network training process. It can be divided into two phases: propagation and weight update.

Phase 1: Propagation

Each propagation involves the following steps:

Forward propagation of a training pattern's input: each layer's output becomes the next layer's input, until finally the output layer's results are obtained.


Backward propagation of the error: at the output layer, the target results are compared with the network output, and the targets (error terms) of the previous layers are computed from the current layer's error back through the model, ending when the first layer is reached.

Phase 2: Weight update

For each weight arc:

Here, we multiply the output delta (the difference between the output and the target output) by the input activation (the input node value) to get the gradient of the weight. Then we move the weight in the direction opposite to the gradient by subtracting a fraction of the gradient from the weight.

This fraction influences the speed and quality of learning; we call it the learning rate. The sign of the gradient indicates where the error is increasing, which is why the weight must be updated in the opposite direction. Phases 1 and 2 are repeated until the performance of the network is good enough.
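As an illustration of the two phases, a minimal NumPy sketch of one training step for a three-layer feedforward network with sigmoid activations; the layer sizes, data and learning rate are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))   # input->hidden, hidden->output
x, target = rng.normal(size=4), np.array([1.0, 0.0])
lr = 0.1                                                     # learning rate

# Phase 1: forward propagation, each layer's output feeds the next layer
h = sigmoid(x @ W1)
y = sigmoid(h @ W2)

# Phase 1 (continued): back-propagate the error from the output layer to the hidden layer
delta_out = (y - target) * y * (1 - y)          # output delta times sigmoid derivative
delta_hid = (delta_out @ W2.T) * h * (1 - h)

# Phase 2: weight update, gradient = input activation (outer product) output delta
W2 -= lr * np.outer(h, delta_out)
W1 -= lr * np.outer(x, delta_hid)
```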

QuickNet is a suite of software that facilitates the use of multi-layer perceptrons (MLPs) in statistical pattern recognition systems. It is primarily designed for use in speech processing but may be useful in other areas.

2.3.5 ANN Methods in Speech Recognition

Recently, many research works have also taken advantage of multi-layer perceptron (MLP) neural networks to train the acoustic model and recognize the result using tandem connectionist feature extraction.1 A Hidden Markov Model typically uses Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to observations of the states of the phonemes. In contrast, an artificial neural network model uses discriminative training to estimate the probability distribution over states given the acoustic observations. The traditional HMM is faster in training and recognition, and has better time alignment than the neural network acoustic model, especially when the HMM is context dependent. On the other hand, the neural network model can capture the state boundaries better, and it is flexible to manipulate and modify for different applications.

1 Tandem feature extraction: using the output of a neural network classifier as the input features for the Gaussian mixture models of a conventional speech recognizer (HMM); the resulting system effectively has two acoustic models in tandem.

The general idea of applying an ANN in language processing is to use the state probability outputs of the ANN as the input to a standard Hidden Markov Model; the final result can then be recognized directly using the HTK toolkit.1 In detail, we need to convert the MFCC feature files into pfiles. The manipulation of pfiles and the neural network training all require the QuickNet toolkit.2 The pfiles contain all the MFCC coefficient information for every frame sample, and they have to be combined into a single pfile. The neural network is trained towards target output values. Therefore, we have to first prepare a good acoustic model to obtain an alignment for the training data, and then convert the alignment into the required format, which is an ilab file (Figure 2.11).

Figure 2.11 ANN applied in acoustic model
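A rough NumPy sketch of the tandem idea itself, turning per-frame MLP state posteriors into features for a GMM-HMM front end; the posterior matrix here is a random stand-in, and the actual pipeline in the thesis uses QuickNet pfiles and HTK tools rather than this code:

```python
import numpy as np

rng = np.random.default_rng(1)
frames, n_states = 200, 117                      # e.g. 39 phones x 3 states (illustrative)
logits = rng.normal(size=(frames, n_states))

# MLP state posteriors per frame (softmax over states)
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Tandem features: take logs to Gaussianize, then decorrelate with a PCA-style projection
log_post = np.log(posteriors + 1e-10)
log_post -= log_post.mean(axis=0)
_, _, vt = np.linalg.svd(log_post, full_matrices=False)
tandem_features = log_post @ vt[:30].T           # keep 30 dimensions for the GMM-HMM
print(tandem_features.shape)                     # (200, 30)
```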


2.4 Adaptation Techniques in Acoustic Model

2.4.1 SD and SI Acoustic model

The performance of an acoustic model differs depending on its training dataset.

A speaker-dependent (SD) acoustic model is an acoustic model that has been trained using speech data from one particular person. An SD model will recognize that particular person's speech well, but it is not a recommended model for recognizing speech from other people. However, an acoustic model trained on many speakers can gradually move towards a speaker-dependent model when adaptation techniques are applied.

A speaker-independent (SI) acoustic model is trained using speech data contributed by a large population. A speaker-independent acoustic model may not perform as well as a speaker-dependent acoustic model for a particular speaker, but it performs much better over a general population, since the SI model captures the general characteristics of a group of people. Furthermore, the SD model captures intra-speaker variability better than the SI model does, while the SI model captures inter-speaker variability better than the SD model does.

Usually, the dataset used in training deviates somewhat from the speech data being tested, and the acoustic model performs worse when the deviation is large, for example American-accented English speakers versus British-accented English speakers, or native English speakers versus non-native English speech.

To recognize speech from non-native speakers, we usually apply adaptation techniques to a well-trained native acoustic model using the limited data available, because native speech data can easily be obtained from many open sources, while non-native speech data are rare and diverse.

2.4.2 Model Adaptation using Linear Transformations (MLLR)


Maximum likelihood linear regression (MLLR) computes transformations that reduce the mismatch between an original acoustic model and the non-native speech data being tested. This calculation requires some adaptation data, that is, speech data similar to the test data. The transform matrices are obtained by solving a maximization problem using the Expectation-Maximization (EM) technique. In detail, MLLR is a model adaptation technique that estimates one or several linear transformations for the mean and variance parameters of the Gaussian mixtures in the HMM model. By applying the transformations, the Gaussian mixture means and variances are shifted so that the modified HMM model performs better on the test dataset.

The transformation matrix estimated from the adaptation data transforms the original Gaussian mixture mean vector into a new estimate,

$$\hat{\mu} = W\,\xi$$

where $W$ is the $n \times (n+1)$ transformation matrix and $\xi$ is the original mean vector extended with a bias offset,

$$\xi = [\,w,\ \mu_1,\ \mu_2,\ \dots,\ \mu_n\,]^{T}$$

where $w$ represents the bias offset and is fixed at 1 in the HTK toolkit. $W$ can be decomposed into

$$W = [\,b \;\; A\,]$$

where $A$ is an $n \times n$ transformation matrix and $b$ is a bias vector. This form of transformation adapts the original Gaussian mixture means to the limited adaptation data; the author will continue to explain how the variances of the Gaussian mixtures are adapted.

Based on the standard in HTK, there are two ways to adapt the variances linearly. The first is of the form

$$\hat{\Sigma} = B^{T} H B$$

where $H$ is the linear transformation matrix to be estimated from the adaptation data and $B$ is the inverse of the Choleski factor of $\Sigma^{-1}$, that is,

$$\Sigma^{-1} = C\,C^{T} \quad \text{and} \quad B = C^{-1}$$

The transformation matrices are obtained by the Expectation-Maximization (EM) technique, which can make use of the limited adaptation data to move the original model a large step towards the test speech.
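A small NumPy sketch of applying an already-estimated MLLR mean transform to a set of Gaussian means; estimating W itself via EM is beyond this snippet, and the dimensions and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 39                                   # feature dimension (e.g. MFCCs plus deltas)
means = rng.normal(size=(10, n))         # 10 Gaussian mean vectors from the original model
W = rng.normal(size=(n, n + 1))          # MLLR transform, assumed already estimated by EM

def mllr_adapt_mean(mu, W):
    """Adapted mean  mu_hat = W xi,  with xi = [1, mu_1, ..., mu_n]^T as defined above."""
    xi = np.concatenate(([1.0], mu))     # extended mean vector, bias offset fixed at 1
    return W @ xi

adapted = np.array([mllr_adapt_mean(mu, W) for mu in means])
print(adapted.shape)                     # (10, 39)
```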

With an increasing amount of adaptation data, more transformation matrices can be estimated, and each transformation matrix is used for a certain group of mixtures, categorized by the regression class tree. For example, if only a small amount of data is available, then perhaps only one global transform can be generated; this transformation is applied to every Gaussian component in the model set. With more adaptation data, two or even tens of transformations are estimated for different groups of mixtures in the original acoustic model, and each transformation becomes more specific, grouping the Gaussian mixtures further into broad phone classes: silence, vowels, stops, glides, nasals, fricatives, etc. The classification may not be so accurate when adapting to non-native speech, because non-native speech is confusable across phone classes.


Figure 2.12 Regression tree for MLLR 1

Figure 2.12 is a simple example of a binary regression tree with four base classes, denoted {C4, C5, C6, C7}. A solid arrow and circle mean that there is sufficient data to generate a transformation matrix from the data associated with that class (the threshold is usually defined by the researcher in the application), while a dotted line and circle mean that there is insufficient data, for example nodes 5, 6 and 7. Therefore, transformations are only constructed for nodes 2, 3 and 4, namely W2, W3 and W4. The data in group 5 follow the transformation W2, the data in groups 6 and 7 share the transformation W3, and the data in group 4 are transformed by W4.

2.4.3 Model adaptation using MAP

Using limited adaptation data to transform the original acoustic model can also be accomplished by the maximum a posteriori (MAP) adaptation technique. The MAP approach is sometimes referred to as Bayesian adaptation.

MLLR is an example of what is called transformation-based adaptation: the parameters in a certain group of component models are transformed together with a single transform matrix. In contrast to MLLR, MAP re-estimates the model parameters individually. Sample means are calculated from the adaptation data, and an updated mean is then formed by shifting each original value towards the sample value.

In the original acoustic model, the parameters serve as informative priors: they were generated from previous training data and are speaker-independent model parameters. The update formula for a single-stream system, for state j and mixture component m, is

$$\hat{\mu}_{jm} = \frac{N_{jm}}{N_{jm} + \tau}\,\bar{\mu}_{jm} + \frac{\tau}{N_{jm} + \tau}\,\mu_{jm}$$

where $\tau$ is a weighting of the original model training data relative to the adaptation data, $N_{jm}$ is the occupation likelihood of the adaptation data, $\mu_{jm}$ is the original speaker-independent mean and $\bar{\mu}_{jm}$ is the sample mean of the adaptation data. If there is insufficient adaptation data for a phone to reliably estimate a sample mean, its occupation likelihood approaches 0 and little adaptation is performed.

1 S. Young, G. Evermann, "The HTK Book", version 3.4, Cambridge University Engineering Department, p. 149, 2006
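A minimal NumPy sketch of this MAP mean update for one Gaussian component; the value of tau, the occupation counts and the means are illustrative:

```python
import numpy as np

def map_update_mean(prior_mean, sample_mean, occupation, tau=20.0):
    """MAP-adapted mean: interpolation between the adaptation-data sample mean
    and the original speaker-independent (prior) mean, weighted by occupation."""
    w = occupation / (occupation + tau)
    return w * sample_mean + (1.0 - w) * prior_mean

prior_mean  = np.array([0.0, 1.0, -0.5])    # mean from the original SI model
sample_mean = np.array([0.4, 1.3, -0.1])    # mean estimated from adaptation data
print(map_update_mean(prior_mean, sample_mean, occupation=5.0))    # little data: close to prior
print(map_update_mean(prior_mean, sample_mean, occupation=500.0))  # much data: close to sample
```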

2.5 Lexical Model

The lexical model forms the bridge between the acoustic model and the language model; it defines the mapping between words and the phone set. Usually it is a pronunciation dictionary. For words with multiple pronunciations, the probability of each pronunciation can be modeled too (Table 2.2).

Table 2.2 The probability of the pronunciations


Some phonemes are currently difficult for machine learning to identify.

The following is an example of a dictionary used in HTK (Figure 2.13). "</s>" marks the sentence end and "<s>" marks the sentence start; the "[]" means that if the model recognizes a sequence of phones as most likely to be the sentence start or sentence end, it writes nothing to the output. If the acoustic model gives the highest likelihood score to the phone sequence "ah sp" for the current observations, the recognizer outputs the word "A" in the output file.

Figure 2.13 Dictionary format in HTK
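As a rough illustration of reading such entries, a small Python sketch that parses a simplified HTK-style dictionary line of the form WORD [OUTSYM] phone1 phone2 ...; the exact format and the example lines here are assumptions, not the thesis's actual dictionary file:

```python
def parse_htk_dict_line(line):
    """Parse a simplified HTK-style dictionary line: WORD [OUTSYM] phone1 phone2 ...
    The bracketed output symbol is optional; '[]' means nothing is written to the output."""
    parts = line.split()
    word, rest = parts[0], parts[1:]
    out_sym = word
    if rest and rest[0].startswith('['):
        out_sym = rest[0].strip('[]')    # empty string for '[]', e.g. sentence start/end
        rest = rest[1:]
    return word, out_sym, rest           # rest is the phone sequence

print(parse_htk_dict_line("A ah sp"))          # ('A', 'A', ['ah', 'sp'])
print(parse_htk_dict_line("<s> [] sil"))       # ('<s>', '', ['sil'])
```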

2.6 Language Model

A statistical language model assigns a probability to a sequence of m words; this models the grammar of a sentence. The most frequently used language model in HTK is the n-gram language model, which predicts each word in the sequence given its (n-1) previous words.

The probability of a sentence can be decomposed into a product of conditional probabilities:

$$P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})$$

The n-gram model approximates each conditional probability so that it depends only on the previous (n-1) words:

$$P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$

Usually, we model bigrams or trigrams in our experiments.

The conditional probabilities are based on maximum likelihood estimates, that is, on counting the events in context in some given training text:

$$P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1} \dots w_i)}{C(w_{i-n+1} \dots w_{i-1})}$$

where C(W) is the count of a given word sequence in the training text.

After the n-gram probabilities are stored in a database for the existing training data, some words may be mapped to an out-of-vocabulary class.1
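As a concrete illustration of these maximum likelihood counts, a small Python sketch that estimates bigram probabilities from a toy training text (the text and words are made up):

```python
from collections import Counter

train_text = "the cat sat on the mat the cat ran".split()

unigrams = Counter(train_text)
bigrams = Counter(zip(train_text, train_text[1:]))

def bigram_prob(prev_word, word):
    """Maximum likelihood estimate P(word | prev_word) = C(prev_word word) / C(prev_word)."""
    if unigrams[prev_word] == 0:
        return 0.0                      # unseen history; a real LM would back off or smooth
    return bigrams[(prev_word, word)] / unigrams[prev_word]

print(bigram_prob("the", "cat"))        # 2/3
print(bigram_prob("cat", "sat"))        # 1/2
```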

Sometimes a more complex language model can be achieved by constructing a grammar-based word net:

In a grammar-based language model, the plural, singular, prefix and so on are defined manually by standard English grammar, which is expensive and may not be very useful in practice, since in non-native speech the grammar is poorly formed unless the speech is read speech. In addition, the training data set of the LM should match the test data set. For example, an LM trained on newspaper text would be a good predictor for recognizing news reports, while the same LM would be a poor predictor for recognizing personal conversation or speech in a hotel reservation system.

1 Sometimes a word sequence in the test data has not been seen in the training data; we then define equivalence classes and assign a probability to this new word sequence.

2 K. C. Sim, "Statistical Language Model", Speech Processing lecture notes, p. 34, 2010


Chapter 3

Literature Review

3.1 Overview of the challenges in ASR for non-native speech

The history of spoken language technology is marked by the "SpeD" ("Speech and Dialogue") conferences. After 2000, with the support of Academician Mihai Draganescu, former President of the Romanian Academy, the organization of a conference in the field of spoken language technology and human-computer dialogue (the first one was organized in the 1980s) was resumed.

The 2nd edition of this conference (2003) showed the evolution from speech technology to spoken language technology. Mihai Draganescu mentioned:

"This had to be foreseen however from the very beginning involving the use of artificial intelligence, both for natural language processing and for acoustic-phonetic processes of the spoken language. The language technology is seen today with two great subdivisions: technologies of the written language and technologies of the spoken language. These subdivisions have to work together in order to obtain a valuable and efficient human-computer dialogue." [4]


By 2005, no dramatic changes had occurred in the research domain since the 2nd conference. However, some trends were becoming more and more obvious, and some new fields of interest appeared to be a promise for the future, as Corneliu Burileanu summarized.

The 2007 SpeD edition is considered a very interesting and up-to-date analysis of the achievements in the domain; it also presented the future trends from the "IEEE/ACL Workshop on Spoken Language Technology, Aruba, Dec 11-13, 2006". The following research areas were strongly encouraged: spoken language understanding, dialog management, spoken language generation, spoken document retrieval, information extraction from speech, question answering from speech, spoken document summarization, machine translation of spoken language, speech data mining and search, voice-based human-computer interfaces, spoken dialog systems, applications and standards, multimodal processing, systems and standards, and machine learning for spoken language processing, speech and language processing in the World Wide Web.

Biing-Hwang Juang and S. Furui presented a summary of system-level capabilities for spoken language translation:

"We were able to identify a constant development of what is called "speech interface technology" which includes automatic speech recognition, synthetic speech, and natural language processing. We noticed commercial applications in computer command, consumer, data entry, speech-to-text, telephone, and voice verification. Robust speaker-independent recognition systems for command and navigation in personal computers were already available; telephone-based transaction and database inquiry systems using both speech synthesis and recognition were coming into use." [5]


With recent developments in the spoken language technology domain, the target research trends are becoming clearer and clearer, and the important challenges in those research directions have appeared.

At the SpeD Conference in 2007, Hermann Ney, from the Computer Science Department of RWTH Aachen University, Germany, gave his "Closing Remarks: How to Continue?". The main issue in this domain that he emphasized is the interaction between the speech and NLP (natural language processing) communities in many areas.

Since 2007, one important challenge in this domain has been "speech-to-speech translation". The main issue is speech recognition improvement. One aspect of the main issue is how to improve ASR for spontaneous, conversational speech in multiple languages. Another aspect is that the translated text must be "speakable" for oral communication, which means it is not enough to translate the content adequately. Another aspect is the cost-effective development of new languages and domains. The last aspect is the challenge of intonation translation.

In the 4th SpeD edition (2007), an important field of research is multilingual spoken language processing, cited from "Multilingual Spoken Language Processing":

• first dialog demonstration systems: 1989-1993, restricted vocabulary, constrained speaking style, speed (2-10)x real time, platform: workstations;

• one-way phrasebooks: 1997-present, restricted vocabulary, constrained speaking style, speed (1-3)x real time, handheld devices;

• spontaneous two-way systems: 1993-present, unrestricted vocabulary, spontaneous speaking style, speed (1-5)x real time, PCs/handheld devices;

• translation of broadcast news: 2003-present, unrestricted vocabulary, ready-prepared speech, offline, PCs/PC clusters;

• simultaneous translation of lectures: 2005-present, unrestricted vocabulary, spontaneous speech, real time, PCs/laptops.

"With more than 6,900 languages in the world and the current trend of globalization, one of the most important challenges in spoken language technologies today is the need to support multiple input and output languages, especially if applications are intended for international markets, linguistically diverse user communities, and non-native speakers. In many cases, these applications have to support multiple languages simultaneously to meet the needs of a multicultural society. Consequently, new algorithms and tools are required that support the simultaneous recognition of mixed-language input, the summarization of multilingual text and spoken documents, the generation of output in the appropriate language, or the accurate translation from one language to another." [6]


3.2 The solutions for non-native speech challenges

As illustrated in Section 3.1, spoken language technology places more and more emphasis on improving ASR for spontaneous, conversational speech in multiple languages. With globalization, foreign accent is an especially crucial problem that ASR systems must address. Recognizing foreign-accented English speech with performance similar to that on native-accented English speech is the challenge for the acoustic model; there are state-level and phone-level disturbances in foreign accents. The first challenge in non-native acoustic modeling is that we cannot train a non-native acoustic model directly, because it is difficult to collect enough non-native speech data (about 300 to 500 hours). There are many open data sources for native speech, for example the Wall Street Journal and Broadcast News speech databases, newswire stories, and speech corpora from various research programs. Unfortunately, there is a lack of such broadcast or speech recording resources for non-native speech. When a native acoustic model is used to recognize non-native speech, the word error rate is typically about 2 to 3 times that of native speech [7]. Before the non-native acoustic model can be improved to a high level, the lexical model and language model cannot be expected to perform well.

Therefore, the author focuses her research on improving the acoustic model for non-native speech, especially on recognizing Mandarin-accented non-native speech in the target language of English.

In fact, the variation among non-native speakers, even those with the same motherland accent, is very large. These differences are characterized by different levels of fluency in pronunciation, different levels of familiarity with the target language, and different individual mistakes in pronouncing unfamiliar words.

The presence of a multitude of accents among non-native speakers is unavoidable even if we ignore the levels of proficiency, and this dramatically degrades ASR performance. As mentioned in Section 3.1, the spoken language technology domain began advocating research to address these challenges only a few years ago, but at the beginning of the 21st century some research to tackle them had already emerged. The most straightforward approach is to train a separate non-native acoustic model from non-native speech data [8]; however, non-native speech data are rarely publicly available, as explained above. Another approach is to apply general adaptation techniques such as MLLR and MAP with some adaptation data, by which the baseline acoustic model is modified towards a foreign accent [9]. Some researchers work on multilingual HMMs for non-native speech [10]. Some researchers find methods to combine native and limited non-native speech data, such as interpolation [11]. Some researchers apply both recognizer combination methods and multilingual acoustic models to non-native digit recognition [12].

In 2001, Tomokiyo wrote a dissertation, "Recognizing non-native speech: Characterizing and adapting to non-native usage in LVCSR" [13], which carries out a very detailed study to characterize low-proficiency non-native English spoken by Japanese speakers. Properties such as fluency, vocabulary, and pace in read and spontaneous speech are measured for both general and proficiency-controlled data sets.


Figure 3.1 Summary of acoustic modeling results 1

Tomokiyo then explores methods of adapting to non-native speech. A summary of the individual contributions of each adaptation method is shown in Figure 3.1. Using Japanese-accented English data and native Japanese data, together with the allophonic decision tree from the earlier characterization step, Tomokiyo applies both MLLR and MAP adaptation with retraining and interpolation at the end, achieving a 29% relative WER reduction over the baseline.

Tomokiyo's research shows us that non-native speech is very diverse, even when the research is restricted to a specific source language, proficiency level, and mode of speech. From the characterization results, we see tremendous intra- and inter-speaker variation in the production of spoken language. The study also shows that non-native speakers sometimes generate common patterns and sometimes generate unique events that defy classification. Nevertheless, the dissertation demonstrates that by using a small amount of non-native speech data, the recognition error for non-native speakers can be effectively reduced.

Also in 2001, Wu and Chang discussed approaches in which a few sentences from the test speaker are used to modify the already-trained SI models into a customized model, presented in "Cohorts based custom models for rapid speaker and dialect adaptation" [14].

1 Tomokiyo, L. M. (2001). Recognizing non-native speech: Characterizing and adapting to non-native usage in LVCSR. Unpublished doctoral dissertation, Carnegie Mellon University, Pittsburgh, PA.


It is well known that speaker-dependent acoustic models perform much better than speaker-independent acoustic models, as inter-speaker variability is reduced; about a 50% error reduction can be achieved with an SD model compared with an SI model. In the paper, Wu and Chang present an approach that uses as few as three sentences from the test speaker to select the closest speakers (cohorts) from both the original training set and newly available training speakers to construct customized models.

Firstly, the parameters of the speaker-adapted model for each "on-call" speaker are estimated by the MLLR technique described in [15]. The authors adopt only two criteria that directly reflect the improvement in system performance to select cohorts. The first is the accuracy of the enrollment data (3 test sentences) in a syllable recognition task. The second is the likelihood of the adaptation data after forced alignment against the true transcriptions, that is, the true transcript text of the 3 test sentences prepared in advance. With the enrollment data, the speakers are sorted according to their likelihood, and the top N speakers with the highest likelihood are picked as the cohorts. The final cohort list is tuned according to both the syllable accuracy and the likelihood. The data from the speakers in the cohort list are then used to enhance the model of the test speaker in many ways, such as retraining, MAP or MLLR (Figure 3.2).

The results of research from Wu and Chang show that the cohorts based custom

1 Leggetter, C. J. and Woodland, P. C., "Maximum Likelihood Linear Regression for Speaker Adaptation", Computer Speech and Language, Volume 9, No. 2, pp. 171-186
