
Unsupervised Learning of Acoustic Sub-word Units

Balakrishnan Varadarajan∗ and Sanjeev Khudanpur∗
Center for Language and Speech Processing
Johns Hopkins University
Baltimore, MD 21218
{bvarada2, khudanpur}@jhu.edu

Emmanuel Dupoux
Laboratoire de Science Cognitive et Psycholinguistique
75005, Paris, France
emmanuel.dupoux@gmail.com

Abstract

Accurate unsupervised learning of phonemes of a language directly from speech is demonstrated via an algorithm for joint unsupervised learning of the topology and parameters of a hidden Markov model (HMM); states and short state-sequences through this HMM correspond to the learnt sub-word units. The algorithm, originally proposed for unsupervised learning of allophonic variations within a given phoneme set, has been adapted to learn without any knowledge of the phonemes. An evaluation methodology is also proposed, whereby the state-sequence that aligns to a test utterance is transduced in an automatic manner to a phoneme-sequence and compared to its manual transcription. Over 85% phoneme recognition accuracy is demonstrated for speaker-dependent learning from fluent, large-vocabulary speech.

1 Automatic Discovery of Phone(me)s

Statistical models learnt from data are extensively used in modern automatic speech recognition (ASR) systems. Transcribed speech is used to estimate conditional models of the acoustics given a phoneme-sequence. The phonemic pronunciation of words and the phonemes of the language, however, are derived almost entirely from linguistic knowledge. In this paper, we investigate whether the phonemes may be learnt automatically from the speech signal.

Automatic learning of phoneme-like units has significant implications for theories of language acquisition in babies, but our considerations here are somewhat more technological. We are interested in developing ASR systems for languages or dialects for which such linguistic knowledge is scarce or nonexistent, and in extending ASR techniques to recognition of signals other than speech, such as manipulative gestures in endoscopic surgery. Hence an algorithm for automatically learning an inventory of intermediate symbolic units (intermediate relative to the acoustic or kinematic signal on one end and the word-sequence or surgical act on the other) is very desirable.

∗ This work was partially supported by National Science Foundation Grants No. IIS-0534359 and OISE-0530118.

Except for some early work on isolated word/digit recognition (Paliwal and Kulkarni, 1987; Wilpon et al., 1987, etc.), not much attention has been paid to automatic derivation of sub-word units from speech, perhaps because pronunciation lexicons are now available¹ in languages of immediate interest. What has been investigated is automatically learning allophonic variations of each phoneme due to co-articulation or contextual effects (Takami and Sagayama, 1992; Fukada et al., 1996); the phoneme inventory is usually assumed to be known.

The general idea in allophone learning is to begin with an inventory of only one allophone per phoneme, and incrementally refine the inventory to better fit the speech signal. Typically, each phoneme is modeled by a separate HMM. In early stages of refinement, when very few allophones are available, it is hoped that “similar” allophones of a phoneme will be modeled by shared HMM states, and that subsequent refinement will result in distinct states for different allophones. The key therefore is to devise a scheme for successive refinement of a model shared by many allophones. In the HMM setting, this amounts to simultaneously refining the topology and the model parameters. A successive state splitting (SSS) algorithm to achieve this was proposed by Takami and Sagayama (1992), and enhanced by Singer and Ostendorf (1996). Improvements in phoneme recognition accuracy using these derived allophonic models over phonemic models were obtained.

¹ See http://www.ldc.upenn.edu/Catalog/byType.jsp

In this paper, we investigate directly learning the allophone inventory of a language from speech without recourse to its phoneme set. We begin with a one-state HMM for all speech sounds and modify the SSS algorithm to successively learn the topology and parameters of HMMs with ever larger numbers of states. State sequences through this HMM are expected to correspond to allophones. The most likely state-sequence for a speech segment is interpreted as an “allophonic labeling” of that speech by the learnt model. Performance is measured by mapping the resultant state-sequence to phonemes.

One contribution of this paper is a significant improvement in the efficacy of the SSS algorithm, as described in Section 2. It is based on observing that the improvement in the goodness of fit by up to two consecutive splits of any of the current HMM states can be evaluated concurrently and efficiently. Choosing the best subset of splits from among these is then cast as a constrained knapsack problem, to which an efficient solution is devised. Another contribution of this paper is a method to evaluate the accuracy of the resulting “allophonic labeling,” as described in Section 3. It is demonstrated that if a small amount of phonetically transcribed speech is used to learn a Markov (bigram) model of state-sequences that arise from each phone, an evaluation tool results with which we may measure phone recognition accuracy, even though the HMM labels the speech signal not with phonemes but merely a state-sequence. Section 4 presents experimental results, where the performance accuracies with different learning setups are tabulated. We also see how as little as 5 minutes of speech is adequate for learning the acoustic units.

2 An Improved and Fast SSS Algorithm

The improvement of the SSS algorithm of Takami and Sagayama (1992), renamed ML-SSS by Singer and Ostendorf (1996), proceeds roughly as follows.

1. Model all the speech² using a 1-state HMM with a diagonal-covariance Gaussian (N = 1).

2. For each HMM state s, compute the gain in log-likelihood (LL) of the speech by either a contextual or a temporal split of s into two states $s_1$ and $s_2$. Among the N states, select and split the one that yields the most gain in LL.

3. If the gain is above a threshold, retain the split and set N = N + 1; furthermore, if N is less than desired, re-estimate all parameters of the new HMM, and go to Step 2.

Note that the key computational steps are the for-loop of Step 2 and the re-estimation of Step 3.

² Note that the original application of SSS was for learning the allophonic variations of a phoneme; hence the phrase “all the speech” meant all the speech corresponding separately to each phoneme. Here it really means all the speech.

Modifications to the ML-SSS Algorithm: We made the following modifications, which are favorable in terms of greater speed and larger search space, thereby yielding a gain in likelihood that is potentially greater than the original ML-SSS.

1. Model all the speech using a 1-state HMM with a full-covariance Gaussian density. Set N = 1.

2. Simultaneously replace each state s of the HMM with the 4-state topology shown in Figure 1, yielding a 4N-state HMM. If the state s had parameters $(\mu_s, \Sigma_s)$, then the means of its 4-state replacement are $\mu_{s_1} = \mu_s - \delta = \mu_{s_4}$ and $\mu_{s_2} = \mu_s + \delta = \mu_{s_3}$, with $\delta = \epsilon \lambda^{*} v^{*}$, where $\lambda^{*}$ and $v^{*}$ are the principal eigenvalue and eigenvector of $\Sigma_s$ and $0 < \epsilon \ll 1$ is typically 0.2.

3. Re-estimate all parameters of this (overgrown) HMM. Gather the Gaussian sufficient statistics for each of the 4N states from the last pass of re-estimation: the state occupancy $\pi_{s_i}$, the sample mean $\mu_{s_i}$, and the sample covariance $\Sigma_{s_i}$.

4. Each quartet of states (see Figure 1) that resulted from the same original state s can be merged back in different ways to produce 3, 2 or 1 HMM states. There are 6 ways to end up with 3 states, and 7 to end up with 2 states. Retain for further consideration the 4-state split of s, the best merge back to 3 states among the 6 ways, the best merge back to 2 states among the 7 ways, and the merge back to 1 state.

5. Reduce the number of states from 4N to N + ∆ by optimally³ merging back quartets that cause the least loss in log-likelihood of the speech.

6. Set N = N + ∆. If N is less than the desired HMM size, retrain the HMM and go to Step 2.

Figure 1: Modified four-way split of a state s.
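The merge counts quoted in Step 4 can be checked by brute-force enumeration. The sketch below assumes that any grouping of the four split states counts as a candidate merge, so the candidates are exactly the set partitions of the quartet; the function name `set_partitions` and the state names `s1` to `s4` are illustrative and do not come from the paper.

```python
from collections import Counter


def set_partitions(items):
    """Yield every partition of `items` into non-empty, unordered groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing group in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a group of its own.
        yield [[first]] + partition


quartet = ["s1", "s2", "s3", "s4"]
# Map "number of HMM states left after merging" to "number of ways to get there".
ways_by_size = Counter(len(p) for p in set_partitions(quartet))
print(sorted(ways_by_size.items()))
```

Under that assumption the enumeration gives 6 groupings into three states, 7 into two, and a single grouping each for the full split and the complete merge, matching the counts in Step 4.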

Observe that the 4-state split of Figure 1 permits a slight look-ahead in our scheme, in the sense that the goodness of a contextual or temporal split of two different states can be compared in the same iteration with two consecutive splits of a single state. Also, the split/merge statistics for a state are gathered in our modified SSS assuming that the other states have already been split, which facilitates consideration of concurrent state splitting. If $s_1, \ldots, s_m$ are merged into $\tilde{s}$, the loss of log-likelihood in Step 4 is:

$$\frac{d}{2} \sum_{i=1}^{m} \pi_{s_i} \log |\Sigma_{\tilde{s}}| \;-\; \frac{d}{2} \sum_{i=1}^{m} \pi_{s_i} \log |\Sigma_{s_i}|, \qquad (1)$$

where

$$\Sigma_{\tilde{s}} = \frac{\sum_{i=1}^{m} \pi_{s_i} \left( \Sigma_{s_i} + \mu_{s_i} \mu_{s_i}' \right)}{\sum_{i=1}^{m} \pi_{s_i}} - \mu_{\tilde{s}} \mu_{\tilde{s}}'.$$

Finally, in selecting the best ∆ states to add to the HMM, we consider many more ways of splitting the N original states than SSS does. E.g., going up from N = 6 to N + ∆ = 9 HMM states could be achieved by a 4-way split of a single state, a 3-way split of one state and a 2-way split of another, or a 2-way split of three distinct states; all of them are explored in the process of merging from 4N = 24 down to 9 states. Yet, like SSS, no original state s is permitted to merge with another original state s′. This latter restriction leads to an $O(N^5)$ algorithm for finding the best states to merge down⁴. Details of the algorithm are omitted for the sake of brevity.
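To make Eq. (1) concrete, the following sketch evaluates the merge loss from per-state sufficient statistics in plain Python. It rests on several assumptions that are not stated in the paper: the merged mean is taken to be the occupancy-weighted mean of the merged states, log-determinants are computed with a hand-written Cholesky factorization, and the helper names (`log_det_spd`, `merged_covariance`, `merge_loss`) are invented for illustration. The d/2 prefactor is kept exactly as printed in Eq. (1).

```python
import math


def log_det_spd(a):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(n))


def merged_covariance(occ, means, covs):
    """Covariance of the merged state, following the expression below Eq. (1)."""
    total = sum(occ)
    d = len(means[0])
    # Assumption: the merged mean is the occupancy-weighted mean.
    mean = [sum(p * m[r] for p, m in zip(occ, means)) / total for r in range(d)]
    return [
        [
            sum(p * (c[r][q] + m[r] * m[q]) for p, m, c in zip(occ, means, covs)) / total
            - mean[r] * mean[q]
            for q in range(d)
        ]
        for r in range(d)
    ]


def merge_loss(occ, means, covs):
    """Log-likelihood loss of Eq. (1), keeping the d/2 prefactor as printed."""
    d = len(means[0])
    merged = merged_covariance(occ, means, covs)
    pooled = sum(occ) * log_det_spd(merged)
    separate = sum(p * log_det_spd(c) for p, c in zip(occ, covs))
    return 0.5 * d * (pooled - separate)


# Toy 1-D example: two equally occupied states with unit variance, means 0 and 2.
loss = merge_loss([1.0, 1.0], [[0.0], [2.0]], [[[1.0]], [[1.0]]])
print(round(loss, 4))
```

Merging two states with identical statistics gives a loss of zero, while states whose means differ give a positive loss, which is the behavior one would expect from a quantity used to rank candidate merges.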

In summary, our modified ML-SSS algorithm can leap-frog by ∆ states at a time, e.g. ∆ = αN, whereas the standard algorithm adds one state at a time, and it has the benefit of some lookahead to avoid greediness.
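The selection in Step 5 can be pictured as a multiple-choice knapsack: each original state keeps 1, 2, 3 or 4 of its split states, and the total must come to N + ∆ at the smallest total loss. The sketch below is a generic dynamic program over that formulation and is not the authors' O(N^5) algorithm, whose details the paper omits; the function name `choose_merges`, the layout of `losses`, and the toy numbers are all hypothetical.

```python
def choose_merges(losses, target):
    """Pick how many split states each original state keeps.

    losses[s][k - 1] is the (hypothetical) log-likelihood loss if original
    state s keeps k of its 4 split states, so losses[s][3] is normally 0.
    Returns (total_loss, picks) with sum(picks) == target and minimal loss.
    """
    # best[t] = (loss, picks) for the states processed so far, using t states.
    best = {0: (0.0, [])}
    for state_losses in losses:
        nxt = {}
        for used, (loss, picks) in best.items():
            for k, extra in enumerate(state_losses, start=1):
                t = used + k
                if t > target:
                    break
                cand = (loss + extra, picks + [k])
                if t not in nxt or cand[0] < nxt[t][0]:
                    nxt[t] = cand
        best = nxt
    if target not in best:
        raise ValueError("target number of states is not reachable")
    return best[target]


# Toy example, invented for illustration: N = 3 original states, N + delta = 5.
toy_losses = [
    [5.0, 1.0, 0.25, 0.0],
    [0.5, 0.25, 0.125, 0.0],
    [4.0, 3.0, 0.5, 0.0],
]
total_loss, picks = choose_merges(toy_losses, target=5)
print(total_loss, picks)
```

In this toy case the cheapest way to reach five states keeps two states for the first and third original states and merges the second back to one, because the second state loses the least from a full merge.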

³ This entails solving a constrained knapsack problem.

⁴ This is a restricted version of the 0-1 knapsack problem.

3 Evaluating the Goodness of the Labels

The HMM learnt in Section 2 is capable of assigning state-labels to speech via the Viterbi algorithm. Evaluating whether these labels are linguistically meaningful requires interpreting the labels in terms of phonemes. We do so as follows.

Some phonetically transcribed speech is labeled with the learnt HMM, and the label sequences corresponding to each phone segment are extracted. Since the HMM was learnt from unlabeled speech, the labels and short label-sequences usually correspond to allophones, not phonemes. Therefore, for each triphone, i.e. each phone tagged with its left- and right-phone context, a simple bigram model of label sequences is estimated. An unweighted “phone loop” that accepts all phone sequences is created, and composed with these bigram models to create a label-to-phone transducer capable of mapping HMM label sequences to phone sequences.
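The per-triphone bigram models described above can be sketched as simple count tables. This is only an illustration: the add-one smoothing, the boundary symbols `<s>` and `</s>`, the function names, and the toy training pairs are assumptions made here, and the composition with the phone loop into a transducer is not shown.

```python
import math
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary symbols


def train_bigrams(examples):
    """examples: iterable of (triphone, label_sequence) pairs.

    Returns nested counts: counts[triphone][previous_label][next_label].
    """
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for unit, labels in examples:
        path = [START, *labels, END]
        for prev, cur in zip(path, path[1:]):
            counts[unit][prev][cur] += 1
    return counts


def log_prob(counts, unit, labels, vocab_size):
    """Add-one smoothed bigram log-probability of `labels` under `unit`'s model."""
    path = [START, *labels, END]
    total = 0.0
    for prev, cur in zip(path, path[1:]):
        row = counts.get(unit, {}).get(prev, {})
        total += math.log((row.get(cur, 0) + 1) / (sum(row.values()) + vocab_size))
    return total


# Toy training pairs, invented for illustration: (left, phone, right) -> labels.
toy = [
    (("k", "a", "t"), [3, 17, 2]),
    (("k", "a", "t"), [3, 17, 2]),
    (("a", "t", "e"), [11, 28, 32]),
    (("a", "t", "e"), [11, 28]),
]
model = train_bigrams(toy)
vocab_size = 6  # five distinct labels in the toy data, plus END
scores = {unit: log_prob(model, unit, [3, 17, 2], vocab_size) for unit in model}
best = max(scores, key=scores.get)
print(best)
```

Scoring one segment in isolation, as done here, is a simplification; in the paper the bigram models are composed with the phone loop so that whole utterances are decoded jointly.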

Finally, the test speech (not used for HMM learning, nor for estimating the bigram model) is treated as having been “generated” by a source-channel model in which the label-to-phone transducer is the source, generating an HMM state-sequence, and the Gaussian densities of the learnt HMM states constitute the channel, taking the HMM state-sequence as the channel input and generating the observed speech signal as the output. Standard Viterbi decoding determines the most likely phone sequence for the test speech, and phone accuracy is measured by comparison with the manual phonetic transcription.
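The paper does not spell out how phone accuracy is computed from the comparison. A common convention, assumed here, is one minus the edit distance (substitutions, deletions and insertions) divided by the length of the reference transcription. The sketch below implements that convention; the function names and the example phone strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference phone
                cur[j - 1] + 1,          # insertion of a hypothesis phone
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def phone_accuracy(ref, hyp):
    """Assumed definition: 1 - (S + D + I) / N, with N the reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Toy example: one substitution (t -> d) and one deletion (s).
reference = ["k", "a", "t", "o", "s", "a"]
hypothesis = ["k", "a", "d", "o", "a"]
errors = edit_distance(reference, hypothesis)
accuracy = phone_accuracy(reference, hypothesis)
print(errors, round(accuracy, 3))
```

With this definition, many insertions can push the accuracy below zero, which is one reason the convention should be stated whenever accuracies are compared across systems.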

4 Experimental Results

4.1 Impact of the Modified State Splitting

The ML-SSS procedure estimates 2N different (N + 1)-state HMMs to grow from N to N + 1 states. Our procedure estimates one 4N-state HMM to grow to N + ∆, making it hugely faster for large N. Table 1 compares the log-likelihood of the training speech for ML-SSS and our procedure. The results validate our modifications, demonstrating that at least in the regimes feasible for ML-SSS, there is no loss (in fact a tiny gain) in fitting the speech data, and a large saving in computational effort⁵.

⁵ ML-SSS with ∆ = 1 was impractical beyond N = 22.

# of states    SSS (∆ = 1)    ∆ = 3    ∆ = N
8              -7.14          -7.13    -7.13
10             -7.08          -7.06    -7.06
22             -6.78          -6.76    N/A
40             N/A            -6.23    -6.20

Table 1: Aggressive state splitting does not cause any degradation in log-likelihood relative to ML-SSS.
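A rough count of HMM estimations shows why the modified procedure scales better. The sketch below simply tallies the figures given in Section 4.1: 2N candidate HMMs per added state for ML-SSS, and one 4N-state HMM per round for the modified procedure. It assumes both procedures start from N = 1 and it ignores the fact that a single 4N-state estimation costs more than a single (N + 1)-state one; the function names are hypothetical.

```python
def ml_sss_hmm_fits(n_start, n_end):
    """HMM estimations if 2N candidate (N + 1)-state HMMs are fitted at each size N."""
    return sum(2 * n for n in range(n_start, n_end))


def modified_rounds(n_start, n_end, delta_of):
    """Number of 4N-state HMM estimations when N grows by delta_of(N) per round."""
    n, rounds = n_start, 0
    while n < n_end:
        n = min(n + delta_of(n), n_end)  # never overshoot the desired size
        rounds += 1
    return rounds


print(ml_sss_hmm_fits(1, 22))                 # ML-SSS up to 22 states
print(modified_rounds(1, 22, lambda n: 3))    # modified procedure, delta = 3
print(modified_rounds(1, 40, lambda n: n))    # modified procedure, delta = N
```

Under these assumptions, ML-SSS needs several hundred candidate estimations to reach 22 states, whereas the modified procedure needs a handful of rounds, which is consistent with the footnote that ML-SSS became impractical beyond N = 22.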

4.2 Unsupervised Learning of Sub-word Units

We used about 30 minutes of phonetically transcribed Japanese speech from one speaker⁶ provided by Maekawa (2003) for our unsupervised learning experiments. The speech was segmented via silence detection into 800 utterances, which were further partitioned into a 24-minute training set (80%) and a 6-minute test set (20%).

Our first experiment was to learn an HMM from the training speech using our modified ML-SSS procedure; we tried N = 22, 70 and 376. For each N, we then labeled the training speech using the learnt HMM, used the phonetic transcription of the training speech to estimate label-bigram models for each triphone, and built the label-to-phone transducer as described in Section 3. We also investigated (i) using only 5 minutes of training speech to learn the HMM, but still labeling and using all 24 minutes to build the label-to-phone transducer, and (ii) setting aside 5 minutes of training speech to learn the transducer and using the rest to learn the HMM. For each learnt HMM+transducer pair, we phonetically labeled the test speech.

The results in the first column of Table 2 suggest that the sub-word units learnt by the HMM are indeed interpretable as phones. The second column suggests that a small amount of speech (5 minutes) may be adequate to learn these units consistently. The third column indicates that learning how to map the learnt (allophonic) units to phones requires relatively more transcribed speech.

HMM learnt from               24 min    5 min     19 min
Label-to-phone learnt from    24 min    24 min    5 min
27 states                     71.4%     70.9%     60.2%
70 states                     84.4%     84.7%     75.8%
376 states                    87.2%     86.8%     76.6%

Table 2: Phone recognition accuracy for different HMM sizes (N), and with different amounts of speech used to learn the HMM labeler and the label-to-phone transducer.

⁶ We heeded advice from the literature indicating that automatic methods model gross channel- and speaker-differences before capturing differences between speech sounds.

4.3 Inspecting the Learnt Sub-word Units

The most frequent 3-, 4- and 5-state sequences in the automatically labeled speech consistently matched particular phones in specific articulatory contexts, as shown below, i.e. the HMM learns allophones.

HMM labels          Left context    Phone        Right context
11, 28, 32          vowel           t            [e|a|o]
15, 17, 2           [g|k]           [u|o]        [?]
3, 17, 2            [k|t|g|d]       a            [k|t|g|d]
31, 5, 13, 5        vowel           [s|sj|sy]    vowel
17, 2, 31, 11       [g|t|k|d]       [a|o]        [t|k]
3, 30, 22, 34       [?]             a            silence
6, 24, 8, 15, 22    [?]             o            silence
4, 3, 17, 2, 21     [k|t]           a            [k|t]
4, 17, 24, 2, 31    [s|sy|z]        o            [t|d]
                    [t|d]           o            [s|sy|z]

For instance, the label sequence 3, 17, 2 corresponds to an “a” surrounded by stop consonants {t, d, k, g}; further restricting the sequence to 4, 3, 17, 2, 21 results in restricting the context to the unvoiced stops {t, k}. That such clusters are learnt without knowledge of phones is remarkable.
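Finding the most frequent state sequences amounts to counting label n-grams over the labeled speech. The sketch below assumes frame-level Viterbi labels in which a state may repeat over consecutive frames, and collapses such runs before counting; whether the authors did the same is not stated, and the function names and toy utterances are invented for illustration.

```python
from collections import Counter
from itertools import groupby


def collapse_runs(frame_labels):
    """Turn frame-level labels into a state sequence by merging repeated frames."""
    return [label for label, _ in groupby(frame_labels)]


def top_ngrams(utterances, n, k):
    """Return the k most frequent length-n state sequences across utterances."""
    counts = Counter()
    for frames in utterances:
        states = collapse_runs(frames)
        counts.update(tuple(states[i:i + n]) for i in range(len(states) - n + 1))
    return counts.most_common(k)


# Toy frame-level labelings, invented for illustration.
utterances = [
    [3, 3, 17, 17, 17, 2, 4],
    [4, 3, 17, 2, 2, 21],
    [3, 17, 2],
]
most_common_trigram = top_ngrams(utterances, n=3, k=1)
print(most_common_trigram)
```

Each frequent n-gram found this way would then be paired with the phones and contexts it aligns to in the transcribed speech, which is the kind of inspection the table above summarizes.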

References

T. Fukada, M. Bacchiani, K. K. Paliwal, and Y. Sagisaka. 1996. Speech recognition based on acoustically derived segment units. In ICSLP, pages 1077–1080.

K. Maekawa. 2003. Corpus of spontaneous Japanese: its design and evaluation. In ISCA/IEEE Workshop on Spontaneous Speech Processing and Recognition.

K. K. Paliwal and A. M. Kulkarni. 1987. Segmentation and labeling using vector quantization and its application in isolated word recognition. Journal of the Acoustical Society of India, 15:102–110.

H. Singer and M. Ostendorf. 1996. Maximum likelihood successive state splitting. In ICASSP, pages 601–604.

J. Takami and S. Sagayama. 1992. A successive state splitting algorithm for efficient allophone modeling. In ICASSP, pages 573–576.

J. G. Wilpon, B. H. Juang, and L. R. Rabiner. 1987. An investigation on the use of acoustic sub-word units for automatic speech recognition. In ICASSP, pages 821–824.
