Mammone, R.J. & Zhang, X. "Robust Speech Processing as an Inverse Problem"
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
Robust Speech Processing as an Inverse Problem
Richard J Mammone
Rutgers University
Xiaoyu Zhang
Rutgers University
27.1 Introduction
27.2 Speech Production and Spectrum-Related Parameterization
27.3 Template-Based Speech Processing
27.4 Robust Speech Processing
27.5 Affine Transform
27.6 Transformation of Predictor Coefficients
    Deterministic Convolutional Channel as a Linear Transform • Additive Noise as a Linear Transform
27.7 Affine Transform of Cepstral Coefficients
27.8 Parameters of Affine Transform
27.9 Correspondence of Cepstral Vectors
References
27.1 Introduction
This section addresses the inverse problem in robust speech processing. A problem that speaker and speech recognition systems regularly encounter in commercial applications is the dramatic degradation of performance due to a mismatch between the training and operating environments. The mismatch generally results from the diversity of operating environments. For applications over the telephone network, the operating environment may vary from offices and laboratories to households and airports. The problem becomes worse when speech is transmitted over a wireless network: here the system experiences cross-channel interference in addition to the channel and noise degradations present in the regular telephone network. The key issue in robust speech processing is to obtain good performance regardless of the mismatch in environmental conditions. The inverse problem in this sense refers to the process of modeling the mismatch as a transformation and resolving it via an inverse transformation. In this section, we introduce a method of modeling the mismatch as an affine transformation.
Before getting into the details of the inverse problem in robust speech processing, we give a brief review of the mechanism of speech production, as well as the extraction of useful information from speech for recognition purposes.
1999 by CRC Press LLC
27.2 Speech Production and Spectrum-Related Parameterization
The speech signal consists of time-varying acoustic waveforms produced as a result of acoustical excitation of the vocal tract. It is nonstationary in that the vocal tract configuration changes over time. A time-varying digital filter is generally used to describe the vocal tract characteristics. The steady-state system function of the filter is of the form [1, 2]:
    S(z) = G / ∏_{i=1}^{p} (1 − z_i z^{−1}) ,    (27.1)

where p is the order of the system and the z_i denote the poles of the transfer function. The time domain representation of this filter is
    s(n) = ∑_{i=1}^{p} a_i s(n − i) + G u(n) .    (27.2)
The speech sample s(n) is predicted as a linear combination of the previous p samples plus the excitation G u(n), where G is the gain factor. The factor G is generally ignored in recognition-type tasks to allow for robustness to variations in the energy of speech signals. This speech production model is often referred to as the linear prediction (LP) model, or the autoregressive model, and the coefficients a_i are called the predictor coefficients.
The cepstrum of the speech signal s(n) is defined as
    c(n) = (1/2π) ∫_{−π}^{π} log |S(e^{jω})| e^{jωn} dω .    (27.3)

It is simply the inverse Fourier transform of the logarithm of the magnitude of the Fourier transform S(e^{jω}) of the signal s(n).
From the definition of the cepstrum in Eq. (27.3), we have
    ∑_{n=−∞}^{∞} c(n) e^{−jωn} = log |S(e^{jω})| = log [ G / ∏_{i=1}^{p} |1 − z_i e^{−jω}| ] .    (27.4)
If we differentiate both sides of the equation with respect to ω and equate the coefficients of like powers of e^{jω}, the following recursion is obtained:
    c(n) = { log G ,                                     n = 0
           { a(n) + (1/n) ∑_{i=1}^{n−1} i c(i) a(n − i), n > 0    (27.5)
The cepstral coefficients can be calculated using the recursion once the predictor coefficients are solved. The zeroth-order cepstral coefficient is generally ignored in speech and speaker recognition due to its sensitivity to the gain factor, G.
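The recursion in Eq. (27.5) is easy to implement directly. The sketch below is a minimal Python illustration (not from the text); it ignores the gain term, so the zeroth coefficient is omitted:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c(1)..c(n_ceps) from predictor coefficients
    a[0] = a(1), ..., a[p-1] = a(p), via the recursion of Eq. (27.5).
    The gain term (c(0) = log G) is ignored, as is usual in recognition."""
    p = len(a)
    c = np.zeros(n_ceps + 1)                 # c[0] unused
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0    # a(n) = 0 beyond order p
        for i in range(1, n):
            if 1 <= n - i <= p:
                acc += (i / n) * c[i] * a[n - i - 1]
        c[n] = acc
    return c[1:]

# Single-pole sanity check: for a pole at z = 0.5 the cepstrum has the
# closed form c(n) = 0.5**n / n, which the recursion reproduces.
print(lpc_to_cepstrum([0.5], 4))
```

The single-pole closed form provides a quick check of any implementation of the recursion.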
An alternative solution for the cepstral coefficients is given by
    c(n) = (1/n) ∑_{i=1}^{p} z_i^n ,  n > 0 .    (27.6)
It is obtained by equating the terms of like powers of z^{−1} in the following equation:

    ∑_{n=−∞}^{∞} c(n) z^{−n} = log ∏_{i=1}^{p} 1/(1 − z_i z^{−1}) = − ∑_{i=1}^{p} log [1 − z_i z^{−1}] ,    (27.7)
where the logarithm terms can be written as a power series expansion given as

    log [1 − z_i z^{−1}] = − ∑_{k=1}^{∞} (1/k) z_i^k z^{−k} .    (27.8)
There are two standard methods of solving for the predictor coefficients a_i, namely, the autocorrelation method and the covariance method [3, 4, 5, 6]. Both approaches are based on minimizing the mean square value of the estimation error e(n) as given by
    e(n) = s(n) − ∑_{i=1}^{p} a_i s(n − i) .    (27.9)
The two methods differ with respect to the details of numerical implementation. The autocorrelation method assumes that the speech samples are zero outside the processing interval of N samples. This results in a nonzero prediction error, e(n), outside the interval. The covariance method fixes the interval over which the prediction error is computed and places no constraints on the sample values outside the interval. The autocorrelation method is computationally simpler than the covariance approach and assures a stable system where all poles of the transfer function lie within the unit circle. A brief description of the autocorrelation method is given as follows.
The autocorrelation of the signal s(n) is defined as

    r_s(k) = ∑_{n=0}^{N−1−k} s(n) s(n + k) = s(n) ⊗ s(−n) ,    (27.10)
where N is the number of samples in the sequence s(n) and the sign ⊗ denotes the convolution operation. The definition of autocorrelation implies that r_s(k) is an even function. The predictor coefficients a_i can therefore be obtained by solving the following set of equations:
    ⎡ r_s(0)     r_s(1)     ···  r_s(p−1) ⎤ ⎡ a_1 ⎤   ⎡ r_s(1) ⎤
    ⎢ r_s(1)     r_s(0)     ···  r_s(p−2) ⎥ ⎢  ⋮  ⎥ = ⎢   ⋮    ⎥ .    (27.11)
    ⎢    ⋮          ⋮                ⋮    ⎥ ⎢     ⎥   ⎢        ⎥
    ⎣ r_s(p−1)   r_s(p−2)   ···  r_s(0)   ⎦ ⎣ a_p ⎦   ⎣ r_s(p) ⎦
Denoting the p × p Toeplitz autocorrelation matrix on the left-hand side by R_s, the predictor coefficient vector by a, and the vector of autocorrelation coefficients by r_s, we have R_s a = r_s. The predictor coefficient vector a is obtained by the inverse relation

    a = R_s^{−1} r_s .

This equation will be used throughout the analysis in the rest of this article. Since the matrix R_s is Toeplitz, a computationally efficient algorithm known as the Levinson-Durbin recursion can be used to solve for a [3].
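As a concrete sketch (an illustration, not the text's implementation), the autocorrelation method can be written in a few lines of Python; a general-purpose linear solver stands in for the Levinson-Durbin recursion:

```python
import numpy as np

def lpc_autocorrelation(s, p):
    """Autocorrelation-method LPC: build the Toeplitz system of
    Eq. (27.11) and solve a = R_s^{-1} r_s."""
    N = len(s)
    r = np.array([np.dot(s[:N - k], s[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

# Synthetic check: generate a long AR(2) signal with known (arbitrary)
# predictor coefficients, then recover them from the samples.
rng = np.random.default_rng(0)
a_true = np.array([0.75, -0.5])                 # stable AR(2) model
u = rng.standard_normal(20000)                  # white excitation u(n)
s = np.zeros_like(u)
for n in range(2, len(s)):
    s[n] = a_true @ s[n - 2:n][::-1] + u[n]     # s(n) = a1 s(n-1) + a2 s(n-2) + u(n)
a_est = lpc_autocorrelation(s, 2)
print(a_est)                                    # close to [0.75, -0.5]
```

In production code one would use the Levinson-Durbin recursion to exploit the Toeplitz structure; the direct solve above is only for clarity.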
27.3 Template-Based Speech Processing
Template-based matching algorithms for speech processing generally exploit the similarity of the vocal tract characteristics embedded in the spectrum of a particular speech sound.
There are two types of speech sounds, namely, voiced and unvoiced sounds. Figure 27.1 shows the speech waveforms, the spectra, and the spectral envelopes of voiced and unvoiced sounds. Voiced sounds such as the vowel /a/ and the nasal sound /n/ are produced by the passage of a quasi-periodic air wave through the vocal tract that creates resonances in the speech waveforms known as formants. The quasi-periodic air wave is generated as a result of the vibration of the vocal cords. The fundamental frequency of the vibration is known as the pitch. In the case of generating fricative sounds such as /sh/, the vocal tract is excited by random noise, resulting in speech waveforms exhibiting no periodicity, as can be seen in Fig. 27.1. Therefore, the spectral envelopes of voiced sounds consistently exhibit the pitch as well as three to five formants when the sampling rate is 8 kHz, whereas the spectral envelopes of unvoiced sounds reveal no pitch or formant characteristics. In addition, the formants of different voiced sounds differ with respect to their shape and the location of their center frequencies. This is due to the unique shape of the vocal tract formed to produce a particular sound. Thus, different sounds can be distinguished based on attributes of the spectral envelope.
The cepstral distance given by

    d = ∑_{n=−∞}^{∞} [c(n) − c′(n)]²    (27.12)
is one of the metrics for measuring the similarity of two spectral envelopes. The reason is as follows. From the definition of the cepstrum, we have
    ∑_{n=−∞}^{∞} [c(n) − c′(n)] e^{jωn} = log |S(e^{jω})| − log |S′(e^{jω})|
                                        = log [ |S(e^{jω})| / |S′(e^{jω})| ] .    (27.13)
The Fourier transform of the difference between a pair of cepstra is equal to the difference between the corresponding log spectra. By applying Parseval's theorem, the cepstral distance can be related to the log spectral distance as
    d = ∑_{n=−∞}^{∞} [c(n) − c′(n)]² = (1/2π) ∫_{−π}^{π} [ log |S(e^{jω})| − log |S′(e^{jω})| ]² dω .    (27.14)
The cepstral distance is usually approximated by the distance between the first few lower-order cepstral coefficients, the reason being that the magnitude of the high-order cepstral coefficients is small and makes a negligible contribution to the cepstral distance.
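The Parseval relation in Eq. (27.14) can be checked numerically with a DFT-based cepstrum; the two single-pole spectra below are arbitrary examples chosen for illustration:

```python
import numpy as np

# Compare the cepstral distance with the mean-square log-spectral
# distance for two single-pole spectra (poles at 0.8 and 0.5 are
# illustrative choices, not from the text).
M = 4096
w = 2 * np.pi * np.arange(M) / M
log_S1 = -np.log(np.abs(1 - 0.8 * np.exp(-1j * w)))   # log|S(e^{jw})|
log_S2 = -np.log(np.abs(1 - 0.5 * np.exp(-1j * w)))   # log|S'(e^{jw})|

c1 = np.fft.ifft(log_S1).real      # DFT approximation of Eq. (27.3)
c2 = np.fft.ifft(log_S2).real
d_ceps = np.sum((c1 - c2) ** 2)              # sum over all cepstral indices
d_spec = np.mean((log_S1 - log_S2) ** 2)     # (1/2pi) * integral of squared log ratio
print(abs(d_ceps - d_spec))                  # agrees to machine precision
```

Truncating the cepstral sum to the first few dozen indices changes the distance only slightly, since the coefficients decay like z^n/n; this is the justification for the low-order approximation above.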
27.4 Robust Speech Processing
Robust speech processing attempts to maintain the performance of speaker and speech recognition systems when variations in the operating environment are encountered. This can be accomplished if the similarity in vocal tract structures of the same sound can be recovered under adverse conditions. Figure 27.2 illustrates how the deterministic channel and random noise contaminate a speech signal during the recording and transmission of the signal.
First of all, at the front end of the speech acquisition system, additive background noise N_1(ω) from the speaking environment distorts the speech waveform. Adverse background conditions are also found to put stress on the speech production system and change the characteristics of the vocal tract. This is equivalent to performing a linear filtering of the speech. This problem will be addressed in another chapter and will not be discussed here.

FIGURE 27.1: Illustration of voiced/unvoiced speech.

FIGURE 27.2: The speech acquisition system.
After being sampled and quantized, the speech samples corrupted by the background noise N_1(ω) are then passed through a transmission channel, such as a telephone network, to reach the receiver's site. The transmission channel generally involves two types of degradation sources: a deterministic convolutional filter with transfer function H(ω), and additive noise, denoted by N_2(ω) in Fig. 27.2.
The signal observed at the output of the system is, therefore,

    Y(ω) = H(ω) [X(ω) + N_1(ω)] + N_2(ω) .    (27.15)

The spectrum of the output signal is corrupted by both additive and multiplicative interferences. The multiplicative interference due to the linear channel H(ω) is sometimes referred to as the multiplicative noise.
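A time-domain sketch of this degradation model is given below; the sinusoidal "speech", the channel taps, and the noise levels are all illustrative assumptions rather than values from the text:

```python
import numpy as np

# Time-domain version of Eq. (27.15): background noise n1 is added before
# the channel, the sum is convolved with the channel impulse response h,
# and channel noise n2 is added afterwards.
rng = np.random.default_rng(0)
n = np.arange(8000)
x = np.sin(2 * np.pi * 0.05 * n)            # stand-in for clean speech X
n1 = 0.05 * rng.standard_normal(n.size)     # background noise N1
h = np.array([1.0, 0.6, 0.2])               # channel impulse response H
n2 = 0.05 * rng.standard_normal(n.size)     # channel noise N2
y = np.convolve(x + n1, h)[: n.size] + n2   # observed signal Y
```

The convolution contributes the multiplicative (in frequency) interference, while n1 and n2 contribute the additive interference; both are present in y simultaneously, which is exactly the composite condition discussed next.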
The various sources of degradation cause distortions of the predictor coefficients and the cepstral coefficients. Figure 27.4 shows the change in the spatial clustering of the cepstral coefficients due to interference from a linear channel, white noise, and the composite effect of both.
• When the speech is filtered by a linear bandpass channel, the frequency response of which is shown in Fig. 27.3, a translation of the cepstral clusters is observed, as shown in Fig. 27.4(b).
• When the speech is corrupted by Gaussian white noise at 15 dB SNR, a shrinkage of the cepstral vectors results. This is shown in Fig. 27.4(c), where it can be seen that the cepstral clusters move toward the origin.
• When the speech is degraded by both the linear channel and Gaussian white noise, the cepstral vectors are translated and scaled simultaneously.
There are three underlying ideas behind the various solutions to robust speech processing. The first is to recover the speech signal from the noisy observation by removing an estimate of the noise from the signal. This is also known as the speech enhancement approach. Methods that operate in the speech sample domain include noise suppression [7] and noise masking [8]. Other speech enhancement methods are carried out in the feature domain, for example, cepstral mean subtraction (CMS) and pole-filtered cepstral mean subtraction (PFCMS). In the second category, the key to the problem is to find feature sets that are invariant¹ to changes in the transmission channel and environmental noise. The liftered cepstrum [9] and the adaptive component weighted (ACW) cepstrum [10] are examples of this feature enhancement approach. A third category consists of methods for matching the testing features with the models after adaptation to the environmental conditions [11, 12, 13, 14]. In this case, the presence of noise in the training and testing environments is tolerable as long as an adaptation algorithm can be found to match the conditions. The adaptation can be performed in either of two directions, i.e., adapt the training data to the testing environment, or adapt the testing data to the training environment.

FIGURE 27.3: The simulated environmental interference. (a) Medium voiced channel and (b) Gaussian white noise.
The focus of the following discussion will be on viewing robust speech processing as an inverse problem. We utilize the fact that both deterministic and non-deterministic noise introduce a sound-dependent linear transformation of the predictor coefficients of speech. This can be approximated by an affine transformation in the cepstrum domain. The mismatch can, therefore, be resolved by solving for the inverse affine transformation of the cepstral coefficients.
27.5 Affine Transform
An affine transform y of a vector x is defined as

    y = A x + b ,  for b ≠ 0 .    (27.16)

The matrix A represents the linear transformation of the vector x, and b is a nonzero vector representing a translation. Note that the addition of the vector b causes the transform to become nonlinear.
The singular value decomposition (SVD) of the matrix A can be used to gain some insight into the geometry of an affine transform, i.e.,

    y = U Σ V^T x + b ,    (27.17)
¹ In practice, it is difficult to find a set of features invariant to environmental changes. The robust features currently used are mostly just less sensitive to environmental changes.
FIGURE 27.4: The spatial distribution of cepstral coefficients under various conditions; "∗" for the sound /a/, "o" for the sound /n/, and "+" for the sound /sh/. (a) Cepstrum of the clean speech; (b) cepstrum of signals filtered by the continental U.S. mid-voice channel (CMV); (c) cepstrum of signals with 15 dB SNR, the noise type being additive white Gaussian (AWG); (d) cepstrum of speech corrupted by both the CMV channel and AWG noise at 15 dB SNR.
where U and V^T are unitary matrices and Σ is a diagonal matrix. The geometric interpretation is thus seen to be that x is rotated by the unitary matrix V^T, rescaled by the diagonal matrix Σ, rotated again by the unitary matrix U, and finally translated by the vector b.
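This decomposition can be verified numerically; the matrix A and the offset b below are arbitrary examples:

```python
import numpy as np

# SVD view of an affine transform y = Ax + b: the linear part factors
# into rotation (V^T), axis scaling (Sigma), rotation (U); b translates.
A = np.array([[2.0, 1.0],
              [0.5, 1.5]])
b = np.array([1.0, -2.0])
x = np.array([0.3, 0.7])

U, sigma, Vt = np.linalg.svd(A)
y_direct = A @ x + b
y_steps = U @ (np.diag(sigma) @ (Vt @ x)) + b  # rotate, rescale, rotate, translate
print(np.allclose(y_direct, y_steps))          # True
```

The step-by-step form makes explicit which part of an environmental mismatch rotates, scales, or shifts the cepstral clusters seen in Fig. 27.4.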
27.6 Transformation of Predictor Coefficients
It will be proved in this section that the contamination of a speech signal by a stationary convolutional channel and random white noise is equivalent to a signal-dependent linear transformation of the predictor coefficients. The conclusion drawn here will be used in the next section to show that the effect of environmental interference is equivalent to an affine transform in the cepstrum domain.
27.6.1 Deterministic Convolutional Channel as a Linear Transform
When a sample sequence is passed through a convolutional channel with impulse response h(n), the filtered signal s′(n) obtained at the output of the channel is

    s′(n) = h(n) ⊗ s(n) .    (27.18)
If the power spectra of the signals s(n) and s′(n) are denoted S_s(ω) and S_{s′}(ω), respectively, then

    S_{s′}(ω) = |H(ω)|² S_s(ω) .    (27.19)

Therefore, in the time domain,
    r_{s′}(k) = [h(n) ⊗ h(−n)] ⊗ r_s(k) = r_h(k) ⊗ r_s(k) ,    (27.20)

where r_s(k) and r_{s′}(k) are the autocorrelations of the input and output signals. The autocorrelation of the impulse response h(n) is denoted r_h(k) and by definition,

    r_h(k) = h(n) ⊗ h(−n) .    (27.21)
If the impulse response h(n) is assumed to be zero outside the interval [0, p − 1], then

    r_h(k) = 0  for |k| > p − 1 .    (27.22)

Equation (27.20) can therefore be rewritten in matrix form as
    ⎡ r_{s′}(0)   ⎤   ⎡ r_h(0)     r_h(1)     r_h(2)     ···  r_h(p−1) ⎤ ⎡ r_s(0)   ⎤
    ⎢ r_{s′}(1)   ⎥ = ⎢ r_h(1)     r_h(0)     r_h(1)     ···  r_h(p−2) ⎥ ⎢ r_s(1)   ⎥ .    (27.23)
    ⎢    ⋮        ⎥   ⎢    ⋮                                      ⋮    ⎥ ⎢   ⋮      ⎥
    ⎣ r_{s′}(p−1) ⎦   ⎣ r_h(p−1)   r_h(p−2)   r_h(p−3)   ···  r_h(0)   ⎦ ⎣ r_s(p−1) ⎦

R_h refers to the autocorrelation matrix of the impulse response of the channel on the right-hand side of the above equation.
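The underlying identity r_{s′}(k) = r_h(k) ⊗ r_s(k) of Eq. (27.20) holds exactly for finite-length deterministic sequences and can be checked numerically; the signal and channel below are arbitrary examples:

```python
import numpy as np

# Check Eq. (27.20): the autocorrelation of the channel output equals the
# convolution of the channel and signal autocorrelations.
rng = np.random.default_rng(1)
s = rng.standard_normal(64)              # example input sequence
h = np.array([1.0, -0.4, 0.1])           # example channel impulse response
s_out = np.convolve(h, s)                # s'(n) = h(n) (*) s(n)

r_s = np.correlate(s, s, mode="full")        # r_s(k) = s(n) (*) s(-n)
r_h = np.correlate(h, h, mode="full")        # r_h(k) = h(n) (*) h(-n)
r_out = np.correlate(s_out, s_out, mode="full")
print(np.allclose(r_out, np.convolve(r_h, r_s)))  # True
```

Since correlation is convolution with a time-reversed copy, r_{s′} = (h ⊗ s) ⊗ (h(−n) ⊗ s(−n)) = r_h ⊗ r_s follows from the commutativity of convolution, which is what the check confirms.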
The autocorrelation matrix R_{s′} of the filtered signal s′(n) is then

    R_{s′} = ⎡ r_{s′}(0)     r_{s′}(1)     r_{s′}(2)     ···  r_{s′}(p−1) ⎤
             ⎢ r_{s′}(1)     r_{s′}(0)     r_{s′}(1)     ···  r_{s′}(p−2) ⎥
             ⎢     ⋮                                               ⋮      ⎥
             ⎣ r_{s′}(p−1)   r_{s′}(p−2)   r_{s′}(p−3)   ···  r_{s′}(0)   ⎦

           = ⎡ r_h(0)     r_h(1)     r_h(2)     ···  r_h(p−1) ⎤   ⎡ r_s(0)     r_s(1)     ···  r_s(p−1) ⎤
             ⎢ r_h(1)     r_h(0)     r_h(1)     ···  r_h(p−2) ⎥ × ⎢ r_s(1)     r_s(0)     ···  r_s(p−2) ⎥
             ⎢    ⋮                                      ⋮    ⎥   ⎢    ⋮                           ⋮    ⎥
             ⎣ r_h(p−1)   r_h(p−2)   r_h(p−3)   ···  r_h(0)   ⎦   ⎣ r_s(p−1)   r_s(p−2)   ···  r_s(0)   ⎦