Mammone, R.J. & Zhang, X. "Robust Speech Processing as an Inverse Problem"
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
Robust Speech Processing as an Inverse Problem
Richard J Mammone
Rutgers University
Xiaoyu Zhang
Rutgers University
27.1 Introduction
27.2 Speech Production and Spectrum-Related Parameterization
27.3 Template-Based Speech Processing
27.4 Robust Speech Processing
27.5 Affine Transform
27.6 Transformation of Predictor Coefficients
    Deterministic Convolutional Channel as a Linear Transform • Additive Noise as a Linear Transform
27.7 Affine Transform of Cepstral Coefficients
27.8 Parameters of Affine Transform
27.9 Correspondence of Cepstral Vectors
References
27.1 Introduction
This section addresses the inverse problem in robust speech processing. A problem that speaker and speech recognition systems regularly encounter in commercial applications is the dramatic degradation of performance due to a mismatch between the training and operating environments. The mismatch generally results from the diversity of operating environments. For applications over the telephone network, the operating environment may vary from offices and laboratories to households and airports. The problem becomes worse when speech is transmitted over a wireless network: here the system experiences cross-channel interference in addition to the channel and noise degradations present in the regular telephone network. The key issue in robust speech processing is to obtain good performance regardless of the mismatch in environmental conditions. The inverse problem in this sense refers to the process of modeling the mismatch as a transformation and resolving it via an inverse transformation. In this section, we introduce a method of modeling the mismatch as an affine transformation.
Before getting into the details of the inverse problem in robust speech processing, we give a brief review of the mechanism of speech production, as well as the extraction of useful information from speech for recognition purposes.
1999 by CRC Press LLC
27.2 Speech Production and Spectrum-Related Parameterization
The speech signal consists of time-varying acoustic waveforms produced as a result of acoustical excitation of the vocal tract. It is nonstationary in that the vocal tract configuration changes over time. A time-varying digital filter is generally used to describe the vocal tract characteristics. The steady-state system function of the filter is of the form [1, 2]:
    S(z) = G / ∏_{i=1}^{p} (1 − z_i z^{−1}) ,    (27.1)

where p is the order of the system and the z_i denote the poles of the transfer function. The time domain representation of this filter is
    s(n) = ∑_{i=1}^{p} a_i s(n − i) + G u(n) .    (27.2)
The speech sample s(n) is predicted as a linear combination of the previous p samples plus the excitation G u(n), where G is the gain factor. The factor G is generally ignored in recognition-type tasks to allow for robustness to variations in the energy of speech signals. This speech production model is often referred to as the linear prediction (LP) model, or the autoregressive model, and the coefficients a_i are called the predictor coefficients.
The cepstrum of the speech signal s(n) is defined as
    c(n) = (1/2π) ∫_{−π}^{π} log |S(e^{jω})| e^{jωn} dω .    (27.3)

It is simply the inverse Fourier transform of the logarithm of the magnitude of the Fourier transform S(e^{jω}) of the signal s(n).
From the definition of the cepstrum in Eq. (27.3), we have
    ∑_{n=−∞}^{∞} c(n) e^{−jωn} = log |S(e^{jω})| = log [ G / ∏_{i=1}^{p} |1 − z_i e^{−jω}| ] .    (27.4)
If we differentiate both sides of the equation with respect to ω and equate the coefficients of like powers of e^{jω}, the following recursion is obtained:
    c(n) = { log G ,                                     n = 0
           { a(n) + (1/n) ∑_{i=1}^{n−1} i c(i) a(n − i), n > 0    (27.5)
The cepstral coefficients can be calculated using the recursion once the predictor coefficients are solved. The zeroth-order cepstral coefficient is generally ignored in speech and speaker recognition due to its sensitivity to the gain factor, G.
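The recursion in Eq. (27.5) is easy to implement directly. The sketch below is a minimal Python illustration (not from the text); it ignores the gain term, so the zeroth coefficient is omitted:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c(1)..c(n_ceps) from predictor coefficients
    a[0] = a(1), ..., a[p-1] = a(p), via the recursion of Eq. (27.5).
    The gain term (c(0) = log G) is ignored, as is usual in recognition."""
    p = len(a)
    c = np.zeros(n_ceps + 1)                 # c[0] unused
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0    # a(n) = 0 beyond order p
        for i in range(1, n):
            if 1 <= n - i <= p:
                acc += (i / n) * c[i] * a[n - i - 1]
        c[n] = acc
    return c[1:]

# Single-pole sanity check: for a pole at z = 0.5 the cepstrum has the
# closed form c(n) = 0.5**n / n, which the recursion reproduces.
print(lpc_to_cepstrum([0.5], 4))
```

The single-pole closed form provides a quick check of any implementation of the recursion.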
An alternative solution for the cepstral coefficients is given by
    c(n) = (1/n) ∑_{i=1}^{p} z_i^n ,  n > 0 .    (27.6)
It is obtained by equating the terms of like powers of z^{−1} in the following equation:

    ∑_{n=−∞}^{∞} c(n) z^{−n} = log ∏_{i=1}^{p} 1/(1 − z_i z^{−1}) = − ∑_{i=1}^{p} log [1 − z_i z^{−1}] ,    (27.7)
where the logarithm terms can be written as a power series expansion given as

    log [1 − z_i z^{−1}] = − ∑_{k=1}^{∞} (1/k) z_i^k z^{−k} .    (27.8)
There are two standard methods of solving for the predictor coefficients a_i, namely, the autocorrelation method and the covariance method [3, 4, 5, 6]. Both approaches are based on minimizing the mean square value of the estimation error e(n) as given by
    e(n) = s(n) − ∑_{i=1}^{p} a_i s(n − i) .    (27.9)
The two methods differ with respect to the details of numerical implementation. The autocorrelation method assumes that the speech samples are zero outside the processing interval of N samples. This results in a nonzero prediction error, e(n), outside the interval. The covariance method fixes the interval over which the prediction error is computed and places no constraints on the sample values outside the interval. The autocorrelation method is computationally simpler than the covariance approach and assures a stable system where all poles of the transfer function lie within the unit circle. A brief description of the autocorrelation method is given as follows.
The autocorrelation of the signal s(n) is defined as

    r_s(k) = ∑_{n=0}^{N−1−k} s(n) s(n + k) = s(n) ⊗ s(−n) ,    (27.10)
where N is the number of samples in the sequence s(n) and the sign ⊗ denotes the convolution operation. The definition of autocorrelation implies that r_s(k) is an even function. The predictor coefficients a_i can therefore be obtained by solving the following set of equations:
    ⎡ r_s(0)     r_s(1)     ···  r_s(p−1) ⎤ ⎡ a_1 ⎤   ⎡ r_s(1) ⎤
    ⎢ r_s(1)     r_s(0)     ···  r_s(p−2) ⎥ ⎢  ⋮  ⎥ = ⎢   ⋮    ⎥ .    (27.11)
    ⎢    ⋮          ⋮                ⋮    ⎥ ⎢     ⎥   ⎢        ⎥
    ⎣ r_s(p−1)   r_s(p−2)   ···  r_s(0)   ⎦ ⎣ a_p ⎦   ⎣ r_s(p) ⎦
Denoting the p × p Toeplitz autocorrelation matrix on the left-hand side by R_s, the predictor coefficient vector by a, and the vector of autocorrelation coefficients by r_s, we have R_s a = r_s. The predictor coefficient vector a is obtained by the inverse relation

    a = R_s^{−1} r_s .

This equation will be used throughout the analysis in the rest of this article. Since the matrix R_s is Toeplitz, a computationally efficient algorithm known as the Levinson-Durbin recursion can be used to solve for a [3].
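As a concrete sketch (an illustration, not the text's implementation), the autocorrelation method can be written in a few lines of Python; a general-purpose linear solver stands in for the Levinson-Durbin recursion:

```python
import numpy as np

def lpc_autocorrelation(s, p):
    """Autocorrelation-method LPC: build the Toeplitz system of
    Eq. (27.11) and solve a = R_s^{-1} r_s."""
    N = len(s)
    r = np.array([np.dot(s[:N - k], s[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

# Synthetic check: generate a long AR(2) signal with known (arbitrary)
# predictor coefficients, then recover them from the samples.
rng = np.random.default_rng(0)
a_true = np.array([0.75, -0.5])                 # stable AR(2) model
u = rng.standard_normal(20000)                  # white excitation u(n)
s = np.zeros_like(u)
for n in range(2, len(s)):
    s[n] = a_true @ s[n - 2:n][::-1] + u[n]     # s(n) = a1 s(n-1) + a2 s(n-2) + u(n)
a_est = lpc_autocorrelation(s, 2)
print(a_est)                                    # close to [0.75, -0.5]
```

In production code one would use the Levinson-Durbin recursion to exploit the Toeplitz structure; the direct solve above is only for clarity.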
27.3 Template-Based Speech Processing
Template-based matching algorithms for speech processing generally exploit the similarity of the vocal tract characteristics embedded in the spectrum of a particular speech sound.
There are two types of speech sounds, namely, voiced and unvoiced sounds. Figure 27.1 shows the speech waveforms, the spectra, and the spectral envelopes of voiced and unvoiced sounds. Voiced sounds such as the vowel /a/ and the nasal sound /n/ are produced by the passage of a quasi-periodic air wave through the vocal tract that creates resonances in the speech waveforms known as formants. The quasi-periodic air wave is generated as a result of the vibration of the vocal cords. The fundamental frequency of the vibration is known as the pitch. In the case of generating fricative sounds such as /sh/, the vocal tract is excited by random noise, resulting in speech waveforms exhibiting no periodicity, as can be seen in Fig. 27.1. Therefore, the spectral envelopes of voiced sounds consistently exhibit the pitch as well as three to five formants when the sampling rate is 8 kHz, whereas the spectral envelopes of unvoiced sounds reveal no pitch or formant characteristics. In addition, the formants of different voiced sounds differ with respect to their shape and the location of their center frequencies. This is due to the unique shape of the vocal tract formed to produce a particular sound. Thus, different sounds can be distinguished based on attributes of the spectral envelope.
The cepstral distance given by

    d = ∑_{n=−∞}^{∞} [c(n) − c′(n)]²    (27.12)
is one of the metrics for measuring the similarity of two spectral envelopes. The reason is as follows. From the definition of the cepstrum, we have
    ∑_{n=−∞}^{∞} [c(n) − c′(n)] e^{jωn} = log |S(e^{jω})| − log |S′(e^{jω})|
                                        = log [ |S(e^{jω})| / |S′(e^{jω})| ] .    (27.13)
The Fourier transform of the difference between a pair of cepstra is equal to the difference between the corresponding log spectra. By applying Parseval's theorem, the cepstral distance can be related to the log spectral distance as
    d = ∑_{n=−∞}^{∞} [c(n) − c′(n)]² = (1/2π) ∫_{−π}^{π} [ log |S(e^{jω})| − log |S′(e^{jω})| ]² dω .    (27.14)
The cepstral distance is usually approximated by the distance between the first few lower-order cepstral coefficients, the reason being that the magnitude of the high-order cepstral coefficients is small and makes a negligible contribution to the cepstral distance.
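The Parseval relation in Eq. (27.14) can be checked numerically with a DFT-based cepstrum; the two single-pole spectra below are arbitrary examples chosen for illustration:

```python
import numpy as np

# Compare the cepstral distance with the mean-square log-spectral
# distance for two single-pole spectra (poles at 0.8 and 0.5 are
# illustrative choices, not from the text).
M = 4096
w = 2 * np.pi * np.arange(M) / M
log_S1 = -np.log(np.abs(1 - 0.8 * np.exp(-1j * w)))   # log|S(e^{jw})|
log_S2 = -np.log(np.abs(1 - 0.5 * np.exp(-1j * w)))   # log|S'(e^{jw})|

c1 = np.fft.ifft(log_S1).real      # DFT approximation of Eq. (27.3)
c2 = np.fft.ifft(log_S2).real
d_ceps = np.sum((c1 - c2) ** 2)              # sum over all cepstral indices
d_spec = np.mean((log_S1 - log_S2) ** 2)     # (1/2pi) * integral of squared log ratio
print(abs(d_ceps - d_spec))                  # agrees to machine precision
```

Truncating the cepstral sum to the first few dozen indices changes the distance only slightly, since the coefficients decay like z^n/n; this is the justification for the low-order approximation above.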
27.4 Robust Speech Processing
Robust speech processing attempts to maintain the performance of speaker and speech recognition systems when variations in the operating environment are encountered. This can be accomplished if the similarity in vocal tract structures of the same sound can be recovered under adverse conditions. Figure 27.2 illustrates how the deterministic channel and random noise contaminate a speech signal during the recording and transmission of the signal.
First of all, at the front end of the speech acquisition system, additive background noise N_1(ω) from the speaking environment distorts the speech waveform. Adverse background conditions are also found to put stress on the speech production system and change the characteristics of the vocal tract. This is equivalent to performing a linear filtering of the speech. This problem will be addressed in another chapter and will not be discussed here.

FIGURE 27.1: Illustration of voiced/unvoiced speech.

FIGURE 27.2: The speech acquisition system.
After being sampled and quantized, the speech samples corrupted by the background noise N_1(ω) are then passed through a transmission channel, such as a telephone network, to reach the receiver's site. The transmission channel generally involves two types of degradation sources: a deterministic convolutional filter with transfer function H(ω), and additive noise, denoted by N_2(ω) in Fig. 27.2.
The signal observed at the output of the system is, therefore,

    Y(ω) = H(ω) [X(ω) + N_1(ω)] + N_2(ω) .    (27.15)

The spectrum of the output signal is corrupted by both additive and multiplicative interferences. The multiplicative interference due to the linear channel H(ω) is sometimes referred to as the multiplicative noise.
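A time-domain sketch of this degradation model is given below; the sinusoidal "speech", the channel taps, and the noise levels are all illustrative assumptions rather than values from the text:

```python
import numpy as np

# Time-domain version of Eq. (27.15): background noise n1 is added before
# the channel, the sum is convolved with the channel impulse response h,
# and channel noise n2 is added afterwards.
rng = np.random.default_rng(0)
n = np.arange(8000)
x = np.sin(2 * np.pi * 0.05 * n)            # stand-in for clean speech X
n1 = 0.05 * rng.standard_normal(n.size)     # background noise N1
h = np.array([1.0, 0.6, 0.2])               # channel impulse response H
n2 = 0.05 * rng.standard_normal(n.size)     # channel noise N2
y = np.convolve(x + n1, h)[: n.size] + n2   # observed signal Y
```

The convolution contributes the multiplicative (in frequency) interference, while n1 and n2 contribute the additive interference; both are present in y simultaneously, which is exactly the composite condition discussed next.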
The various sources of degradation cause distortions of the predictor coefficients and the cepstral coefficients. Figure 27.4 shows the change in the spatial clustering of the cepstral coefficients due to interference from a linear channel, white noise, and the composite effect of both.
• When the speech is filtered by a linear bandpass channel, the frequency response of which is shown in Fig. 27.3, a translation of the cepstral clusters is observed, as shown in Fig. 27.4(b).
• When the speech is corrupted by Gaussian white noise at 15 dB SNR, a shrinkage of the cepstral vectors results. This is shown in Fig. 27.4(c), where it can be seen that the cepstral clusters move toward the origin.
• When the speech is degraded by both the linear channel and Gaussian white noise, the cepstral vectors are translated and scaled simultaneously.
There are three underlying ideas behind the various solutions to robust speech processing. The first is to recover the speech signal from the noisy observation by removing an estimate of the noise from the signal. This is also known as the speech enhancement approach. Methods that operate in the speech sample domain include noise suppression [7] and noise masking [8]. Other speech enhancement methods are carried out in the feature domain, for example, cepstral mean subtraction (CMS) and pole-filtered cepstral mean subtraction (PFCMS). In the second category, the key to the problem is to find feature sets that are invariant¹ to changes in the transmission channel and environmental noise. The liftered cepstrum [9] and the adaptive component weighted (ACW) cepstrum [10] are examples of this feature enhancement approach. A third category consists of methods for matching the testing features with the models after adaptation to the environmental conditions [11, 12, 13, 14]. In this case, the presence of noise in the training and testing environments is tolerable as long as an adaptation algorithm can be found to match the conditions. The adaptation can be performed in either of two directions, i.e., adapt the training data to the testing environment, or adapt the testing data to the training environment.

FIGURE 27.3: The simulated environmental interference. (a) Medium voiced channel and (b) Gaussian white noise.
The focus of the following discussion will be on viewing robust speech processing as an inverse problem. We utilize the fact that both deterministic and non-deterministic noise introduce a sound-dependent linear transformation of the predictor coefficients of speech. This can be approximated by an affine transformation in the cepstrum domain. The mismatch can, therefore, be resolved by solving for the inverse affine transformation of the cepstral coefficients.
27.5 Affine Transform
An affine transform y of a vector x is defined as

    y = A x + b ,  for b ≠ 0 .    (27.16)

The matrix A represents the linear transformation of the vector x, and b is a nonzero vector representing a translation. Note that the addition of the vector b causes the transform to become nonlinear.
The singular value decomposition (SVD) of the matrix A can be used to gain some insight into the geometry of an affine transform, i.e.,

    y = U Σ V^T x + b ,    (27.17)
¹ In practice, it is difficult to find a set of features invariant to environmental changes. The robust features currently used are mostly just less sensitive to environmental changes.
FIGURE 27.4: The spatial distribution of cepstral coefficients under various conditions; "∗" for the sound /a/, "o" for the sound /n/, and "+" for the sound /sh/. (a) Cepstrum of the clean speech; (b) cepstrum of signals filtered by the continental U.S. mid-voice channel (CMV); (c) cepstrum of signals with 15 dB SNR, the noise type being additive white Gaussian (AWG); (d) cepstrum of speech corrupted by both the CMV channel and AWG noise at 15 dB SNR.
where U and V^T are unitary matrices and Σ is a diagonal matrix. The geometric interpretation is thus seen to be that x is rotated by the unitary matrix V^T, rescaled by the diagonal matrix Σ, rotated again by the unitary matrix U, and finally translated by the vector b.
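This decomposition can be verified numerically; the matrix A and the offset b below are arbitrary examples:

```python
import numpy as np

# SVD view of an affine transform y = Ax + b: the linear part factors
# into rotation (V^T), axis scaling (Sigma), rotation (U); b translates.
A = np.array([[2.0, 1.0],
              [0.5, 1.5]])
b = np.array([1.0, -2.0])
x = np.array([0.3, 0.7])

U, sigma, Vt = np.linalg.svd(A)
y_direct = A @ x + b
y_steps = U @ (np.diag(sigma) @ (Vt @ x)) + b  # rotate, rescale, rotate, translate
print(np.allclose(y_direct, y_steps))          # True
```

The step-by-step form makes explicit which part of an environmental mismatch rotates, scales, or shifts the cepstral clusters seen in Fig. 27.4.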
27.6 Transformation of Predictor Coefficients
It will be proved in this section that the contamination of a speech signal by a stationary convolutional channel and random white noise is equivalent to a signal-dependent linear transformation of the predictor coefficients. The conclusion drawn here will be used in the next section to show that the effect of environmental interference is equivalent to an affine transform in the cepstrum domain.
27.6.1 Deterministic Convolutional Channel as a Linear Transform
When a sample sequence is passed through a convolutional channel with impulse response h(n), the filtered signal s′(n) obtained at the output of the channel is

    s′(n) = h(n) ⊗ s(n) .    (27.18)
If the power spectra of the signals s(n) and s′(n) are denoted S_s(ω) and S_{s′}(ω), respectively, then

    S_{s′}(ω) = |H(ω)|² S_s(ω) .    (27.19)

Therefore, in the time domain,
    r_{s′}(k) = [h(n) ⊗ h(−n)] ⊗ r_s(k) = r_h(k) ⊗ r_s(k) ,    (27.20)

where r_s(k) and r_{s′}(k) are the autocorrelations of the input and output signals. The autocorrelation of the impulse response h(n) is denoted r_h(k) and by definition,

    r_h(k) = h(n) ⊗ h(−n) .    (27.21)
If the impulse response h(n) is assumed to be zero outside the interval [0, p − 1], then

    r_h(k) = 0  for |k| > p − 1 .    (27.22)

Equation (27.20) can therefore be rewritten in matrix form as
    ⎡ r_{s′}(0)   ⎤   ⎡ r_h(0)     r_h(1)     r_h(2)     ···  r_h(p−1) ⎤ ⎡ r_s(0)   ⎤
    ⎢ r_{s′}(1)   ⎥ = ⎢ r_h(1)     r_h(0)     r_h(1)     ···  r_h(p−2) ⎥ ⎢ r_s(1)   ⎥ .    (27.23)
    ⎢    ⋮        ⎥   ⎢    ⋮                                      ⋮    ⎥ ⎢   ⋮      ⎥
    ⎣ r_{s′}(p−1) ⎦   ⎣ r_h(p−1)   r_h(p−2)   r_h(p−3)   ···  r_h(0)   ⎦ ⎣ r_s(p−1) ⎦

R_h refers to the autocorrelation matrix of the impulse response of the channel on the right-hand side of the above equation.
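The underlying identity r_{s′}(k) = r_h(k) ⊗ r_s(k) of Eq. (27.20) holds exactly for finite-length deterministic sequences and can be checked numerically; the signal and channel below are arbitrary examples:

```python
import numpy as np

# Check Eq. (27.20): the autocorrelation of the channel output equals the
# convolution of the channel and signal autocorrelations.
rng = np.random.default_rng(1)
s = rng.standard_normal(64)              # example input sequence
h = np.array([1.0, -0.4, 0.1])           # example channel impulse response
s_out = np.convolve(h, s)                # s'(n) = h(n) (*) s(n)

r_s = np.correlate(s, s, mode="full")        # r_s(k) = s(n) (*) s(-n)
r_h = np.correlate(h, h, mode="full")        # r_h(k) = h(n) (*) h(-n)
r_out = np.correlate(s_out, s_out, mode="full")
print(np.allclose(r_out, np.convolve(r_h, r_s)))  # True
```

Since correlation is convolution with a time-reversed copy, r_{s′} = (h ⊗ s) ⊗ (h(−n) ⊗ s(−n)) = r_h ⊗ r_s follows from the commutativity of convolution, which is what the check confirms.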
The autocorrelation matrix R_{s′} of the filtered signal s′(n) is then

    R_{s′} = ⎡ r_{s′}(0)     r_{s′}(1)     r_{s′}(2)     ···  r_{s′}(p−1) ⎤
             ⎢ r_{s′}(1)     r_{s′}(0)     r_{s′}(1)     ···  r_{s′}(p−2) ⎥
             ⎢     ⋮                                               ⋮      ⎥
             ⎣ r_{s′}(p−1)   r_{s′}(p−2)   r_{s′}(p−3)   ···  r_{s′}(0)   ⎦

           = ⎡ r_h(0)     r_h(1)     r_h(2)     ···  r_h(p−1) ⎤   ⎡ r_s(0)     r_s(1)     ···  r_s(p−1) ⎤
             ⎢ r_h(1)     r_h(0)     r_h(1)     ···  r_h(p−2) ⎥ × ⎢ r_s(1)     r_s(0)     ···  r_s(p−2) ⎥
             ⎢    ⋮                                      ⋮    ⎥   ⎢    ⋮                           ⋮    ⎥
             ⎣ r_h(p−1)   r_h(p−2)   r_h(p−3)   ···  r_h(0)   ⎦   ⎣ r_s(p−1)   r_s(p−2)   ···  r_s(0)   ⎦