
15.3 Stochastic Models for Speech

For all the physical insight gained from studying speech production and perception, speech coding has only the actual acoustic waveform to work with and, as this is an information-carrying signal, a stochastic modeling framework is called for. First, we need to decide how to incorporate the time-varying aspects of the physical signal generation mechanism; this suggests the use of a nonstationary stochastic process model.⁵ There are two "external" sources for this nonstationarity, best described in terms of the time variations of the acoustic wave propagation channel:

• The movements of the articulators shape the boundary conditions of the vocal tract at a rate of 10 to 20 speech sounds per second. As the vocal tract impulse response typically has a delay spread of less than 50 ms, a short-time stationary representation is adequate for this aspect of nonstationarity.

• The movements of the vocal folds at fundamental frequencies from 100 to 500 Hz give rise to a nearly periodic change in the boundary conditions of the vocal tract at its lower end – i.e., the vocal folds. In this case, comparison with the delay spread suggests that the time variation is too fast for a short-time stationary model. Doppler-shift-induced modulation effects become important, and only a cyclostationary representation can cope with this effect (compare Section 15.3.5).

⁵ An alternative view would introduce a hypermodel that controls the evolution of the time-varying speech production model parameters. A stationary hypermodel could reflect the stationary process of speaking randomly selected utterances, such that the overall two-tiered speech signal model would turn out to be stationary.


Most classical speech models neglect cyclostationary aspects, so let us begin by discussing the class of short-time stationary models. For these models, stochastic properties vary slowly enough to allow the use of a subsampled estimator – i.e., we need to reestimate the model parameters only once per speech signal frame, where in most systems a new frame is obtained every 20 ms (corresponding to $N = 160$ samples at 8 kHz). For some applications – like signal generation, which uses the parameterized model in the decoder – the parameters need to be updated more frequently (e.g., for every 5-ms subframe), which can be achieved by interpolation from the available frame-based estimates.
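As an illustration of this subframe update, here is a minimal sketch that linearly interpolates a generic parameter vector between two consecutive frame estimates. The parameter values are hypothetical; practical coders perform such interpolation in a well-behaved domain (e.g., line spectral frequencies) rather than directly on predictor coefficients.

```python
import numpy as np

FRAME = 160      # 20-ms frame at 8 kHz
SUBFRAME = 40    # 5-ms subframe

def interpolate_params(prev, curr, n_sub=FRAME // SUBFRAME):
    """Linearly interpolate a parameter vector from the previous frame
    estimate `prev` to the current estimate `curr`, yielding one parameter
    set per subframe (the last subframe uses `curr` exactly)."""
    return [(1 - w) * prev + w * curr
            for w in np.arange(1, n_sub + 1) / n_sub]

# usage with two hypothetical parameter vectors
prev, curr = np.array([0.9, -0.4]), np.array([0.7, -0.2])
for k, p in enumerate(interpolate_params(prev, curr)):
    print(f"subframe {k}: {p}")
```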

15.3.1 Wold's Decomposition

Within a frame, the signal is regarded as a sample function of a stationary stochastic process. This view allows us to apply Wold's decomposition [Papoulis 1985], which guarantees that any stationary stochastic process can be decomposed into the sum of two components: a regular component $x_n^{(r)}$, which is best understood as filtered noise and which is not perfectly predictable using linear systems, and a singular component $x_n^{(s)}$, which is essentially a sum of sinusoids that can be perfectly predicted with linear systems:

\[
x_n = x_n^{(r)} + x_n^{(s)} \tag{15.1}
\]

Note that the sinusoids need not be harmonically related and may have random phases and amplitudes. In the following three subsections, we will show that this result – already established in the theory of stochastic processes by 1938 – serves as the basis for a number of current speech models, which differ only in the implementation details of this generic approach. These models are the Linear Predictive voCoder (LPC), sinusoidal modeling, and Harmonic + Noise Modeling (HNM).
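To make Eq. (15.1) concrete, the following sketch synthesizes a sample function as the sum of the two components: filtered noise for the regular part and a few sinusoids with random amplitudes and phases for the singular part. The filter and the frequencies are arbitrary choices for illustration only.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
n = np.arange(2000)

# Regular component: white noise shaped by an (arbitrary) stable filter.
x_regular = lfilter([1.0], [1.0, -0.8], rng.standard_normal(n.size))

# Singular component: sinusoids with random amplitudes and phases;
# the frequencies need not be harmonically related.
freqs = np.array([0.03, 0.05, 0.11])                 # cycles/sample
amps = rng.uniform(0.5, 1.5, freqs.size)
phases = rng.uniform(0.0, 2.0 * np.pi, freqs.size)
x_singular = sum(a * np.cos(2 * np.pi * f * n + p)
                 for a, f, p in zip(amps, freqs, phases))

# Wold's decomposition, Eq. (15.1): the process is the sum of the two.
x = x_regular + x_singular
```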

15.3.2 Linear Predictive voCoder (LPC)

The first implementation of Wold's decomposition emphasizes its relationship to Linear Prediction (LP) theory, as first proposed by Itakura and Saito [1968]. The model derives its structure from the representation of the regular component as white noise run through a linear filter that is causal, stable, and has a causal and stable inverse – i.e., a minimum-phase filter. The same filter is used for the generation of the singular component in voiced speech, where it mostly consists of a number of harmonically related sinusoids that can be modeled by running a periodic pulse train through the linear filter. This allows flexible shaping of the spectral envelope of all harmonics but does not preserve their original phase information, because the phases are now determined by the phase response of the minimum-phase filter, which is strictly coupled to its logarithmic amplitude response (via a Hilbert transform). Furthermore, as there is only one filter for two excitation types (white noise and pulse trains), the model simplifies their additive superposition to a hard switch in the time domain, requiring strict temporal segregation of noise-like and nearly periodic speech (compare the decoder block diagram shown in Figure 15.3). This simplification results in a systematic model mismatch for mixed excitation signals.
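A minimal sketch of the decoder structure of Figure 15.3, with the hard voiced/unvoiced switch between pulse-train and noise excitation described above. The filter coefficients and gain below are placeholder values, not taken from any standard.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synthesize(a, gain, n_samples, voiced, t0=None, rng=None):
    """LPC vocoder signal generator (cf. Figure 15.3).
    a      : A(z) coefficients [1, a_1, ..., a_m] of the all-pole filter
    voiced : hard switch between pulse-train and white-noise excitation
    t0     : fundamental period in samples (voiced frames only)"""
    rng = rng if rng is not None else np.random.default_rng()
    if voiced:
        u = np.zeros(n_samples)
        u[::t0] = 1.0                          # pulse train with period T0
    else:
        u = rng.standard_normal(n_samples)     # white noise
    return lfilter([gain], a, u)               # minimum-phase filter 1/A(z)

# usage: one 20-ms voiced frame at 8 kHz with f0 = 80 Hz (placeholder values)
a = [1.0, -1.2, 0.5]                           # hypothetical stable A(z)
x = lpc_synthesize(a, gain=0.3, n_samples=160, voiced=True, t0=100)
```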

Linear Prediction Analysis

The LPC encoder has to estimate the model parameters for every given frame, and to quantize and code them for transmission. We will only discuss estimation of the LP filter parameters here; estimation of the fundamental period $T_0 = 1/f_0$ is treated later in this text (Section 15.4.3).

[Figure 15.3: Linear predictive vocoder signal generator as used in the decoder. A hard switch selects either a pulse train with period $T_0$ or white noise as excitation $u[n]$, which is scaled by the gain $g$ and passed through the minimum-phase filter $1/A(z)$ to produce the speech signal $x[n]$.]

As a first step, specification of the minimum-phase filter is narrowed down to an all-pole filter – i.e., a filter where all transfer function zeros are concentrated in the origin of the z-plane and where only the poles are used to shape its frequency response. This is justified by two considerations:

• The vocal tract is mainly characterized by its resonance frequencies and our perception is more sensitive to spectral peaks than to valleys.

• For a minimum-phase filter, all the poles and zeros lie inside the unit circle. In this case, a zero at position $z_0$ with $|z_0| < 1$ can be replaced by a geometric series of poles, which converges for $|z| = 1$ (a quick numerical check follows below):

\[
(1 - z_0 z^{-1}) = \frac{1}{1 + z_0 z^{-1} + z_0^2 z^{-2} + \cdots} \tag{15.2}
\]
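As promised, a quick numerical check of Eq. (15.2): on the unit circle, a long but truncated geometric series of pole terms reproduces the zero factor to high accuracy, since $|z_0| < 1$ guarantees convergence. The choice of $z_0$ and the truncation order are arbitrary.

```python
import numpy as np

z0 = 0.9 * np.exp(1j * 0.4)         # arbitrary zero inside the unit circle
z = np.exp(1j * np.linspace(0.0, np.pi, 512))   # points on the unit circle

zero_factor = 1 - z0 / z                        # (1 - z0 z^-1)
# truncated geometric series of poles: 1 + z0 z^-1 + z0^2 z^-2 + ...
series = sum((z0 / z) ** k for k in range(200))

# the deviation decays like |z0|^200, so it is negligible here
print(f"max deviation: {np.max(np.abs(zero_factor - 1 / series)):.2e}")
```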

This allows us to write the signal model in the z-transform domain as

\[
X(z) = \frac{U(z)}{A(z)} \tag{15.3}
\]

where the speech signal is represented by $X(z)$, the excitation signal by $U(z)$, and the filter transfer function by $1/A(z)$. In the time domain, this reads as

\[
x_n = -\sum_{i=1}^{m} a_i x_{n-i} + u_n \tag{15.4}
\]

where the filter or predictor order is chosen as $m = 10$ for speech sampled at 8 kHz and the parameters $a_i$ are known as predictor coefficients, with normalization⁶ $a_0 = 1$.

For a given speech frame, we estimate model parameters $\hat{a}_i$ such that the model mismatch is minimized. This mismatch can be observed through the prediction error signal $e_n$:

\[
e_n = x_n - \hat{x}_n = x_n + \sum_{i=1}^{m} \hat{a}_i x_{n-i} = \sum_{i=1}^{m} (\hat{a}_i - a_i)\, x_{n-i} + u_n \tag{15.5}
\]

For uncorrelated excitation $u_n$, the prediction error power achieves its minimum iff $\hat{a}_i = a_i$ for $i = 1, \ldots, m$. In this case, the prediction error signal $e_n$ becomes identical to the excitation signal $u_n$. Note that we use the prediction framework only for model fitting, not for forecasting or extrapolation of future signal samples. To apply this estimator to short-time stationary speech data, for every frame

⁶ The gain $g$ can be included in the excitation signal amplitude.


with an update rate of $N$ samples, a window of $L \geq N$ samples is chosen, where the greater sign means that the windows have some overlap or lookahead. Typically, a special window function $w_n$ is applied to mitigate artifacts due to the data discontinuity introduced by the windowing mechanism. Asymmetric windows, with their peak close to the most recent samples, allow a better compromise between estimation accuracy and delay. The window can be applied to the data in two different ways, giving rise to two LP analysis methods:

1. The autocorrelation method defines the prediction error power estimate based on the windowed speech signal

\[
\tilde{x}_n = \begin{cases} w_n \cdot x_n, & \text{for } n = 0, 1, \ldots, L-1 \\ 0, & \text{for } n \text{ outside the window} \end{cases} \tag{15.6}
\]

as

\[
\sum_{n=-\infty}^{+\infty} e_n^2 = \sum_{n=-\infty}^{+\infty} \left( \tilde{x}_n + \sum_{i=1}^{m} \hat{a}_i \tilde{x}_{n-i} \right)^2 \tag{15.7}
\]

where data windowing results in an implicit limitation of the infinite sum.

2. The covariance method defines the prediction error power estimate via a windowing of the error signal itself, without explicit data windowing:

\[
\sum_{n=m}^{L-1} (w_n \cdot e_n)^2 = \sum_{n=m}^{L-1} \left[ w_n \left( x_n + \sum_{i=1}^{m} \hat{a}_i x_{n-i} \right) \right]^2 \tag{15.8}
\]

where the summation bounds are carefully chosen to avoid the use of speech samples outside the window.

In both methods, minimization of the quadratic cost function leads to a system of linear equations for the unknown parameters $\hat{a}_i$. In the autocorrelation method, the system matrix turns out to be a proper correlation matrix with Toeplitz structure, which allows a computationally efficient solution using the order-recursive Levinson–Durbin algorithm. This algorithm reduces the operation count of the estimator from $O(m^3)$ to $O(m^2)$ and guarantees that the roots of the polynomial $\hat{A}(z)$ lie inside the unit circle. In the covariance method, no such structure can be exploited, so higher complexity and additional mechanisms for stabilizing $\hat{A}(z)$ are required. Its advantage is the significantly increased accuracy of the estimated coefficients, because the absence of explicit data windowing avoids some systematic errors of the autocorrelation method. Finally, to enhance the numerical properties of LP analysis, additional (pre-)processing steps are routinely included, such as high-pass prefiltering of the input speech to remove unwanted low-frequency components, a bandwidth expansion applied to the correlation function estimates, and a correction of the autocorrelation at lag 0 that corresponds to the addition of a weak white noise floor (at −40 dB relative to the data).
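The following sketch implements the autocorrelation method end to end for one frame: explicit data windowing (a Hamming window is an arbitrary choice here), autocorrelation estimation up to lag $m$, the lag-0 white-noise-floor correction, and solution of the Toeplitz normal equations. Instead of a hand-written Levinson–Durbin recursion, it calls scipy.linalg.solve_toeplitz, which applies the same $O(m^2)$ Levinson recursion internally.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_autocorrelation(frame, m=10, noise_floor_db=-40.0):
    """Autocorrelation-method LP analysis for one speech frame.
    Returns the A(z) coefficients [1, a_1, ..., a_m]."""
    L = len(frame)
    xw = frame * np.hamming(L)       # explicit data windowing, Eq. (15.6)
    # autocorrelation estimates for lags 0..m
    r = np.array([np.dot(xw[:L - k], xw[k:]) for k in range(m + 1)])
    # lag-0 correction: weak white noise floor for numerical robustness
    r[0] *= 1.0 + 10.0 ** (noise_floor_db / 10.0)
    # Toeplitz normal equations R a = -r, solved via Levinson recursion
    a = solve_toeplitz(r[:m], -r[1:])
    return np.concatenate(([1.0], a))

# usage on a synthetic frame from a hypothetical AR(2) source:
# the estimate should come out close to [1, -1.2, 0.5]
rng = np.random.default_rng(1)
x = lfilter([1.0], [1.0, -1.2, 0.5], rng.standard_normal(2000))
print(lp_autocorrelation(x[-240:], m=2))
```

For the covariance method, no Toeplitz structure exists; the normal equations would instead be solved with a general solver such as np.linalg.solve, together with an explicit stability check on $\hat{A}(z)$.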

15.3.3 Sinusoidal Modeling

The second implementation of Wold's decomposition emphasizes the idea of spectral modeling using sinusoids, as proposed by McAulay and Quatieri [1986]. In this model, both signal components are produced by the same sum of sinusoids. The noise-like regular component can be well approximated by such a sum (as suggested by spectral representation theory) if the relative phases of the sinusoids are randomized frequently, at least once per frame. This scheme offers the advantage of retaining the original phase information in the singular component (if the available bit rate allows us to code it, of course), and it is more flexible in combining regular and singular components. Typically, even voiced speech contains a noise-like component in the higher frequency bands, which can be modeled by using a fixed-phase model for the lower harmonics and a random-phase model for the higher harmonics. This hard switch in the frequency domain assumes the segregation of noise-like and nearly periodic signal components along the frequency axis. While this allows the modeling of mixed excitation speech signals, the transitions between excitation signal types are now a priori restricted to the frame boundaries (whereas the LPC in principle allows higher temporal resolution for the voicing switch update).
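A minimal sketch of this frequency domain switch for a single voiced frame: harmonics below a voicing cutoff keep their coded phases, while those above it receive fresh random phases every frame. All parameter values ($f_0$, amplitudes, cutoff) are illustrative.

```python
import numpy as np

def sinusoidal_frame(f0, amps, phases, cutoff_harm, n=160, fs=8000, rng=None):
    """Synthesize one frame as a sum of harmonics of f0. Harmonics below
    `cutoff_harm` use the coded phases (nearly periodic part); those at or
    above it get fresh random phases each frame (noise-like part)."""
    rng = rng if rng is not None else np.random.default_rng()
    t = np.arange(n) / fs
    x = np.zeros(n)
    for k, (a, p) in enumerate(zip(amps, phases), start=1):
        if k >= cutoff_harm:
            p = rng.uniform(0.0, 2.0 * np.pi)  # re-randomized per frame
        x += a * np.cos(2.0 * np.pi * k * f0 * t + p)
    return x

# usage: hypothetical 100-Hz voiced frame, 20 harmonics, voicing cutoff at 12
x = sinusoidal_frame(100.0, amps=1.0 / np.arange(1, 21),
                     phases=np.zeros(20), cutoff_harm=12)
```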

A further development of sinusoidal modeling is the Multi Band Excitation (MBE) coder, which allows separate voicing decisions in multiple frequency bands, thereby achieving a more accurate description of mixed excitation phenomena. This also establishes a relationship to the principles of subband coding or transform coding, which do not rely on specific source signal models and which are most popular in the coding of generic audio.

15.3.4 Harmonic + Noise Modeling

The third implementation of Wold's decomposition strives for a full realization of the additive superposition of regular and singular components, without enforcing a hard switch in either the time or the frequency domain. Thus, the HNM is perhaps the simplest in its decoder structure, which just follows Eq. (15.1), whereas its encoder is the most difficult, as it has to solve the problem of simultaneous estimation of superimposed continuous spectra (the regular component) and discrete spectra (the singular component, assumed to be harmonically related). So far, use of this model has been confined to applications in speech and audio synthesis; for speech coding, it has been recognized that the level of sophistication achieved in the HNM spectral representation needs to be complemented by a more detailed analysis of time domain variations as well, leading us beyond the conventional short-time stationary model.

15.3.5 Cyclostationary Modeling

The short-time stationary model of speech neglects the rapid time variations induced by vocal fold oscillations. For voiced speech, this nearly periodic oscillation not only serves as the main excitation of the vocal tract but also as a cyclic modulation of all signal statistics. Stochastic processes with periodic statistics are known as cyclostationary processes. For our purposes, where the fundamental period $T_0$ and the associated oscillation pattern may slowly evolve over time, we need to adapt this concept to short-time cyclostationary processes. For a given speech frame, the waveform is decomposed into a cyclic mean signal (corresponding to the singular component of Wold's decomposition) and a zero-mean noise-like process with a periodically time-varying correlation function. Most importantly, the variance of this noise-like component is a periodic function of time, representing the periodic envelope modulation of the noise-like component of voiced speech. This effect is clearly audible for lower pitched male voices and constitutes one of the main improvements of cyclostationary speech models over HNMs. These models were originally introduced as (Prototype) Waveform Interpolation (PWI) coders by Kleijn and Granzow [1991], where the cyclic mean is interpreted as the "slowly evolving waveform" and the periodically time-varying noise-like component is termed the "rapidly evolving waveform". More recent developments work with filterbanks whose channels are adapted to multiples of $f_0$ by prewarping the signal in the time domain. A key aspect of cyclostationary signal modeling is the extraction of reliable "pitch marks"

that allow explicit time domain synchronization of subsequent fundamental periods. As all speech properties evolve over time, this synchronization problem is a major challenge and leads to the characterization of speech signals in terms of an underlying self-oscillating nonlinear dynamical system, with the added benefit of explaining certain nonadditive irregularities of speech oscillations such as jitter or period-doubling phenomena.
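A sketch of the PWI-style decomposition under a strong simplifying assumption: the fundamental period is a constant integer number of samples within the analyzed segment, so consecutive cycles can be stacked and averaged directly. Real coders must instead track a slowly drifting $T_0$ via pitch marks and warp or resample each cycle before averaging.

```python
import numpy as np

def cyclic_decomposition(x, t0):
    """Split a voiced segment into a cyclic mean and a noise-like residual,
    assuming a constant integer period t0 within the segment.
    Returns (cyclic mean, residual, periodic variance profile)."""
    n_cycles = len(x) // t0
    cycles = x[:n_cycles * t0].reshape(n_cycles, t0)
    cyclic_mean = cycles.mean(axis=0)      # "slowly evolving waveform"
    residual = cycles - cyclic_mean        # "rapidly evolving waveform"
    # variance as a periodic function of time: one value per phase position,
    # capturing the periodic envelope modulation of the noise-like part
    var_profile = residual.var(axis=0)
    return cyclic_mean, residual.ravel(), var_profile
```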
