
Richard V. Cox, “Speech Coding.” 2000 CRC Press LLC, <http://www.engnetbase.com>.

Speech Coding

Richard V. Cox

AT&T Labs — Research

45.1 Introduction
  Examples of Applications • Speech Coder Attributes

45.2 Useful Models for Speech and Hearing
  The LPC Speech Production Model • Models of Human Perception for Speech Coding

45.3 Types of Speech Coders
  Model-Based Speech Coders • Time Domain Waveform-Following Speech Coders • Frequency Domain Waveform-Following Speech Coders

45.4 Current Standards
  Current ITU Waveform Signal Coders • ITU Linear Prediction Analysis-by-Synthesis Speech Coders • Digital Cellular Speech Coding Standards • Secure Voice Standards • Performance

References

45.1 Introduction

Digital speech coding is used in a wide variety of everyday applications that the ordinary person takes for granted, such as network telephony or telephone answering machines. By speech coding we mean a method for reducing the amount of information needed to represent a speech signal for transmission or storage applications. For most applications this means using a lossy compression algorithm, because a small amount of perceptible degradation is acceptable. This section reviews some of the applications, the basic attributes of speech coders, methods currently used for coding, and some of the most important speech coding standards.

45.1.1 Examples of Applications

Digital speech transmission is used in network telephony. The speech coding used is just sample-by-sample quantization. The transmission rate for most calls is fixed at 64 kilobits per second (kb/s): the speech is sampled at 8000 Hz (8 kHz), and a logarithmic 8-bit quantizer is used to represent each sample as one of 256 possible output values. International calls over transoceanic cables or satellites are often reduced in bit rate to 32 kb/s in order to boost the capacity of this relatively expensive equipment. Digital wireless transmission has already begun: in North America, Europe, and Japan there are digital cellular phone systems in operation with bit rates ranging from 6.7 to 13 kb/s for the speech coders. Secure telephony has existed since World War II, based on the first vocoder (vocoder is a contraction of the words voice coder). Secure telephony involves first converting the speech to a digital form, then digitally encrypting it, and then transmitting it. At the receiver, it is decrypted, decoded, and reconverted back to analog. Current videotelephony is accomplished through digital transmission of both the speech and the video signals. An emerging use of speech coders is for simultaneous voice and data. In these applications, users exchange data (text, images, FAX, or any other form of digital information) while carrying on a conversation.
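As a sketch of the arithmetic above: 8000 samples/s times 8 bits/sample gives 64 kb/s, and the logarithmic characteristic can be illustrated with the continuous μ-law formula (the standardized G.711 coder uses piecewise-linear segment tables, so this is an approximation, not the exact network quantizer):

```python
import numpy as np

MU = 255  # companding constant for North American log-PCM

def mulaw_encode(x, mu=MU):
    """Compress samples in [-1, 1] to 8-bit codewords (0..255)."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # compand to [-1, 1]
    return np.round((y + 1) / 2 * 255).astype(np.uint8)       # 256 output values

def mulaw_decode(code, mu=MU):
    """Expand 8-bit codewords back to samples in [-1, 1]."""
    y = code.astype(np.float64) / 255 * 2 - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

# 8000 samples/s * 8 bits/sample = 64 kb/s
x = 0.5 * np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
x_hat = mulaw_decode(mulaw_encode(x))
snr = 10 * np.log10(np.sum(x**2) / np.sum((x - x_hat)**2))
print(f"mu-law SNR: {snr:.1f} dB")  # typically in the mid-30s dB
```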

All of the above examples involve real-time conversations. Today we also use speech coders for many storage applications that make our lives easier. For example, voice mail systems and telephone answering machines allow us to leave messages for others; the called party can retrieve the message when they wish, even from halfway around the world. The same storage technology can be used to broadcast announcements to many different individuals. Another emerging use of speech coding is multimedia. Most forms of multimedia involve only one-way communications, so we include them with storage applications. Multimedia documents on computers can have snippets of speech as an integral part, and capabilities currently exist to allow users to make voice annotations onto documents stored on a personal computer (PC) or workstation.

45.1.2 Speech Coder Attributes

Speech coders have attributes that can be placed in four groups: bit rate, quality, complexity, and delay. For a given application, some of these attributes are predetermined, while tradeoffs can be made among the others. For example, the communications channel may set a limit on bit rate, or cost considerations may limit complexity. Quality can usually be improved by increasing bit rate or complexity, and sometimes by increasing delay. In the following sections, we discuss these attributes.

Primarily we will be discussing telephone bandwidth speech. This is a slightly nebulous term. In the telephone network, speech is first bandpass filtered from roughly 200 to 3200 Hz; this is often referred to as 3 kHz speech. Speech is sampled at 8 kHz in the telephone network. The usual telephone bandwidth filter rolls off to about 35 dB of attenuation by 4 kHz in order to eliminate the aliasing artifacts caused by sampling.

There is a second bandwidth of interest, referred to as wideband speech. The sampling rate is doubled to 16 kHz, and the lowpass filter is assumed to begin rolling off at 7 kHz. At the low end, the speech is assumed to be uncontaminated by line noise, and only the DC component needs to be filtered out; thus, the highpass filter cutoff frequency is 50 Hz. When we refer to wideband speech, we mean speech with a bandwidth of 50 to 7000 Hz and a sampling rate of 16 kHz. This is also referred to as 7 kHz speech.

Bit Rate

Bit rate tells us the degree of compression that the coder achieves. Telephone bandwidth speech is sampled at 8 kHz and digitized with an 8-bit logarithmic quantizer, resulting in a bit rate of 64 kb/s. For telephone bandwidth speech coders, we measure the degree of compression by how much the bit rate is lowered from 64 kb/s. International telephone network standards currently exist for coders operating from 64 kb/s down to 5.3 kb/s. The speech coders for regional cellular standards span the range from 13 to 3.45 kb/s, and those for secure telephony span the range from 16 kb/s to 800 b/s. Finally, there are proprietary speech coders in common use that span the entire range. Speech coders need not have a constant bit rate: considerable compression can be gained by not transmitting speech during the silence intervals of a conversation, nor is it necessary to keep the bit rate fixed during the talkspurts of a conversation.

Delay

The communication delay of the coder is more important for transmission than for storage applications. In real-time conversations, a large communication delay can impose an awkward protocol on talkers. Communication delays of 300 ms or greater are particularly objectionable to users, even if there are no echoes.

Most low bit rate speech coders are block coders: they encode a block of speech, also known as a frame, at a time. Speech coding delay can be allocated as follows. First, there is algorithmic delay: some coders have an amount of look-ahead or other inherent delays in addition to their frame size, and the sum of the frame size and these other inherent delays constitutes the algorithmic delay. Second, the coder requires computation; the amount of time required for this is called processing delay, and it depends on the speed of the processor used. Other delays in a complete system are the multiplexing delay and the transmission delay.
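The components simply add; a trivial sketch with hypothetical numbers:

```python
def one_way_delay_ms(frame_ms, lookahead_ms, processing_ms, transmission_ms):
    """Algorithmic delay is frame size plus look-ahead; the rest add on top."""
    algorithmic = frame_ms + lookahead_ms
    return algorithmic + processing_ms + transmission_ms

# Hypothetical figures for a 20 ms frame coder with 5 ms look-ahead
print(one_way_delay_ms(frame_ms=20, lookahead_ms=5,
                       processing_ms=10, transmission_ms=40))  # 75 ms one way
```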

Complexity

The degree of complexity is a determining factor in both the cost and the power consumption of a speech coder. Cost is almost always a factor in the selection of a speech coder for a given application, and with the advent of wireless and portable communications, power consumption has also become an important factor. Simple scalar quantizers, such as linear or logarithmic PCM, are necessary in any coding system and have the lowest possible complexity.

More complex speech coders are first simulated on host processors, then implemented on DSP chips, and may later be implemented on special purpose VLSI devices. Speed and random access memory (RAM) are the two most important contributing factors to complexity: the faster the chip or the greater the chip size, the greater the cost. Generally, 1 word of RAM takes up as much on-chip area as 4 to 6 words of read-only memory (ROM). Most speech coders are implemented on fixed point DSP chips, so one way to compare the complexity of coders is to measure their speed and memory requirements when efficiently implemented on commercially available fixed point DSP chips.

DSP chips are available in both 16-bit fixed point and 32-bit floating point versions. 16-bit fixed point DSP chips are generally preferred for dedicated speech coder implementations because they are usually less expensive and consume less power than floating point DSPs. A disadvantage of fixed-point DSP chips is that the speech coding algorithm must be implemented using 16-bit arithmetic. As part of the implementation process, a representation must be selected for each and every variable: some can be represented in a fixed format, some in block floating point, and still others may require double precision. As VLSI technology has advanced, fixed point DSP chips have come to contain a richer set of instructions for the data manipulations required to implement representations such as block floating point. The advantage of floating point DSP chips is that implementing speech coders on them is much quicker. Their arithmetic precision is about the same as that of a high level language simulation, so the steps of determining the representation of each and every variable and how these representations affect performance can be omitted.

Quality

The attribute of quality has many dimensions. Ultimately, quality is determined by how the speech sounds to a listener. Some of the factors that affect the performance of a coder are whether the input speech is clean or noisy, whether the bit stream has been corrupted by errors, and whether multiple encodings have taken place.

Speech coder quality ratings are determined by means of subjective listening tests. The listening is done in a quiet booth and may use specified telephone handsets, headphones, or loudspeakers. The speech material is presented to the listeners at specified levels and is originally prepared to have particular frequency characteristics. The most often used test is the absolute category rating (ACR) test. Subjects hear pairs of sentences and are asked to give one of the following ratings: excellent, good, fair, poor, or bad. A typical test contains a variety of different talkers and a number of different coders or reference conditions. The data resulting from this test can be analyzed in many ways. The simplest way is to assign a numerical ranking to each response, giving a 5 to the best possible rating, 4 to the next best, down to a 1 for the worst rating, and then to compute the mean rating for each of the conditions under test. This is referred to as a mean opinion score (MOS), and the ACR test is often referred to as a MOS test.
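Scoring is then simple averaging; a minimal sketch:

```python
# ACR responses mapped to numerical rankings: excellent=5 ... bad=1
ACR_SCORE = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}

def mean_opinion_score(responses):
    """Average the mapped rankings over all listeners for one condition."""
    return sum(ACR_SCORE[r] for r in responses) / len(responses)

print(mean_opinion_score(["good", "excellent", "fair", "good"]))  # 4.0
```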

There are many other dimensions to quality besides those pertaining to noiseless channels. Bit error sensitivity is one of them. For some low bit rate applications, such as secure telephones over 2.4 or 4.8 kb/s modems, it is reasonable to expect the distribution of bit errors to be random, and coders should be made robust to low random bit error rates of up to 1 to 2%. For radio channels, such as in digital cellular telephony, provision is made for additional bits to be used for channel coding to protect the information bearing bits. Errors there are more likely to occur in bursts, and the speech coder requires a mechanism to recover from an entire lost frame. This is referred to as frame erasure concealment, another aspect of quality for cellular speech coders.

For the purposes of conserving bandwidth, voice activity detectors are sometimes used with speech coders. During non-speech intervals, the speech coder bit stream is discontinued, and at the receiver "comfort noise" is injected to simulate the background acoustic noise at the encoder. This method is used in some cellular systems and also in digital speech interpolation (DSI) systems to increase the effective number of channels or circuits. Most international phone calls carried on undersea cables or satellites use DSI systems. There is some impact on quality when these techniques are used; subjective testing can determine the degree of degradation.

45.2 Useful Models for Speech and Hearing

45.2.1 The LPC Speech Production Model

Human speech is produced in the vocal tract by a combination of the vocal cords in the glottis interacting with the articulators of the vocal tract. The vocal tract can be approximated as a tube of varying diameter, and the shape of the tube gives rise to resonant frequencies called formants. Over the years, the most successful speech coding techniques have been based on linear prediction coding (LPC). The LPC model is derived from a mathematical approximation to the vocal tract representation as a variable diameter tube. The essential element of LPC is the linear prediction filter, an all-pole filter that predicts the value of the next sample from a linear combination of previous samples.

Let $x_n$ be the speech sample value at sampling instant $n$. The object is to find a set of prediction coefficients $\{a_i\}$ such that the prediction error for a frame of size $M$ is minimized:

$$\varepsilon = \sum_{m=0}^{M-1} \left( \sum_{i=1}^{I} a_i\, x_{n+m-i} + x_{n+m} \right)^{2} \tag{45.1}$$

where $I$ is the order of the linear prediction model. The prediction value for $x_n$ is given by

$$\tilde{x}_n = -\sum_{i=1}^{I} a_i\, x_{n-i} \tag{45.2}$$

The prediction error signal $\{e_n\}$, where $e_n = x_n - \tilde{x}_n$, is also referred to as the residual signal. In z-transform notation we can write

$$A(z) = 1 + \sum_{i=1}^{I} a_i\, z^{-i} \tag{45.3}$$

$1/A(z)$ is referred to as the LPC synthesis filter and (ironically) $A(z)$ is referred to as the LPC inverse filter.

LPC analysis is carried out as a block process on a frame of speech. The most often used techniques are referred to as the autocorrelation and the autocovariance methods [1]–[3]. Both methods involve inverting matrices containing correlation statistics of the speech signal. If the poles of the LPC filter are close to the unit circle, these matrices become more ill-conditioned, which means that the techniques used for inversion are more sensitive to errors caused by finite numerical precision. Techniques for dealing with this aspect of LPC analysis include windows for the data [1, 2], windows for the correlation statistics [4], and bandwidth expansion of the LPC coefficients. For forward adaptive coders, the LPC information must also be quantized and transmitted or stored. Direct quantization of the LPC coefficients is not efficient: a small quantization error in a single coefficient can render the entire LPC filter unstable, and even if the filter remains stable, too many bits are needed to provide sufficient precision. Instead, it is better to transform the LPC coefficients to another domain in which stability is more easily determined and fewer bits are required for representing the quantization levels.
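As an illustrative sketch of the autocorrelation method (not any standard's reference routine), with a Hamming data window and a typical 10th-order model:

```python
import numpy as np

def lpc_autocorrelation(frame, order=10):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns a = [1, a_1, ..., a_I] in the convention A(z) = 1 + sum a_i z^-i,
    plus the reflection coefficients obtained as a byproduct."""
    w = frame * np.hamming(len(frame))                    # data window
    n = len(w)
    r = np.correlate(w, w, mode="full")[n - 1:n + order]  # r[0] .. r[order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0] + 1e-12                                    # guard against silence
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]               # partial correlation
        k[i - 1] = -acc / err
        a[1:i + 1] = a[1:i + 1] + k[i - 1] * a[i - 1::-1] # order-i update
        err *= 1.0 - k[i - 1] ** 2
    return a, k

# 30 ms frame of a 500 Hz tone plus a little noise, sampled at 8 kHz
frame = np.sin(2 * np.pi * 500 * np.arange(240) / 8000) + 0.01 * np.random.randn(240)
a, k = lpc_autocorrelation(frame)
assert np.all(np.abs(k) < 1.0)    # magnitudes below 1 imply a stable model
```

The reflection coefficients fall out of the recursion for free, which is one reason the domain discussed next was historically attractive.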

The first such domain to be considered was the reflection coefficient domain [5]. Reflection coefficients are computed as a byproduct of LPC analysis. One of their properties is that all reflection coefficients of a stable filter have magnitudes less than 1, making stability easily verified. Direct quantization of reflection coefficients is still not efficient, because the sensitivity of the LPC filter to errors is much greater when reflection coefficients are near 1 or $-1$. More efficient quantizers have been designed by transforming the individual reflection coefficients with a nonlinearity that makes the error sensitivity more uniform. Two such nonlinear functions are the inverse sine function, $\arcsin(k_i)$, and the logarithm of the area ratio, $\log\frac{1+k_i}{1-k_i}$.
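Both nonlinearities in code form, as a trivial sketch:

```python
import numpy as np

def rc_to_arcsin(k):
    """Inverse sine transform of reflection coefficients."""
    return np.arcsin(k)

def rc_to_lar(k):
    """Log area ratio log((1 + k) / (1 - k)); it stretches the sensitive
    regions near k = +/-1 so a uniform quantizer becomes reasonable."""
    return np.log((1.0 + k) / (1.0 - k))

k = np.array([0.95, -0.4, 0.1])
print(rc_to_arcsin(k), rc_to_lar(k))
```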

A second domain that has attracted even greater interest recently is the line spectral frequency (LSF) domain [6]. The transformation is given as follows. We first use $A(z)$ to define two polynomials:

$$P(z) = A(z) + z^{-(I+1)} A(z^{-1}) \tag{45.4a}$$

$$Q(z) = A(z) - z^{-(I+1)} A(z^{-1}) \tag{45.4b}$$

These polynomials can be shown to have two useful properties: all zeroes of $P(z)$ and $Q(z)$ lie on the unit circle, and they are interlaced with each other. Thus, stability is easily checked by assuring both that the interlacing property holds and that no two zeroes are too close together. A second property is that the LSFs tend to be clustered near the formant frequencies; the closer together two LSFs are, the sharper the formant. LSFs have attracted more interest recently because they typically result in quantizers having either better representations or using fewer bits than reflection coefficient quantizers.
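A numerical sketch of Eqs. (45.4a) and (45.4b): form P and Q from the padded coefficient vector, take the angles of their roots, and discard the fixed roots at ω = 0 and ω = π (production coders use more robust Chebyshev-series searches; the tolerance below is an ad hoc choice):

```python
import numpy as np

def lsfs(a, tol=1e-3):
    """Line spectral frequencies (radians) of A(z) = 1 + sum a_i z^-i,
    with `a` the full coefficient vector [1, a_1, ..., a_I]."""
    ap = np.concatenate([a, [0.0]])              # A(z) padded to degree I + 1
    p, q = ap + ap[::-1], ap - ap[::-1]          # Eqs. (45.4a) and (45.4b)
    def positive_angles(poly):
        w = np.angle(np.roots(poly))
        return w[(w > tol) & (w < np.pi - tol)]  # drop fixed roots at 0 and pi
    return np.sort(np.concatenate([positive_angles(p), positive_angles(q)]))

a = np.array([1.0, -1.2, 0.72])                  # a stable 2nd-order example
f = lsfs(a)
assert len(f) == 2 and np.all(np.diff(f) > 0)    # I values, strictly increasing
```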

The simplest quantizers are scalar quantizers [8]. Each of the values (in whatever domain is being used to represent the LPC coefficients) is represented by one of the possible quantizer levels, and the individual values are quantized independently of each other. There may also be additional redundancy between successive frames, especially during stationary speech; in such cases, values may be quantized differentially between frames.

A more efficient, but also more complex, method of quantization is called vector quantization [9]. In this technique, the complete set of values is quantized jointly. The actual set of values is compared against all sets in a codebook using a distance metric, and the set that is nearest is selected. In practice, an exhaustive codebook search is too complex. For example, a 10-bit codebook has 1024 entries; this seems like a practical limit for most codebooks, but it does not give sufficient performance for typical 10th order LPC. A 20-bit codebook would give increased performance, but would contain over 1 million vectors — both too much storage and too much computational complexity to be practical. Instead of using large codebooks, product codes are used. In one technique, an initial codebook is searched, and the remaining error vector is then quantized by a second-stage codebook. In the second technique, the vector is subdivided and each sub-vector is quantized using its own codebook. Both of these techniques lose efficiency compared to a full-search vector quantizer, but they represent a good means of reducing computational complexity and codebook size for a given bit rate or quality.
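A sketch of the first product-code idea, two-stage VQ, using random placeholder codebooks (real codebooks are trained, for example with the generalized Lloyd/LBG algorithm); the search visits 2 × 1024 vectors instead of the 2^20 of a full 20-bit codebook:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 10
cb1 = rng.standard_normal((1024, DIM))         # 10-bit first-stage codebook
cb2 = rng.standard_normal((1024, DIM)) * 0.3   # 10-bit second-stage codebook

def nearest(codebook, v):
    """Index of the codebook entry closest to v (squared-error metric)."""
    return int(np.argmin(np.sum((codebook - v) ** 2, axis=1)))

def two_stage_vq(v):
    """Product-code VQ: quantize v, then quantize the remaining error vector."""
    i1 = nearest(cb1, v)
    i2 = nearest(cb2, v - cb1[i1])
    return i1, i2, cb1[i1] + cb2[i2]           # indices plus reconstruction

v = rng.standard_normal(DIM)
i1, i2, v_hat = two_stage_vq(v)
```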

45.2.2 Models of Human Perception for Speech Coding

Our ears have a limited dynamic range that depends on both the level and the frequency content of the input signal. The typical bandpass telephone filter has a stopband attenuation of only about 35 dB. Also, the logarithmic quantizer characteristics specified by CCITT Rec. G.711 result in a signal-to-quantization-noise ratio of about 35 dB. Is this a coincidence? Of course not! If a signal maintains an SNR of about 35 dB or greater across the telephone bandwidth, then most humans will perceive little or no noise. Conceptually, the masking property tells us that we can permit greater amounts of noise in and near the formant regions, and that noise will be most audible in the spectral valleys. If we use a coder that produces a white noise characteristic, the noise spectrum is flat, and the white noise would probably be audible in all but the formant regions.

In modern speech coders, an additional linear filter is added to weight the difference between the original speech signal and the synthesized signal. The object is to minimize the error in a space whose metric is like that of the human auditory system. If the LPC filter information is available, it constitutes the best available estimate of the speech spectrum, and it can be used to form the basis for this "perceptual weighting filter" [10]. The perceptual weighting filter is given by

$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}, \qquad 0 < \gamma_2 < \gamma_1 \le 1 \tag{45.5}$$

The perceptual weighting filter de-emphasizes the importance of noise in the formant regions and emphasizes its importance in the spectral valleys. The quantization noise will then have a spectral shape similar to that of the LPC spectral estimate, making it easier to mask.
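Since $A(z/\gamma)$ just scales each coefficient $a_i$ by $\gamma^i$, the weighting filter reduces to one short numerator and one short denominator; a minimal sketch, with γ values that are typical rather than prescribed by any particular standard:

```python
import numpy as np
from scipy.signal import lfilter

def a_of_z_over_gamma(a, gamma):
    """Coefficients of A(z/gamma): scale a_i by gamma**i."""
    return a * gamma ** np.arange(len(a))

def perceptual_weighting(x, a, gamma1=0.9, gamma2=0.6):
    """Filter a signal through W(z) = A(z/gamma1) / A(z/gamma2)."""
    return lfilter(a_of_z_over_gamma(a, gamma1),
                   a_of_z_over_gamma(a, gamma2), x)
```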

The adaptive postfilter is an additional linear filter that is combined with the synthesis filter to reduce noise in the spectral valleys [11]. Once again, the LPC synthesis filter is available as the estimate of the speech spectrum, and as in the perceptual weighting filter, the synthesis filter is modified. This idea was later extended to include a long-term (pitch) filter. A tilt-compensation filter was added to correct for the lowpass characteristic that causes a muffled sound, and a gain control strategy helps prevent any segments from being either too loud or too soft. Adaptive postfilters are now included as a part of many standards.

45.3 Types of Speech Coders

This part of the section describes a variety of speech coders that are widely used. They are divided into two categories: waveform-following coders and model-based coders. Waveform-following coders have the property that if there were no quantization error, the original speech signal would be exactly reproduced. Model-based coders are based on parametric models of speech production, and only the values of the parameters are quantized; even if there were no quantization error, the reproduced signal would not be the original speech.

45.3.1 Model-Based Speech Coders

LPC Vocoders

A block diagram of the LPC vocoder is shown in Fig. 45.1. LPC analysis is performed on a frame of speech, and the LPC information is quantized and transmitted. A voiced/unvoiced determination is made. The decision may be based on either the original speech or the LPC residual signal, but it will always be based on the degree of periodicity of the signal. If the frame is classified as unvoiced, the excitation signal is white noise. If the frame is voiced, the pitch period is transmitted and the excitation signal is a periodic pulse train. In either case, the amplitude of the output signal is selected such that its power matches that of the original speech. For more information on the LPC vocoder, the reader is referred to [12].

FIGURE 45.1: Block diagram of LPC vocoder
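A minimal synthesis sketch of the decoder just described; the 20 ms, 160-sample frame at 8 kHz and the example coefficients are assumed values:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_vocoder_frame(a, voiced, pitch_period, power, n=160):
    """Synthesize one frame: pulse-train excitation if voiced, white noise
    if unvoiced, scaled to the target power, through 1/A(z)."""
    if voiced:
        exc = np.zeros(n)
        exc[::pitch_period] = 1.0               # periodic pulse train
    else:
        exc = np.random.randn(n)                # white-noise excitation
    exc *= np.sqrt(power / (np.mean(exc ** 2) + 1e-12))  # match source power
    return lfilter([1.0], a, exc)               # LPC synthesis filter 1/A(z)

y = lpc_vocoder_frame(np.array([1.0, -1.2, 0.72]),
                      voiced=True, pitch_period=57, power=0.1)
```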

Multiband Excitation (MBE) Coders

Figure 45.2 is a block diagram of a multiband sinusoidal excitation coder. The basic premise of these coders is that the speech waveform can be modeled as a combination of harmonically related sinusoidal waveforms and narrowband noise. Within a given band, the speech is classified as periodic or aperiodic. Harmonically related sinusoids are used to generate the periodic components, and white noise is used to generate the aperiodic components. Rather than a single voiced/unvoiced decision, a frame contains a number of voiced/unvoiced decisions corresponding to the different bands. In addition, the spectral shape and gain must be transmitted to the receiver; LPC may or may not be used to quantize the spectral shape. Most often the analysis at the encoder is performed via the fast Fourier transform (FFT), and synthesis at the decoder is usually performed by a number of parallel sinusoid and white noise generators. MBE coders are model-based because they do not transmit the phase of the sinusoids, nor do they attempt to capture anything more than the energy of the aperiodic components. For more information the reader is referred to [13]–[16].

FIGURE 45.2: Block diagram of multiband excitation coder


Waveform Interpolation Coders

Figure 45.3 is a block diagram of a waveform interpolation coder. In this coder, the speech is assumed to be composed of a slowly evolving periodic waveform (SEW) and a rapidly evolving noise-like waveform (REW). A frame is analyzed first to extract a "characteristic waveform". The evolution of these waveforms is filtered to separate the REW from the SEW; REW updates are made several times more often than SEW updates. The LPC, the pitch, the spectra of the SEW and REW, and the overall energy are all transmitted independently. At the receiver, a parametric representation of the SEW and REW information is constructed, summed, and passed through the LPC synthesis filter to produce output speech. For more information the reader is referred to [17, 18].

FIGURE 45.3: Block diagram of waveform interpolation coder

45.3.2 Time Domain Waveform-Following Speech Coders

All of the time domain waveform coders described in this section include a prediction filter. We begin with the simplest.

Adaptive Differential Pulse Code Modulation (ADPCM)

Adaptive differential pulse code modulation (ADPCM) [19] is based on sample-by-sample quantization of the prediction error. A simple block diagram is shown in Fig. 45.4. Two parts of the coder may be adaptive: the quantizer step-size and/or the prediction filter; ITU Recommendations G.726 and G.727 adapt both. The adaptation may be either forward or backward. In a backward adaptive system, the adaptation is based only on the previously quantized sample values and the quantizer codewords, so the backward adaptive parameter values must be recomputed at the receiver. An important feature of such adaptation schemes is that they must use predictors that include a leakage factor, which allows the effects of erroneous values caused by channel errors to die out over time. In a forward adaptive system, the adapted values are quantized and transmitted. This additional "side information" uses bit rate, but can improve quality, and it does not require recomputation at the decoder.
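A sketch of backward adaptation in a toy 2-bit ADPCM coder; the one-tap 0.9 predictor (which doubles as the leakage factor mentioned above) and the step-size multipliers are illustrative teaching values, not the G.726/G.727 tables:

```python
import numpy as np

# Jayant-style rule: expand the step on outer (large) codes, shrink on inner
MULT = {0: 1.75, 1: 0.9, 2: 0.9, 3: 1.75}

def adpcm_encode(x, step=0.02, tap=0.9):
    codes, x_hat = [], 0.0
    for s in x:
        e = s - tap * x_hat                                # prediction error
        code = int(np.clip(np.floor(e / step) + 2, 0, 3))  # 2-bit quantizer
        eq = (code - 1.5) * step                           # quantized error
        x_hat = tap * x_hat + eq                           # local decoder state
        step = min(max(step * MULT[code], 1e-4), 0.5)      # adapt from codeword only
        codes.append(code)
    return codes

# The decoder repeats the same x_hat/step recursion from the codewords alone,
# which is why backward adaptation needs no side information.
```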

Delta Modulation Coders

In delta modulation coders [20], the quantizer is just the sign bit, and the quantization step size is adaptive. Not all the adaptation schemes used for ADPCM will work for delta modulation because the quantization is so coarse. The quality of delta modulation coders tends to be proportional to their sampling clock: the greater the sampling clock, the greater the correlation between successive samples, and the finer the quantization step size that can be used. The block diagram for delta modulation is the same as that of ADPCM.
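A sketch of an adaptive delta modulator; the grow/shrink rule here is a simplified stand-in for real schemes such as continuously variable slope delta modulation, which adapt on longer runs of equal bits:

```python
def delta_mod_encode(x, step=0.01):
    """Sign-bit quantization with an adaptive step: grow the step on runs of
    equal bits (slope overload), shrink it when bits alternate (granular noise)."""
    bits, x_hat, prev = [], 0.0, 0
    for s in x:
        b = 1 if s >= x_hat else -1                  # the quantizer is one bit
        x_hat += b * step                            # staircase approximation
        step = min(step * 1.5, 0.5) if b == prev else max(step * 0.66, 1e-4)
        prev = b
        bits.append(b)
    return bits
```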


FIGURE 45.4: ADPCM encoder and decoder block diagrams.

Adaptive Predictive Coding

The better the performance of the prediction filter, the lower the bit rate needed to encode a speech signal. This is the basis of the adaptive predictive coder [21] shown in Fig. 45.5. A forward adaptive higher order linear prediction filter is used, and the speech is quantized on a frame-by-frame basis. In this way the bit rate for the excitation can be reduced compared to an equivalent quality ADPCM coder.

FIGURE 45.5: Adaptive predictive coding encoder and decoder

Linear Prediction Analysis-by-Synthesis Speech Coders

Figure 45.6 shows a typical linear prediction analysis-by-synthesis (LPAS) speech coder [22]. Like APC, these are frame-by-frame coders, and they begin with an LPC analysis. Typically the LPC information is forward adaptive, but there are exceptions. LPAS coders borrow from ADPCM the concept of a locally available decoder: the difference between the quantized output signal and the original signal is passed through a perceptual weighting filter, possible excitation signals are considered, and the best (minimum mean square error in the perceptual domain) is selected. The long-term prediction filter removes long-term correlation (the pitch structure) in the signal. If pitch structure is present in the coder, the parameters for the long-term predictor are determined first. The most commonly used system is the adaptive codebook, in which samples from previous excitation sequences are stored; the pitch period and gain that result in the greatest reduction of perceptual error are selected, quantized, and transmitted. The fixed codebook excitation is considered next and, again, the excitation vector and gain that minimize the perceptually weighted error are selected.
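A sketch of that adaptive codebook search under simplifying assumptions: the filter states are zero (the zero-input response is assumed already subtracted from the weighted target), `past_exc` holds at least 147 samples of previous excitation, and the lag range, 40-sample subframe, and γ values are illustrative rather than taken from any standard:

```python
import numpy as np
from scipy.signal import lfilter

def adaptive_codebook_search(target, past_exc, a, g1=0.9, g2=0.6,
                             lags=range(20, 148), n=40):
    """Closed-loop pitch search: for each candidate lag, build the adaptive
    codebook vector from past excitation, pass it through the weighted
    synthesis filter W(z)/A(z), and keep the lag/gain pair minimizing the
    weighted error against `target`."""
    b1 = a * g1 ** np.arange(len(a))             # A(z/gamma1)
    b2 = a * g2 ** np.arange(len(a))             # A(z/gamma2)
    best_lag, best_gain, best_err = None, 0.0, np.inf
    for lag in lags:
        cand = np.resize(past_exc[-lag:], n)     # repeat the segment if lag < n
        y = lfilter(b1, b2, lfilter([1.0], a, cand))  # zero-state filtering
        gain = (target @ y) / (y @ y + 1e-12)    # optimal gain in closed form
        err = np.sum((target - gain * y) ** 2)
        if err < best_err:
            best_lag, best_gain, best_err = lag, gain, err
    return best_lag, best_gain
```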
