Figure 7.14 Reliability support in AOE (Adaptive Reliability Manager with adaptation policy, strategy selector, strategy replacement manager, and event monitor; Mervlet application; RMS with failure-free and recovery strategies; fault-tolerant strategies and fault-tolerance metadata)
by getting the current failure-free strategy first and then calling the desired method (e.g., doPost and doGet) on the strategy. Recoverable Mervlets allow the same application to have different fault-tolerance mechanisms in different contexts. For example, the Web Mail application may be configured to be more reliable for corporate e-mail than personal e-mail. Dynamic reconfigurability support in fault tolerance is achieved by allowing the two main components, the RMS and the Recoverable Mervlet, to have different failure-free and recovery strategies, which can be set dynamically by the ARM (shown in Figure 7.14). The separation between failure-free and recovery strategies helps in developing multiple recovery strategies corresponding to a failure-free strategy. For example, in the case of the RMS, one recovery strategy may prioritize the order in which messages are recovered, while another recovery strategy may not.
In our current implementation, the adaptability in fault-tolerance support is reflected in the ability to dynamically switch server-side logging on and off depending on the current server load. Under high server load, the ARM can reconfigure the RMS to stop logging on the server side. In some cases, this can result in a marked improvement in the client-perceived response time.
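To make the reconfiguration concrete, the following is a minimal Python sketch of the strategy-switching idea. The names (AdaptiveReliabilityManager, LoggingStrategy, NoLoggingStrategy, log_to_server, process) are hypothetical illustrations, not the actual AOE interfaces: the ARM swaps the RMS failure-free strategy when the observed server load crosses a threshold.

```python
def log_to_server(request):
    """Stub: persist enough request state for later recovery."""
    pass

def process(request):
    """Stub: the Mervlet's normal request handling (e.g., doPost/doGet)."""
    return f"handled {request}"

class LoggingStrategy:
    """Failure-free strategy that logs each request before processing it."""
    def handle(self, request):
        log_to_server(request)
        return process(request)

class NoLoggingStrategy:
    """Failure-free strategy that skips logging to shorten response time."""
    def handle(self, request):
        return process(request)

class AdaptiveReliabilityManager:
    """Switches the RMS between failure-free strategies based on server load."""
    def __init__(self, rms, high_load_threshold=0.8):
        self.rms = rms
        self.high_load_threshold = high_load_threshold

    def on_load_sample(self, load):
        # Under high load, drop server-side logging to improve response time
        if load > self.high_load_threshold:
            self.rms.failure_free_strategy = NoLoggingStrategy()
        else:
            self.rms.failure_free_strategy = LoggingStrategy()
```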
7.7 Conclusions
The evolution of handheld devices clearly indicates that they are becoming highly relevant
in users’ everyday activities. Voice transmission still plays a central role, but machine-to-machine interaction is becoming important and is poised to surpass voice transmission. This data transmission is triggered by digital services running on the phone as well as on the network that allow users to access data and functionality everywhere and at any time.
This digital revolution requires a middleware infrastructure to orchestrate the services running on the handhelds, to interact with remote resources, to discover and announce data and functionality, to simplify the migration of functionality, and to simplify the development of applications. At DoCoMo Labs USA, we understand that the middleware has to be designed to take into account the issues that are specific to handheld devices and that make them different from traditional servers and workstation computers. Examples of these issues are mobility, limited resources, fault tolerance, and security.
DoCoMo Labs USA also understands that software running on handheld devices must
be built in such a way that it can be dynamically modified and inspected without stopping
its execution. Systems built according to this requirement are known as reflective systems. They allow inspection of their internal state, reasoning about their execution, and the introduction of changes whenever required. Our goal is to provide an infrastructure to construct systems that can be fully assembled at runtime and that explicitly externalize their state, logic, and architecture. We refer to these systems as completely reconfigurable systems.
of high-bandwidth connectivity at all times. Figure 8.1 illustrates the importance of media coding technologies and radio access technologies. These are complementary and orthogonal approaches for improving media quality over mobile networks. Thus, media coding technologies are essential even in the XG mobile network environment, as discussed in Chapter 1.
Speech communication has been the dominant application in the first three generations
of mobile networks. 8-kHz sampling has been used for telephony with the adaptive multirate (AMR) (3GPP 1999d) speech codec (encoder and decoder) that is used in 3G networks. The 8-kHz restriction ensures the interoperability with the legacy wired telephony network. If this restriction is removed and peer-to-peer communications with higher audio sampling is adopted, new media types, such as wideband speech and real-time audio, will become more widespread. Figure 8.2 illustrates existing speech and audio coding technologies with regard to usage and bitrate, where adaptive multirate wideband (AMR-WB) (ITU-T 2002) is shown as an example of wideband speech communication, and MPEG-2 as an example of broadcast and storage media. Given 44-kHz sampling and a new type of codec that is suitable for real-time communication, low-latency hi-fi telephony can be achieved and convey more realistic sounds between users.

Figure 8.1 Essential coding technologies

Figure 8.2 Speech and audio codecs with regard to bitrate
Video media requires a higher bandwidth in comparison with speech and audio. In the last decade, video compression technologies have evolved in the series of MPEG-1, MPEG-2, MPEG-4, and H.264, which will be discussed in the following sections. Given a bandwidth of several megabits per second (Mbps), these codecs can transmit broadcast-quality video. Because of the bandwidth gap (even in XG), however, it is important to have a codec that provides better coding efficiency. Figure 8.3 summarizes the typical existing codecs and the low-rate hi-fi video codec that is required by mobile applications.

Figure 8.3 Video codecs with regard to bitrate
This chapter covers the technological progress of the last 10 years and the research directed toward more advanced coding technologies. Current technologies were designed to minimize implementation costs, such as the cost of memory, and also to be compatible with legacy hardware architectures. Moore’s Law, which states that computing power doubles every 18 months, has been an important factor in codec evolution. As a result of this law, there have been significant advances in technology in the 10 years since the adoption of MPEG-2. Future coding technologies will need to incorporate advances in signal processing and large-scale integration (LSI) technologies. Additional computational complexity is the principle driving codec evolution. This chapter also covers mobile applications enabled by the recent progress of coding technologies. These are the TV phone, multimedia messaging services already realized in 3G, and future media-streaming services.
8.2 Speech and Audio Coding Technologies
In speech and audio coding, digitized speech or audio signals are represented with as few bits
as possible, while maintaining a reasonable level of perceptual quality. This is accomplished by removing the redundancies and the irrelevancies from the signal. Although the objectives
of speech and audio coding are similar, they have evolved along very different paths. Most speech coding standards are developed to handle narrowband speech, that is, digitized speech with a sampling frequency of 8 kHz. Narrowband speech provides toll quality suitable for general-purpose communication and is interoperable with legacy wired telephony networks. Recent trends focus on wideband speech, which has a sampling frequency of 16 kHz. Wideband speech (50–7000 Hz) provides better quality and improved intelligibility required by more-demanding applications, such as teleconferencing and multimedia services. Modern speech codecs employ source-filter models to mimic the human sound production mechanism (glottis, mouth, and lips).
The goal in audio coding is to provide a perceptually transparent reproduction, meaning
that trained listeners (so-called golden ears) cannot distinguish the original source material
from the compressed audio. The goal is not to faithfully reproduce the signal waveform or its spectrum but to reproduce the information that is relevant to human auditory perception. Modern audio codecs employ psychoacoustic principles to model human auditory perception. This section includes an overview of various standardized speech and audio codecs, an explanation of the relevant issues concerning the advancement of the field, and a description of the most-promising research directions.
8.2.1 Speech Coding Standards
A large number of speech coding standards have been developed over the past three decades. Generally speaking, speech codecs can be divided into three broad categories:
1. Waveform codecs using pulse code modulation (PCM), differential PCM (DPCM), or adaptive DPCM (ADPCM)

2. Parametric codecs using linear prediction coding (LPC) or mixed excitation linear prediction (MELP)

3. Hybrid codecs using variations of the code-excited linear prediction (CELP) algorithm.

This subsection describes the essence of these coding technologies, and the standards that are based on them. Figure 8.4 shows the landmark standards developed for speech coding.
Figure 8.4 Evolution of speech coding standards
Waveform Codecs
Waveform codecs attempt to preserve the shape of the signal waveform and were widely used in early digital communication systems. Their operational bitrate is relatively high, which is necessary to maintain acceptable quality.
The fundamental scheme for waveform coding is PCM, which is a quantization process
in which samples of the signals are quantized and represented using a fixed number of bits. This scheme has negligible complexity and delay, but a large number of bits is necessary to achieve good quality. Speech samples do not have a uniform distribution, so it is advantageous to use nonuniform quantization. ITU-T G.711 (ITU-T 1988) is a nonuniform PCM standard recommended for encoding speech signals, where the nonlinear transfer characteristics of the quantizer are fully specified. It encodes narrowband speech at 64 kbps.
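As an illustration of nonuniform quantization, here is a minimal sketch using the continuous µ-law companding characteristic (µ = 255) ahead of an 8-bit uniform quantizer. G.711 itself specifies segmented µ-law and A-law encoding tables, so this is only an approximation of the idea, not the standard's exact transfer characteristic.

```python
import numpy as np

MU = 255.0  # mu-law constant used by G.711 in North America and Japan

def mulaw_compress(x):
    """Map samples in [-1, 1] to the companded domain (finer steps near zero)."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_expand(y):
    """Inverse companding applied at the decoder."""
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

# 8 bits per sample (256 levels) at 8 kHz gives the 64-kbps rate of G.711
x = np.linspace(-1.0, 1.0, 9)
y = np.round(mulaw_compress(x) * 127) / 127   # uniform quantization of the companded value
x_hat = mulaw_expand(y)
```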
Most speech samples are highly correlated with their neighbors, that is, the sample value at a given instance is similar to the near past and the near future. Therefore, it is possible to make predictions and remove redundancies, thereby achieving compression. DPCM and ADPCM use prediction, where the prediction error is quantized and transmitted instead of the sample itself. Figure 8.5 shows the block diagrams of a DPCM encoder and decoder. ITU-T G.726 is an ADPCM standard and incorporates a pole-zero predictor. Four operational bitrates are specified: 40, 32, 24, and 16 kbps (ITU-T 1990). The main difference between DPCM and ADPCM is that the latter uses adaptation, where the parameters of the quantizer are adjusted according to the properties of the signal. A commonly adapted element is the predictor, where changes to its parameters can greatly increase its effectiveness, leading to substantial improvement in performance.

Figure 8.5 DPCM encoder (top) and decoder (bottom). Reproduced by permission of John Wiley & Sons, Inc.
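A minimal sketch of the DPCM loop in Figure 8.5 follows, using a fixed first-order predictor and a uniform quantizer; the predictor coefficient and step size are illustrative, and G.726 itself uses an adaptive quantizer with a pole-zero predictor.

```python
import numpy as np

def dpcm_encode(x, a=0.9, step=0.05):
    """Quantize the prediction error e[n] = x[n] - a * x_hat[n-1]."""
    x_hat_prev = 0.0
    codes = []
    for sample in x:
        prediction = a * x_hat_prev
        e = sample - prediction
        code = int(np.round(e / step))         # uniform quantizer index (transmitted)
        x_hat_prev = prediction + code * step  # local reconstruction, mirrors the decoder
        codes.append(code)
    return codes

def dpcm_decode(codes, a=0.9, step=0.05):
    """Rebuild the signal from the quantized prediction errors."""
    x_hat_prev = 0.0
    out = []
    for code in codes:
        x_hat_prev = a * x_hat_prev + code * step
        out.append(x_hat_prev)
    return out

codes = dpcm_encode(np.sin(2 * np.pi * np.arange(80) / 40))
reconstructed = dpcm_decode(codes)
```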
The previously described schemes are designed for narrowband signals. The ITU-T standardized a wideband codec known as G.722 (ITU-T 1986) in 1986. It uses subband coding, where the input signal is split into two bands and separately encoded using ADPCM. This codec can operate at bitrates of 48, 56, and 64 kbps and produces good quality for speech and general audio signals. G.722 operating at 64 kbps is often used as a reference for evaluating new codecs.
Parametric Codecs
In parametric codecs, a multiple-parameter model is used to generate speech signals. This type of codec makes no attempt to preserve the shape of the waveform, and the quality of the synthetic speech is linked to the sophistication of the model. A very successful model is based on linear prediction (LP), where a time-varying filter is used. The coefficients of the filter are derived by an LP analysis procedure (Chu 2003).
The FS-1015 linear prediction coding (LPC) algorithm developed in the early 1980s (Tremain 1982) relies on a simple model for speech production (Figure 8.6) derived from practical observations of the properties of speech signals. Speech signals may be classified as voiced or unvoiced. Voiced signals possess a clear periodic structure in the time domain, while unvoiced signals are largely random. As a result, it is possible to use a two-state model to capture the dynamics of the underlying signal. The FS-1015 codec operates at 2.4 kbps, where the quality of the synthetic speech is considered low. The coefficients of the synthesis filter are recomputed within short time intervals, resulting in a time-varying filter.
A major shortcoming of the LPC model is that misclassification of voiced and unvoiced signals can create annoying artifacts in the synthetic speech; in fact, under many circumstances, the speech signal cannot be strictly classified. Thus, many speech coding standards developed after FS-1015 avoid the two-state model to improve the naturalness of the synthetic speech.
Figure 8.6 The LPC model of speech production. Reproduced by permission of John Wiley & Sons, Inc.
Figure 8.7 Mixed-excitation model of speech production (pitch impulse response, pulse-shaping filter, white noise generator, noise-shaping filter, synthesis filter, filter coefficients)
The harmonic vector-excitation codec (HVXC), which is part of the MPEG-4 standard (Nishiguchi and Edler 2002), was designed for narrowband speech and operates at either 2 or 4 kbps. This codec also supports a variable bitrate mode and can operate at bitrates below 2 kbps. The HVXC codec is based on the principles of linear prediction and, like the MELP codec, transmits the spectral shape of the excitation for voiced frames. For unvoiced frames, it employs a mechanism similar to CELP to find the best excitation.
Hybrid Codecs
Hybrid codecs combine features of waveform codecs and parametric codecs. They use a model to capture the dynamics of the signal, and attempt to match the synthetic signal to the original signal in the time domain. The code-excited linear prediction (CELP) algorithm is the best representative of this family of codecs, and many standardized codecs are based
on it. Among the core techniques of a CELP codec are the use of long-term and short-term linear prediction models for speech synthesis, and the incorporation of an excitation codebook containing the code to excite the synthesis filters. Figure 8.8 shows the block diagram of a basic CELP encoder, where the excitation codebook is searched in a closed-loop fashion to locate the best excitation for the synthesis filter, with the coefficients of the synthesis filter found through an open-loop procedure.

Figure 8.8 Block diagram showing the key components of a CELP encoder. Reproduced by permission of John Wiley & Sons, Inc.

The key components of a CELP bitstream are the gain, which contains the power information of the signal; the filter coefficients, which contain the local spectral information; an index to the excitation codebook, which contains information related to the excitation waveform; and the parameters of the long-term predictors, such as a pitch period and an adaptive codebook gain.
CELP codecs are best operated in the medium bitrate range of 5–15 kbps. They provide higher performance than most low-bitrate parametric codecs because the phase of the signal is partially preserved through the encoding of the excitation waveform. This technique allows a much better reproduction of plosive sounds, where strong transients exist.
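The closed-loop (analysis-by-synthesis) search can be sketched as follows: every candidate excitation vector is passed through the synthesis filter, and the entry and gain that minimize the squared error against the target subframe are retained. This sketch omits the adaptive codebook, perceptual weighting, and the algebraic codebook structure used in real CELP standards.

```python
import numpy as np
from scipy.signal import lfilter

def search_codebook(target, codebook, lpc):
    """Return (index, gain, error) of the best excitation for one subframe.

    target   -- target subframe of speech (possibly perceptually weighted)
    codebook -- candidate excitation vectors, shape (n_entries, subframe_len)
    lpc      -- short-term LP coefficients a_1..a_p of the synthesis filter 1/A(z)
    """
    a = np.concatenate(([1.0], lpc))               # denominator of 1/A(z)
    best = (-1, 0.0, np.inf)
    for idx, code in enumerate(codebook):
        synth = lfilter([1.0], a, code)            # excitation through the synthesis filter
        gain = np.dot(target, synth) / (np.dot(synth, synth) + 1e-12)
        err = np.sum((target - gain * synth) ** 2)
        if err < best[2]:
            best = (idx, gain, err)
    return best

codebook = np.random.randn(64, 40)                 # 64 candidate 40-sample excitations
target = np.random.randn(40)
index, gain, err = search_codebook(target, codebook, lpc=[-0.9, 0.2])
```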
Standardized CELP codecs for narrowband speech include the TIA IS54 vector-sum-excited linear prediction (VSELP) codec, the FS-1016 CELP codec, the ITU-T G.729 (ITU-T 1995) conjugate-structure algebraic CELP (ACELP) codec, and the AMR codec (3GPP 1999d). For wideband speech, the best representatives are the ITU-T G.722.2 AMR-WB codec (ITU-T 2002) and the MPEG-4 version of CELP (Nishiguchi and Edler 2002). Recent trends in CELP codec design have focused on the development of multimode codecs. They take advantage of the dynamic nature of the speech signal and adapt to the time-varying network conditions. In multimode codecs, one of several distinct coding modes is selected. There are two methods for choosing the coding modes: source control, when it is based on the local properties of the input speech, and network control, when the switching obeys some external commands in response to network or channel conditions. An example of a source-controlled multimode codec is the TIA IS96 standard (Chu 2003), which dynamically selects one of four data rates every 20 ms, depending on speech activity. The AMR and AMR-WB standards, on the other hand, are network controlled. The AMR standard is a family of eight codecs operating at 12.2, 10.2, 7.95, 7.40, 6.70, 5.90, 5.15, and 4.75 kbps. The selectable mode vocoder (SMV) (3GPP2 2001) is both network controlled and source controlled. It is based on four codecs operating at 8.55, 4.0, 2.0, and 0.8 kbps and four network-controlled operating modes. Depending on the selected mode, a different rate-determination algorithm is used, leading to a different average bitrate.
In March 2004, the third-generation partnership project (3GPP) adopted AMR-WB+
as a codec for packet-switched streaming (PSS) audio services. AMR-WB+ is based on AMR-WB and further includes transform coded excitation (TCX) and parametric coding. It also uses an 80-ms superframe to increase coding efficiency. The coding delay is around 130 ms and is therefore not suitable for real-time two-way communication applications.
Applications and Historical Context
The FS-1015 codec was developed for secure speech over narrowband very high frequency (VHF) channels for military communication. The main goal was speech intelligibility, not quality. MELP and FS-1016 were developed for the same purpose, but with emphasis on higher speech quality. G.711 is used for digitizing speech in backbone circuit-switched telephone networks. It is also a mandatory codec for H.323 packet-based multimedia communication systems. AMR is a mandatory codec for 3G wireless networks. For this codec, the speech bitrate varies in accordance with the distance from the base station, or to mitigate electromagnetic interference. AMR was developed for improved speech quality in cellular services. G.722 is used in videoconferencing systems and multimedia, where higher audio quality is required. AMR-WB was developed for wideband speech coding in 3G networks. The increased bandwidth of wideband speech (50–7000 Hz) provides more naturalness, presence, and intelligibility. G.729 provides near toll-quality performance under clean channel conditions and was developed for mobile voice applications that are interoperable with legacy public switched telephone networks (PSTN). It is also suitable for voice over Internet protocol (VoIP).
8.2.2 Principles of Audio Coding
Simply put, speech coding models the speaker’s mouth and audio coding models the listener’s ear. Modern audio codecs, such as MPEG-1 (ISO/IEC 1993b) and MPEG-2 (ISO/IEC 1997, 1999), use psychoacoustic models to achieve compression. As mentioned before, the goal of audio coding is to find a compact description of the signal while maintaining good perceptual quality. Unlike speech codecs that try to model the source of the sound (the human sound production apparatus), audio codecs try to take advantage of the way the human auditory system perceives sound. In other words, they try to model the human hearing apparatus. No unified source model exists for audio signals. In general, audio codecs employ two main principles to accomplish their task: time/frequency analysis and psychoacoustics-based quantization. Figure 8.9 shows a block diagram of a generic audio encoder.
The encoder uses a frequency-domain representation of the signal to identify the parts of the spectrum that play major roles in the perception of sound, and eliminate the perceptually insignificant parts of the spectrum. Figure 8.10 shows the generic block diagram of the audio decoder. The following section describes the various components in these figures.

Figure 8.9 Generic block diagram of audio encoder

Figure 8.10 Generic block diagram of audio decoder
Time/Frequency Analysis
The time/frequency analysis module converts 2-ms to 50-ms long frames of PCM audio samples (depending on the standard) to equivalent representations in the frequency domain. The number of samples in the frame depends on the sampling frequency, which varies from 16 to 48 kHz depending on the application. For example, wideband speech uses a 16-kHz sampling frequency, CD-quality music uses 44.1 kHz, and digital audio tape (DAT) uses 48 kHz. The purpose of this operation is to map the time-domain signal into a domain where the representation is more clustered and compact. As an example, a pure tone in the time domain extends over many time samples, while in the frequency domain, most of the information is concentrated in a few transform coefficients. The time/frequency analysis in modern codecs is implemented as a filter bank. The number of filters in the bank, their bandwidths, and their center frequencies depend on the coding scheme. For example, the MPEG-1 audio codec (ISO/IEC 1993b) uses 32 equally spaced subband filters. Coding efficiency depends on adequately matching the analysis filter bank to the characteristics of the input audio signal. Filter banks that emulate the analysis properties of the human auditory system, such as those that employ subbands resembling the ear’s nonuniform critical bands, have been highly effective in coding nonstationary audio signals. Some codecs use time-varying filter banks that adjust to the signal characteristics. The modified discrete cosine transform (MDCT) is a very popular method to implement effective filter banks.
Modified Discrete Cosine Transform (MDCT)
The MDCT is a linear orthogonal lapped transform, based on the idea of time-domain aliasing cancellation (TDAC) (Princen and Bradley 1987). The MDCT offers two distinct advantages: (1) it has better energy compaction properties than the FFT, representing the majority of the energy in the sequence with just a few transform coefficients; and (2) it uses overlapped samples to mitigate the artifacts arising in block transforms at the frame boundaries. Figure 8.11 illustrates this process. Let x(k), k = 0, ..., 2N − 1, represent the audio signal and w(k), k = 0, ..., 2N − 1, a window function of length 2N samples. The MDCT (Ramstat 1991) is defined as:

$$X(m) = \sum_{k=0}^{2N-1} w(k)\, x(k) \cos\!\left[\frac{\pi}{2N}\,(2k + 1 + N)(2m + 1)\right], \qquad m = 0, \ldots, N - 1.$$

Note that the MDCT uses 2N PCM samples to generate N transform values. The transform is invertible for a symmetric window w(2N − 1 − k) = w(k), as long as the window function satisfies the Princen–Bradley condition:

$$w^2(k) + w^2(k + N) = 1, \qquad k = 0, \ldots, N - 1.$$
Windows applied to the MDCT are different from windows used for other types of signal analysis, because they must fulfill the Princen–Bradley condition. One of the reasons for this difference is that MDCT windows are applied twice, once for the MDCT and once for the inverse MDCT (IMDCT). For MP3 and MPEG-2 AAC, the following sine window is used:

$$w(k) = \sin\!\left[\frac{\pi}{2N}\left(k + \frac{1}{2}\right)\right], \qquad k = 0, \ldots, 2N - 1.$$

Figure 8.11 MDCT showing 50% overlap in successive frames
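A direct-form sketch of the MDCT definition above, together with the sine window and a check of the Princen–Bradley condition (a didactic NumPy version, not an optimized implementation):

```python
import numpy as np

def mdct(x, w):
    """Direct-form MDCT of one 2N-sample block x using a 2N-sample window w."""
    two_n = len(x)
    n = two_n // 2
    k = np.arange(two_n)
    m = np.arange(n)
    phase = (np.pi / (2 * n)) * np.outer(2 * m + 1, 2 * k + 1 + n)
    return np.cos(phase) @ (w * x)                  # X(m), m = 0..N-1

N = 8
k = np.arange(2 * N)
w = np.sin(np.pi / (2 * N) * (k + 0.5))             # sine window (MP3 / AAC long blocks)
assert np.allclose(w[:N] ** 2 + w[N:] ** 2, 1.0)    # Princen-Bradley condition

X = mdct(np.random.randn(2 * N), w)                 # 16 samples in, 8 coefficients out
```

In a real codec, successive blocks overlap by 50% (Figure 8.11) and the IMDCT outputs are windowed again and overlap-added so that the time-domain aliasing cancels.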
Figure 8.12 shows the frequency response of the human auditory system for pure tones in a quiet environment. The vertical axis in this figure is the threshold of hearing measured in units of sound pressure level (SPL). SPL is a measure of sound pressure level in decibels relative to a 20-µPa reference in air. As seen here, the ear is most sensitive to frequencies around 3.5 kHz and not very sensitive to frequencies below 300 Hz or above 10 kHz. For a 2-kHz tone to be barely audible, its level must be at least 0 dB. A 100-Hz tone, on the other hand, must have a 22-dB level to be just audible, that is, its amplitude must be ten times higher than that of the 2-kHz tone. Audio codecs take advantage of this phenomenon by maintaining the quantization noise below this audible threshold.

Figure 8.12 Sensitivity of human auditory system to single pure tones
Frequency Masking
The response of the auditory system is nonlinear and the perception of a given tone is affected by the presence of other tones. The auditory channels for different tones interfere with each other, giving rise to a complex auditory response called frequency masking.

Figure 8.13 illustrates the frequency-masking phenomenon when a 60-dB, 1-kHz tone is present. Superimposed on this figure are the masking threshold curves for 1-kHz and 4-kHz tones. The masking threshold curves intersect the threshold of hearing curve at two points. The intersection point on the left is around 600 Hz and the intersection point on the right is around 4 kHz. This means that any tone in the masking band between 400 Hz and 4 kHz with an SPL that falls below the 1-kHz masking curve will be overshadowed or masked by the 1-kHz tone and will not be audible. For example, a 2-kHz tone (shown in Figure 8.13) will not be audible unless it is louder than 10 dB. In particular, the masking bandwidth depends on the frequency of the masking tone and its level. This is illustrated by the frequency-masking curve for the tone at 4 kHz. As seen here, the masking bandwidth is larger for a 4-kHz tone than for a 1-kHz tone. If the masking tone is louder than 60 dB, the masking band will be wider; that is, a wider range of frequencies around 1 kHz or 4 kHz will be masked. Similarly, if the 1-kHz tone is weaker than 60 dB, the masking band will be narrower. Thus, louder tones will mask more neighboring frequencies than softer tones, which makes intuitive sense. So, ignoring (i.e., not storing or not transmitting) the frequency components in the masking band whose levels fall below the masking curve does not cause any perceptual loss.

Figure 8.13 Frequency-masking phenomenon
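In an encoder, the practical consequence of masking is a per-bin test: spectral components whose level falls below the combined masking threshold need no bits at all. A minimal sketch, assuming the threshold has already been produced by a psychoacoustic model (the numbers below are illustrative, loosely following the 60-dB 1-kHz example above):

```python
import numpy as np

def drop_masked_bins(spectrum_db, threshold_db):
    """Keep only components that exceed the masking threshold; the rest get no bits."""
    spectrum_db = np.asarray(spectrum_db, dtype=float)
    return spectrum_db > np.asarray(threshold_db, dtype=float)

levels_db = np.array([60.0, 5.0])      # 1-kHz masker at 60 dB, 2-kHz neighbor at 5 dB
threshold_db = np.array([20.0, 10.0])  # illustrative masking threshold at those bins
keep = drop_masked_bins(levels_db, threshold_db)   # -> [True, False]: the 2-kHz bin is masked
```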
Figure 8.14 Temporal masking phenomenon
shows how long it takes for the auditory system to realize that there is a test tone. This delay time depends on the level of the test tone. The louder the test tone, the sooner the ear detects it. In other words, the ear thinks that the masking tone is still there, even though it has been removed.
8.2.3 Audio Coding Standards
Codec design is influenced by coding quality, application constraints (one-way versus two-way communication, playback, streaming, etc.), signal characteristics, implementation complexity, and resiliency to communication errors. For example, voice applications, such as telephony, are constrained by the requirements for natural two-way communication. This means that the maximum two-way delay should not exceed 150 ms. On the other hand, digital storage, broadcast, and streaming applications do not impose strict requirements on coding delay. This subsection reviews several audio coding standards. Figure 8.15 shows various speech and audio applications, the corresponding quality, and bitrates.
MPEG Audio Coding
The Moving Pictures Experts Group (MPEG) has produced international standards for high-quality and high-compression perceptual audio coding. The activities of this standardization body have culminated in a number of successful and popular coding standards. The MPEG-1 audio standard was completed in 1992. MPEG-2 BC is a backward-compatible extension to MPEG-1 and was finalized in 1994. MPEG-2 AAC is a more efficient audio coding standard. MPEG-4 Audio includes tools for general audio coding and was issued in 1999. These standards support audio encoding for a wide range of data rates. MPEG audio standards are used in many applications. Table 8.1 summarizes the applications, sampling frequencies, and the bitrates for various MPEG audio coding standards. The following provides a brief overview of these standards.

Figure 8.15 Applications, data rates, and codecs

Table 8.1 MPEG audio coding standards

Standard   | Applications                                               | Sampling frequencies            | Bitrates
MPEG-1     | Broadcasting, storage, multimedia, and telecommunications  | 32, 44.1, 48 kHz                | 32–320 kbps
MPEG-2 BC  | Multichannel audio                                         | 16, 22.05, 24, 32, 44.1, 48 kHz | 64 kbps/channel
MPEG-2 AAC | Digital television and high-quality audio                  | 16, 22.05, 24, 32, 44.1, 48 kHz | 48 kbps/channel
MPEG-4 AAC | Higher quality, lower latency                              | 8–48 kHz                        | 24–64 kbps/channel
MPEG-1
MPEG-1 Audio (ISO/IEC 1993b) is used in broadcasting, storage, multimedia, and telecommunications. It consists of three different codecs called Layers I, II, and III and supports bitrates from 32 to 320 kbps. The MPEG-1 audio coder takes advantage of the frequency-masking phenomenon described previously, in which parts of a signal are not audible because of the function of the human auditory system. Sampling rates of 32, 44.1, and 48 kHz are supported. Layer III (also known as MP3) is the highest complexity mode and is optimized for encoding high-quality stereo audio at around 128 kbps. It provides near CD-quality audio and is very popular because of its combination of high quality and high compression ratio. MPEG-1 supports both fixed and variable bitrate coding.
MPEG-2 BC
MPEG-2 was developed for digital television. MPEG-2 BC is a backward-compatible extension to MPEG-1 and consists of two extensions: (1) coding at lower sampling frequencies (16, 22.05, and 24 kHz) and (2) multichannel coding, including 5.1 surround sound and multilingual content of up to seven lingual components.
MPEG-2 AAC
MPEG-2 Advanced Audio Coding (AAC) is a second-generation audio codec suitable for generic stereo and multichannel signals (e.g., 5.1 audio). MPEG-2 AAC is not backward compatible with MPEG-1 and achieves transparent stereo quality (source indistinguishable from output) at 96 kbps. AAC consists of three profiles: AAC Main, AAC Low Complexity (AAC-LC), and AAC Scalable Sample Rate (AAC-SSR).
MPEG-4 Low-Delay AAC
MPEG-4 Low-Delay AAC (AAC-LD) has a maximum algorithmic delay of 20 ms and good quality for all types of audio signals, including speech and music, which makes it suitable for two-way communication. However, unlike speech codecs, the coding quality can be increased with bitrate, because the codec is not designed around a parametric model. The quality of AAC-LD at 32 kbps is reported to be similar to AAC at 24 kbps. At a bitrate of 64 kbps, AAC-LD provides better quality than MP3 at the same bitrate and comparable quality to that of AAC at 48 kbps.
MPEG-4 High Efficiency AAC
MPEG-4 High Efficiency AAC (MPEG-4 HE AAC) provides high-quality audio at low bitrates. It uses spectral band replication (SBR) to achieve excellent stereo quality at 48 kbps and high quality at 32 kbps. In SBR, the full-band audio spectrum is divided into a low-band and a complementary high-band section. The low-band section is encoded using the AAC core. The high-band section is not coded directly; instead, a small amount of information about this band is transmitted so that the decoder can reconstruct the full-band audio spectrum. Figure 8.16 illustrates this process.

MPEG-4 HE AAC takes advantage of two facts to achieve this level of quality. First, the psychoacoustic importance of the high frequencies in audio is usually relatively low. Second, there is a very high correlation between the lower and the higher frequencies of an audio spectrum.
Figure 8.16 Spectral band replication in MPEG-4 HE AAC audio coder
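A crude sketch of the SBR idea at the decoder follows: the missing high band is rebuilt by transposing the decoded low band upward and shaping it with a small set of transmitted envelope gains. The function and parameter names are illustrative; real SBR operates on QMF subband samples and carries much richer side information.

```python
import numpy as np

def sbr_reconstruct(low_band, envelope_gains):
    """Replicate the low band into the high band and shape it with envelope gains."""
    low_band = np.asarray(low_band, dtype=float)
    chunks = np.array_split(low_band.copy(), len(envelope_gains))
    high_band = np.concatenate([g * c for g, c in zip(envelope_gains, chunks)])
    return np.concatenate([low_band, high_band])

low = np.array([1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])     # decoded by the AAC core
full_spectrum = sbr_reconstruct(low, envelope_gains=[0.5, 0.25])
```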
Enhanced MPEG-4 HE AAC
Enhanced MPEG-4 HE AAC is an extension of MPEG-4 AAC and features a parametric stereo coding tool to further improve coding efficiency. The coding delay is around 130 ms and, therefore, this codec is not suitable for real-time two-way communication applications. In March 2004, the 3GPP agreed on making the enhanced MPEG-4 HE AAC codec optional for PSS audio services.
8.2.4 Speech and Audio Coding Issues
This subsection discusses the challenges for enabling mobile hi-fi communication over XG wireless networks and the issues of the existing codecs in meeting these challenges. A low-latency hi-fi codec is desirable for high-quality multimedia communication, as shown in the dashed oval in Figure 8.2.
Hi-fi communication consists of music and speech sampled at 44.1 kHz and requires high bitrates. Compression of multimedia content requires a unified codec that can handle both speech and audio signals. None of the speech and audio codecs discussed in the previous sections satisfy the requirements of low-latency hi-fi multimedia communication. The major limitation of most speech codecs is that they are highly optimized for speech signals and therefore lack the flexibility to represent general audio signals. On the other hand, many audio codecs are designed for music distribution and streaming applications, where high delay can be tolerated. Voice communication requires low latency, rendering most audio codecs unsuitable for speech coding. Although today’s codecs provide a significant improvement in coding efficiency, their quality is limited at the data rates commonly seen in wireless networks. AMR-WB provides superior speech quality at 16 kbps and has low latency, but it cannot provide high-quality audio as its performance is optimized for speech sampled at 16 kHz, not 44.1 kHz. MPEG-4 HE AAC provides high-quality audio at 24 kbps/channel, but is suitable for broadcast applications, not low-latency communication. The low-delay version of AAC (AAC-LD) provides transparent quality at 64 kbps/channel. Even with the increases in bandwidth promised by XG, this rate is high and more efficient codecs will be required for XG networks.
The inherent capriciousness of wireless networks, and the fact that media is often transported over unreliable channels, may result in occasional loss of media packets. This makes resiliency to packet loss a desirable feature. One of the requirements of XG is seamless communication across heterogeneous networks, devices, and access technologies. To accommodate this heterogeneity, media streams have to adapt themselves to the bandwidth and delay constraints imposed by the various technologies. Multimode or scalable codecs can fulfill this requirement. Scalability is a feature that allows the decoder to operate with partial information from the encoder and is advantageous in heterogeneous and packet-based networks, such as the Internet, where variable delay conditions may limit the availability of a portion of the bitstream. The main advantage of scalability is that it eliminates transcoding.

Enhanced multimedia services can benefit from realistic virtual experiences involving 3D sound. Present codecs lack functionality for 3D audio. Finally, high-quality playback over the small loudspeakers used in mobile devices is essential in delivering high-quality content.
8.2.5 Further Research
The following enabling technologies are needed to realize low-latency hi-fi mobile communication over XG networks.

• Unified speech and audio coding at 44.1 kHz
• Improved audio quality from small loudspeakers in mobile devices
• 3D audio functionalities on mobile devices
Generally speaking, the increase in functionality and performance of future mobile generations will be at the cost of higher complexity. The effect of Moore’s Law is expected to offset that increase. The following are the specific research directions to enable the technologies mentioned above.
Unified Speech and Audio Coding
Today’s mobile devices typically use two codecs: one for speech and one for audio. A unified codec is highly desirable because it greatly simplifies implementation and is more robust under most real-world conditions.
Several approaches have been proposed for unified coding. One approach is to use separate speech and audio codecs and switch between them according to the properties of the signal. The MPEG-4 standard, for example, proposes the use of a signal classification mechanism in which a speech codec and an audio codec are switched according to the property of the signal. In particular, the HVXC standard can be used to handle speech while the harmonic and individual lines plus noise (HILN) standard is used to handle music (Herre and Purnhagen 2002; Nishiguchi and Edler 2002). Even though this combination provides reasonable quality under certain conditions, it is vulnerable to classification errors. Further research to