
Next Generation Mobile Systems: 3G and Beyond (Part 7)


DOCUMENT INFORMATION

Title: Multimedia Coding Technologies and Applications
Authors: Minoru Etoh, Frank Bossen, Wai Chu, Khosrow Lashkari
Affiliation: DoCoMo Labs USA
Subject: Next Generation Mobile Systems
Document type: Article
Pages: 41
File size: 484.14 KB




[Figure 7.14 diagram: an Adaptive Reliability Manager (adaptation policy, strategy selector, strategy replacement manager, event monitor) that can monitor, switch, or add fault-tolerant strategies described by fault-tolerance metadata and capabilities; the RMS and the Mervlet application each hold failure-free and recovery strategies reached through adapters (FFI, RI).]

Figure 7.14 Reliability support in AOE

by getting the current failure-free strategy first and then calling the desired method (e.g., doPost and doGet) on the strategy. Recoverable Mervlets allow the same application to have different fault-tolerance mechanisms in different contexts. For example, the Web Mail application may be configured to be more reliable for corporate e-mail than for personal e-mail. Dynamic reconfigurability support in fault tolerance is achieved by allowing the two main components, the RMS and the Recoverable Mervlet, to have different failure-free and recovery strategies, which can be set dynamically by the ARM (shown in Figure 7.14). The separation between failure-free and recovery strategies helps in developing multiple recovery strategies corresponding to a failure-free strategy. For example, in the case of the RMS, one recovery strategy may prioritize the order in which messages are recovered, while another recovery strategy may not.

In our current implementation, the adaptability in fault-tolerance support is reflected in the ability to dynamically switch server-side logging on and off depending on the current server load. Under high server load, the ARM can reconfigure the RMS to stop logging on the server side. In some cases, this results in a marked improvement in the client-perceived response time.
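As an illustration of this strategy-switching design, the sketch below shows how an adaptation manager might install a different failure-free strategy at runtime in response to load. All class and method names are hypothetical; the actual ARM/RMS interfaces are not given in this chapter.

```python
# Minimal sketch of runtime strategy switching, using hypothetical names;
# the real AOE/ARM interfaces are not specified in this chapter.
class FailureFreeStrategy:
    def handle(self, request):
        raise NotImplementedError

class LoggingStrategy(FailureFreeStrategy):
    def handle(self, request):
        print(f"log: {request}")        # server-side logging enabled
        return f"processed {request}"

class NoLoggingStrategy(FailureFreeStrategy):
    def handle(self, request):
        return f"processed {request}"   # logging suppressed under high load

class RMS:
    def __init__(self):
        self.failure_free = LoggingStrategy()

    def do_post(self, request):
        # Delegate to whichever failure-free strategy is currently installed.
        return self.failure_free.handle(request)

class AdaptiveReliabilityManager:
    """Selects the failure-free strategy based on observed server load."""
    def __init__(self, rms):
        self.rms = rms

    def on_load_change(self, load):
        # Reconfigure the RMS without stopping it (dynamic reconfigurability).
        self.rms.failure_free = NoLoggingStrategy() if load > 0.8 else LoggingStrategy()

rms = RMS()
arm = AdaptiveReliabilityManager(rms)
rms.do_post("mail-1")        # logged
arm.on_load_change(0.95)     # high load: ARM disables server-side logging
rms.do_post("mail-2")        # not logged
```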

7.7 Conclusions

The evolution of handheld devices clearly indicates that they are becoming highly relevant in users' everyday activities. Voice transmission still plays a central role, but machine-to-machine interaction is becoming important and is poised to surpass voice transmission. This data transmission is triggered by digital services running on the phone as well as on the network that allow users to access data and functionality everywhere and at any time.


This digital revolution requires a middleware infrastructure to orchestrate the services running on the handhelds, to interact with remote resources, to discover and announce data and functionality, to simplify the migration of functionality, and to simplify the development of applications. At DoCoMo Labs USA, we understand that the middleware has to be designed to take into account the issues that are specific to handheld devices and that make them different from traditional servers and workstation computers. Examples of these issues are mobility, limited resources, fault tolerance, and security.

DoCoMo Labs USA also understands that software running on handheld devices must be built in such a way that it can be dynamically modified and inspected without stopping its execution. Systems built according to this requirement are known as reflective systems. They allow inspecting their internal state, reasoning about their execution, and introducing changes whenever required. Our goal is to provide an infrastructure to construct systems that can be fully assembled at runtime and that explicitly externalize their state, logic, and architecture. We refer to these systems as completely reconfigurable systems.


of high-bandwidth connectivity at all times. Figure 8.1 illustrates the importance of media coding technologies and radio access technologies. These are complementary and orthogonal approaches for improving media quality over mobile networks. Thus, media coding technologies are essential even in the XG mobile network environment, as discussed in Chapter 1.

Speech communication has been the dominant application in the first three generations of mobile networks. 8-kHz sampling has been used for telephony with the adaptive multirate (AMR) (3GPP 1999d) speech codec (encoder and decoder) that is used in 3G networks. The 8-kHz restriction ensures interoperability with the legacy wired telephony network. If this restriction is removed and peer-to-peer communication with higher audio sampling is adopted, new media types, such as wideband speech and real-time audio, will become more widespread. Figure 8.2 illustrates existing speech and audio coding technologies with regard to usage and bitrate, where adaptive multirate wideband (AMR-WB) (ITU-T 2002) is shown as an example of wideband speech communication, and MPEG-2 of broadcast and storage media. Given 44-kHz sampling and a new type of codec that is suitable for real-time communication, low-latency hi-fi telephony can be achieved, conveying more realistic sounds between users.

Figure 8.1 Essential coding technologies

Figure 8.2 Speech and audio codecs with regard to bitrate

Video media requires a higher bandwidth in comparison with speech and audio. In the last decade, video compression technologies have evolved through the series MPEG-1, MPEG-2, MPEG-4, and H.264, which will be discussed in the following sections. Given a bandwidth of several megabits per second (Mbps), these codecs can transmit broadcast-quality video. Because of the bandwidth gap (even in XG), however, it is important to have a codec that provides better coding efficiency. Figure 8.3 summarizes the typical existing codecs and the low-rate hi-fi video codec that is required by mobile applications.

Figure 8.3 Video codecs with regard to bitrate

This chapter covers the technological progress of the last 10 years and the research directed toward more advanced coding technologies. Current technologies were designed to minimize implementation costs, such as the cost of memory, and also to be compatible with legacy hardware architectures. Moore's Law, which states that computing power doubles every 18 months, has been an important factor in codec evolution. As a result of this law, there have been significant advances in technology in the 10 years since the adoption of MPEG-2. Future coding technologies will need to incorporate advances in signal processing and LSI technologies. Additional computational complexity is the principal factor driving codec evolution. This chapter also covers mobile applications enabled by the recent progress of coding technologies. These are the TV phone, multimedia messaging services already realized in 3G, and future media-streaming services.

8.2 Speech and Audio Coding Technologies

In speech and audio coding, digitized speech or audio signals are represented with as few bits as possible, while maintaining a reasonable level of perceptual quality. This is accomplished by removing the redundancies and the irrelevancies from the signal. Although the objectives of speech and audio coding are similar, they have evolved along very different paths.

Most speech coding standards are developed to handle narrowband speech, that is, digitized speech with a sampling frequency of 8 kHz. Narrowband speech provides toll quality suitable for general-purpose communication and is interoperable with legacy wired telephony networks. Recent trends focus on wideband speech, which has a sampling frequency of 16 kHz. Wideband speech (50–7000 Hz) provides better quality and the improved intelligibility required by more-demanding applications, such as teleconferencing and multimedia services. Modern speech codecs employ source-filter models to mimic the human sound production mechanism (glottis, mouth, and lips).

The goal in audio coding is to provide a perceptually transparent reproduction, meaning that trained listeners (so-called golden ears) cannot distinguish the original source material from the compressed audio. The goal is not to faithfully reproduce the signal waveform or its spectrum but to reproduce the information that is relevant to human auditory perception. Modern audio codecs employ psychoacoustic principles to model human auditory perception. This section includes an overview of various standardized speech and audio codecs, an explanation of the relevant issues concerning the advancement of the field, and a description of the most-promising research directions.

8.2.1 Speech Coding Standards

A large number of speech coding standards have been developed over the past three decades. Generally speaking, speech codecs can be divided into three broad categories:

1. Waveform codecs using pulse code modulation (PCM), differential PCM (DPCM), or adaptive DPCM (ADPCM)

2. Parametric codecs using linear prediction coding (LPC) or mixed excitation linear prediction (MELP)

3. Hybrid codecs using variations of the code-excited linear prediction (CELP) algorithm.

This subsection describes the essence of these coding technologies and the standards that are based on them. Figure 8.4 shows the landmark standards developed for speech coding.

Figure 8.4 Evolution of speech coding standards


Waveform Codecs

Waveform codecs attempt to preserve the shape of the signal waveform and were widely used in early digital communication systems. Their operational bitrate is relatively high, which is necessary to maintain acceptable quality.

The fundamental scheme for waveform coding is PCM, which is a quantization process in which samples of the signal are quantized and represented using a fixed number of bits. This scheme has negligible complexity and delay, but a large number of bits is necessary to achieve good quality. Speech samples do not have a uniform distribution, so it is advantageous to use nonuniform quantization. ITU-T G.711 (ITU-T 1988) is a nonuniform PCM standard recommended for encoding speech signals, where the nonlinear transfer characteristics of the quantizer are fully specified. It encodes narrowband speech at 64 kbps.
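As an illustration of nonuniform quantization, the sketch below applies μ-law companding before a uniform quantizer. G.711 defines μ-law and A-law companding curves, but its segmented 8-bit quantizer tables are not reproduced here, so this is only an approximation of the idea, not a G.711 implementation.

```python
import numpy as np

def mu_law_compress(x, mu=255.0):
    """Nonuniform companding: fine resolution near zero, coarse for large samples."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255.0):
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

def quantize(y, bits=8):
    levels = 2 ** bits
    return np.round((y + 1.0) / 2.0 * (levels - 1)) / (levels - 1) * 2.0 - 1.0

# Small speech-like samples in [-1, 1]: companded 8-bit PCM keeps quiet samples accurate.
x = np.array([0.001, -0.003, 0.02, -0.3, 0.9])
x_hat = mu_law_expand(quantize(mu_law_compress(x)))
print(np.abs(x - x_hat))   # quantization error grows with signal magnitude
```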

Most speech samples are highly correlated with their neighbors, that is, the sample value at a given instant is similar to the near past and the near future. Therefore, it is possible to make predictions and remove redundancies, thereby achieving compression. DPCM and ADPCM use prediction, where the prediction error is quantized and transmitted instead of the sample itself. Figure 8.5 shows the block diagrams of a DPCM encoder and decoder. ITU-T G.726 is an ADPCM standard and incorporates a pole-zero predictor. Four operational bitrates are specified: 40, 32, 24, and 16 kbps (ITU-T 1990). The main difference between DPCM and ADPCM is that the latter uses adaptation, where the parameters of the quantizer are adjusted according to the properties of the signal. A commonly adapted element is the predictor, where changes to its parameters can greatly increase its effectiveness, leading to substantial improvement in performance.

[Figure 8.5 diagram: encoder and decoder prediction loops with a predictor, a quantizer/decoder block, the prediction error e[n], and its quantized reconstruction ê[n].]

Figure 8.5 DPCM encoder (top) and decoder (bottom). Reproduced by permission of John Wiley & Sons, Inc.
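A minimal first-order DPCM loop makes the predict-and-quantize idea concrete. The fixed predictor coefficient and 4-bit uniform quantizer below are illustrative choices rather than the pole-zero predictor and adaptive quantizer of G.726.

```python
import numpy as np

def dpcm_encode(x, a=0.9, step=0.05):
    """First-order DPCM: quantize the prediction error instead of the sample."""
    codes, x_hat_prev = [], 0.0
    for sample in x:
        pred = a * x_hat_prev                          # predict from the previous reconstruction
        e = sample - pred                              # prediction error
        code = int(np.clip(round(e / step), -8, 7))    # 4-bit uniform quantizer
        codes.append(code)
        x_hat_prev = pred + code * step                # track the decoder's reconstruction
    return codes

def dpcm_decode(codes, a=0.9, step=0.05):
    out, x_hat_prev = [], 0.0
    for code in codes:
        x_hat_prev = a * x_hat_prev + code * step
        out.append(x_hat_prev)
    return np.array(out)

t = np.arange(80)
x = 0.5 * np.sin(2 * np.pi * t / 40)          # slowly varying, highly correlated signal
y = dpcm_decode(dpcm_encode(x))
print(float(np.max(np.abs(x - y))))           # small reconstruction error despite 4 bits/sample
```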

The previously described schemes are designed for narrowband signals. The ITU-T standardized a wideband codec known as G.722 (ITU-T 1986) in 1986. It uses subband coding, where the input signal is split into two bands that are separately encoded using ADPCM. This codec can operate at bitrates of 48, 56, and 64 kbps and produces good quality for speech and general audio signals. G.722 operating at 64 kbps is often used as a reference for evaluating new codecs.

Parametric Codecs

In parametric codecs, a multiple-parameter model is used to generate speech signals. This type of codec makes no attempt to preserve the shape of the waveform, and the quality of the synthetic speech is linked to the sophistication of the model. A very successful model is based on linear prediction (LP), where a time-varying filter is used. The coefficients of the filter are derived by an LP analysis procedure (Chu 2003).

The FS-1015 linear prediction coding (LPC) algorithm, developed in the early 1980s (Tremain 1982), relies on a simple model of speech production (Figure 8.6) derived from practical observations of the properties of speech signals. Speech signals may be classified as voiced or unvoiced. Voiced signals possess a clear periodic structure in the time domain, while unvoiced signals are largely random. As a result, it is possible to use a two-state model to capture the dynamics of the underlying signal. The FS-1015 codec operates at 2.4 kbps, where the quality of the synthetic speech is considered low. The coefficients of the synthesis filter are recomputed within short time intervals, resulting in a time-varying filter.

A major shortcoming of the LPC model is that misclassification of voiced and unvoiced signals can create annoying artifacts in the synthetic speech; in fact, under many circumstances, the speech signal cannot be strictly classified. Thus, many speech coding standards developed after FS-1015 avoid the two-state model to improve the naturalness of the synthetic speech.
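To make the two-state model of Figure 8.6 concrete, the sketch below synthesizes one frame by exciting an all-pole synthesis filter with either a periodic impulse train (voiced) or white noise (unvoiced). The filter coefficients, gain, and pitch period are invented for illustration; a real LPC coder would obtain them from LP analysis of the input frame.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synthesize(lpc_coeffs, gain, voiced, pitch_period=64, frame_len=160):
    """Two-state LPC synthesis: impulse train or white noise through 1/A(z)."""
    if voiced:
        excitation = np.zeros(frame_len)
        excitation[::pitch_period] = 1.0          # periodic glottal pulses
    else:
        excitation = np.random.randn(frame_len)   # noise-like excitation
    # Synthesis filter 1/A(z) with A(z) = 1 - sum_k a_k z^-k
    a = np.concatenate(([1.0], -np.asarray(lpc_coeffs)))
    return lfilter([gain], a, excitation)

# Illustrative (not analyzed) coefficients of a stable second-order filter.
frame = lpc_synthesize(lpc_coeffs=[1.3, -0.5], gain=0.1, voiced=True)
print(frame[:5])
```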

[Figure 8.6 diagram: a white noise generator, a voiced/unvoiced switch, and a synthesis filter.]

Figure 8.6 The LPC model of speech production. Reproduced by permission of John Wiley & Sons, Inc.


[Figure: pitch and impulse response feeding a pulse shaping filter, a white noise generator feeding a noise shaping filter, and a synthesis filter controlled by filter coefficients.]

The harmonic vector-excitation codec (HVXC), which is part of the MPEG-4 standard (Nishiguchi and Edler 2002), was designed for narrowband speech and operates at either 2 or 4 kbps. This codec also supports a variable bitrate mode and can operate at bitrates below 2 kbps. The HVXC codec is based on the principles of linear prediction and, like the MELP codec, transmits the spectral shape of the excitation for voiced frames. For unvoiced frames, it employs a mechanism similar to CELP to find the best excitation.

Hybrid Codecs

Hybrid codecs combine features of waveform codecs and parametric codecs. They use a model to capture the dynamics of the signal and attempt to match the synthetic signal to the original signal in the time domain. The code-excited linear prediction (CELP) algorithm is the best representative of this family of codecs, and many standardized codecs are based on it. Among the core techniques of a CELP codec are the use of long-term and short-term linear prediction models for speech synthesis, and the incorporation of an excitation codebook containing the code to excite the synthesis filters. Figure 8.8 shows the block diagram of a basic CELP encoder, where the excitation codebook is searched in a closed-loop fashion to locate the best excitation for the synthesis filter, with the coefficients of the synthesis filter found through an open-loop procedure.

[Figure 8.8 diagram: excitation codebook, gain calculation, synthesis filter, spectral analysis, and error minimization of the synthetic speech.]

Figure 8.8 Block diagram showing the key components of a CELP encoder. Reproduced by permission of John Wiley & Sons, Inc.

The key components of a CELP bitstream are the gain, which contains the power information of the signal; the filter coefficients, which contain the local spectral information; an index to the excitation codebook, which contains information related to the excitation waveform; and the parameters of the long-term predictors, such as a pitch period and an adaptive codebook gain.

CELP codecs are best operated in the medium bitrate range of 5–15 kbps. They provide higher performance than most low-bitrate parametric codecs because the phase of the signal is partially preserved through the encoding of the excitation waveform. This technique allows a much better reproduction of plosive sounds, where strong transients exist.
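The closed-loop (analysis-by-synthesis) codebook search can be sketched as follows: each candidate excitation is passed through the synthesis filter, and the one that minimizes the error against the target speech is kept. The tiny stochastic codebook, fixed filter, and per-vector gain are illustrative simplifications; real CELP coders add an adaptive codebook for the long-term (pitch) predictor and perceptual weighting of the error.

```python
import numpy as np
from scipy.signal import lfilter

def celp_search(target, codebook, lpc_coeffs):
    """Analysis-by-synthesis: try every codevector through 1/A(z), keep the best."""
    a = np.concatenate(([1.0], -np.asarray(lpc_coeffs)))
    best = (None, None, np.inf)
    for index, codevector in enumerate(codebook):
        synthesized = lfilter([1.0], a, codevector)
        gain = np.dot(target, synthesized) / (np.dot(synthesized, synthesized) + 1e-12)
        error = np.sum((target - gain * synthesized) ** 2)
        if error < best[2]:
            best = (index, gain, error)
    return best[0], best[1]          # transmit the codebook index and the gain

rng = np.random.default_rng(0)
codebook = rng.standard_normal((64, 40))            # 64 stochastic codevectors of 40 samples
lpc = [1.3, -0.5]                                   # illustrative short-term predictor
target = lfilter([1.0], [1.0, -1.3, 0.5], codebook[17] * 0.7)   # a frame this codebook can match
index, gain = celp_search(target, codebook, lpc)
print(index, round(float(gain), 2))                 # recovers entry 17 with a gain near 0.7
```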

Standardized CELP codecs for narrowband speech include the TIA IS54 vector-sum-excited linear prediction (VSELP) codec, the FS-1016 CELP codec, the ITU-T G.729 (ITU-T 1995) conjugate-structure algebraic CELP (ACELP) codec, and the AMR codec (3GPP 1999d). For wideband speech, the best representatives are the ITU-T G.722.2 AMR-WB codec (ITU-T 2002) and the MPEG-4 version of CELP (Nishiguchi and Edler 2002).

Recent trends in CELP codec design have focused on the development of multimode codecs. They take advantage of the dynamic nature of the speech signal and adapt to the time-varying network conditions. In multimode codecs, one of several distinct coding modes is selected. There are two methods for choosing the coding modes: source control, when the decision is based on the local properties of the input speech, and network control, when the switching obeys external commands in response to network or channel conditions. An example of a source-controlled multimode codec is the TIA IS96 standard (Chu 2003), which dynamically selects one of four data rates every 20 ms, depending on speech activity. The AMR and AMR-WB standards, on the other hand, are network controlled. The AMR standard is a family of eight codecs operating at 12.2, 10.2, 7.95, 7.40, 6.70, 5.90, 5.15, and 4.75 kbps. The selectable mode vocoder (SMV) (3GPP2 2001) is both network controlled and source controlled. It is based on four codecs operating at 8.55, 4.0, 2.0, and 0.8 kbps and four network-controlled operating modes. Depending on the selected mode, a different rate-determination algorithm is used, leading to a different average bitrate.

In March 2004, the third-generation partnership project (3GPP) adopted AMR-WB+ as a codec for packet-switched streaming (PSS) audio services. AMR-WB+ is based on AMR-WB and further includes transform coded excitation (TCX) and parametric coding. It also uses an 80-ms superframe to increase coding efficiency. The coding delay is around 130 ms, and the codec is therefore not suitable for real-time two-way communication applications.

Applications and Historical Context

The FS-1015 codec was developed for secure speech over narrowband very high frequency (VHF) channels for military communication. The main goal was speech intelligibility, not quality. MELP and FS-1016 were developed for the same purpose, but with emphasis on higher speech quality. G.711 is used for digitizing speech in backbone circuit-switched telephone networks. It is also a mandatory codec for H.323 packet-based multimedia communication systems. AMR is a mandatory codec for 3G wireless networks. For this codec, the speech bitrate varies in accordance with the distance from the base station, or to mitigate electromagnetic interference. AMR was developed for improved speech quality in cellular services. G.722 is used in videoconferencing systems and multimedia, where higher audio quality is required. AMR-WB was developed for wideband speech coding in 3G networks. The increased bandwidth of wideband speech (50–7000 Hz) provides more naturalness, presence, and intelligibility. G.729 provides near toll-quality performance under clean channel conditions and was developed for mobile voice applications that are interoperable with legacy public switched telephone networks (PSTN). It is also suitable for voice over Internet protocol (VoIP).

8.2.2 Principles of Audio Coding

Simply put, speech coding models the speaker's mouth and audio coding models the listener's ear. Modern audio codecs, such as MPEG-1 (ISO/IEC 1993b) and MPEG-2 (ISO/IEC 1997, 1999), use psychoacoustic models to achieve compression. As mentioned before, the goal of audio coding is to find a compact description of the signal while maintaining good perceptual quality. Unlike speech codecs, which try to model the source of the sound (the human sound production apparatus), audio codecs try to take advantage of the way the human auditory system perceives sound. In other words, they try to model the human hearing apparatus. No unified source model exists for audio signals. In general, audio codecs employ two main principles to accomplish their task: time/frequency analysis and psychoacoustics-based quantization. Figure 8.9 shows a block diagram of a generic audio encoder.

The encoder uses a frequency-domain representation of the signal to identify the parts of the spectrum that play major roles in the perception of sound, and to eliminate the perceptually insignificant parts of the spectrum. Figure 8.10 shows the generic block diagram of the audio decoder. The following section describes the various components in these figures.

[Figure 8.9 diagram: PCM audio samples, framing, time/frequency mapping, and quantization and coding producing the encoded bitstream.]

Figure 8.9 Generic block diagram of audio encoder

[Figure 8.10 diagram: the encoded bitstream is mapped back to PCM audio samples.]

Figure 8.10 Generic block diagram of audio decoder

Time/Frequency Analysis

The time/frequency analysis module converts frames of PCM audio samples 2 ms to 50 ms long (depending on the standard) to equivalent representations in the frequency domain. The number of samples in the frame depends on the sampling frequency, which varies from 16 to 48 kHz depending on the application. For example, wideband speech uses a 16-kHz sampling frequency, CD-quality music uses 44.1 kHz, and digital audio tape (DAT) uses 48 kHz. The purpose of this operation is to map the time-domain signal into a domain where the representation is more clustered and compact. As an example, a pure tone in the time domain extends over many time samples, while in the frequency domain most of the information is concentrated in a few transform coefficients. The time/frequency analysis in modern codecs is implemented as a filter bank. The number of filters in the bank, their bandwidths, and their center frequencies depend on the coding scheme. For example, the MPEG-1 audio codec (ISO/IEC 1993b) uses 32 equally spaced subband filters. Coding efficiency depends on adequately matching the analysis filter bank to the characteristics of the input audio signal. Filter banks that emulate the analysis properties of the human auditory system, such as those that employ subbands resembling the ear's nonuniform critical bands, have been highly effective in coding nonstationary audio signals. Some codecs use time-varying filter banks that adjust to the signal characteristics. The modified discrete cosine transform (MDCT) is a very popular method for implementing effective filter banks.


Modified Discrete Cosine Transform (MDCT)

The MDCT is a linear orthogonal lapped transform, based on the idea of time-domain aliasing cancellation (TDAC) (Princen and Bradley 1987). The MDCT offers two distinct advantages: (1) it has better energy compaction properties than the FFT, representing the majority of the energy in the sequence with just a few transform coefficients; and (2) it uses overlapped samples to mitigate the artifacts arising in block transforms at the frame boundaries. Figure 8.11 illustrates this process. Let x(k), k = 0, ..., 2N − 1, represent the audio signal and w(k), k = 0, ..., 2N − 1, a window function of length 2N samples. The MDCT (Ramstat 1991) is defined as

X(m) = \sum_{k=0}^{2N-1} x(k)\, w(k)\, \cos\!\left[\frac{\pi}{N}\left(k + \frac{1}{2} + \frac{N}{2}\right)\left(m + \frac{1}{2}\right)\right], \qquad m = 0, \ldots, N-1.

Note that the MDCT uses 2N PCM samples to generate N transform values. The transform is invertible for a symmetric window w(2N − 1 − k) = w(k), as long as the window function satisfies the Princen–Bradley condition:

w(k)^2 + w(k+N)^2 = 1, \qquad k = 0, \ldots, N-1.

Windows applied to the MDCT are different from windows used for other types of signal analysis, because they must fulfill the Princen–Bradley condition. One of the reasons for this difference is that MDCT windows are applied twice, once for the MDCT and once for the inverse MDCT (IMDCT). For MP3 and MPEG-2 AAC, the following sine window is used:

w(k) = \sin\!\left[\frac{\pi}{2N}\left(k + \frac{1}{2}\right)\right].
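The formulas above translate directly into a short, unoptimized implementation. The sketch below computes the windowed MDCT and IMDCT as explicit matrix products and checks that 50% overlap-add with the sine window reconstructs the interior samples; it is an illustrative sketch, not code taken from any standard.

```python
import numpy as np

def sine_window(N):
    k = np.arange(2 * N)
    return np.sin(np.pi / (2 * N) * (k + 0.5))        # satisfies the Princen-Bradley condition

def mdct(x, w):
    N = len(x) // 2
    k = np.arange(2 * N)[None, :]
    m = np.arange(N)[:, None]
    basis = np.cos(np.pi / N * (k + 0.5 + N / 2) * (m + 0.5))
    return basis @ (w * x)                             # 2N samples -> N coefficients

def imdct(X, w):
    N = len(X)
    k = np.arange(2 * N)[:, None]
    m = np.arange(N)[None, :]
    basis = np.cos(np.pi / N * (k + 0.5 + N / 2) * (m + 0.5))
    return (2.0 / N) * w * (basis @ X)                 # N coefficients -> 2N windowed samples

N = 8
w = sine_window(N)
x = np.random.randn(4 * N)                             # three overlapping frames of length 2N
frames = [x[i * N:i * N + 2 * N] for i in range(3)]
y = np.zeros_like(x)
for i, f in enumerate(frames):
    y[i * N:i * N + 2 * N] += imdct(mdct(f, w), w)     # 50% overlap-add
print(np.allclose(x[N:3 * N], y[N:3 * N]))             # interior samples are reconstructed exactly
```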

[Figure 8.11 diagram: successive frames j, j+1, j+2, j+3 of N samples each; every MDCT spans 2N samples (0 to 2N−1, N to 3N−1, ...), so consecutive transforms overlap by N samples.]

Figure 8.11 MDCT showing 50% overlap in successive frames

Figure 8.12 shows the frequency response of the human auditory system for pure tones in a quiet environment. The vertical axis in this figure is the threshold of hearing measured in units of sound pressure level (SPL). SPL is a measure of sound pressure in decibels relative to a 20-µPa reference in air. As seen here, the ear is most sensitive to frequencies around 3.5 kHz and not very sensitive to frequencies below 300 Hz or above 10 kHz. For a 2-kHz tone to be barely audible, its level must be at least 0 dB. A 100-Hz tone, on the other hand, must have a 22-dB level to be just audible; that is, its amplitude must be ten times higher than that of the 2-kHz tone. Audio codecs take advantage of this phenomenon by maintaining the quantization noise below this audible threshold.

Figure 8.12 Sensitivity of human auditory system to single pure tones
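As a concrete illustration of coding below the audible threshold, the sketch below uses Terhardt's widely cited approximation of the threshold-in-quiet curve (not given in this chapter) to decide which spectral components could be discarded without perceptual loss.

```python
import numpy as np

def threshold_in_quiet_db(f_hz):
    """Terhardt's approximation of the absolute threshold of hearing (dB SPL)."""
    f = np.asarray(f_hz, dtype=float) / 1000.0
    return 3.64 * f**-0.8 - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2) + 1e-3 * f**4

# Spectral components given as (frequency in Hz, level in dB SPL).
components = [(100.0, 15.0), (1000.0, 5.0), (3500.0, -2.0), (12000.0, 8.0)]
for freq, level in components:
    audible = level > threshold_in_quiet_db(freq)
    print(f"{freq:7.0f} Hz at {level:5.1f} dB SPL -> {'keep' if audible else 'drop'}")
```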

Frequency Masking

The response of the auditory system is nonlinear, and the perception of a given tone is affected by the presence of other tones. The auditory channels for different tones interfere with each other, giving rise to a complex auditory response called frequency masking.

Figure 8.13 illustrates the frequency-masking phenomenon when a 60-dB, 1-kHz tone is present. Superimposed on this figure are the masking threshold curves for 1-kHz and 4-kHz tones. The masking threshold curves intersect the threshold-of-hearing curve at two points. The intersection point on the left is around 600 Hz and the intersection point on the right is around 4 kHz. This means that any tone in the masking band between 400 Hz and 4 kHz whose SPL falls below the 1-kHz masking curve will be overshadowed, or masked, by the 1-kHz tone and will not be audible. For example, a 2-kHz tone (shown in Figure 8.13) will not be audible unless it is louder than 10 dB. In particular, the masking bandwidth depends on the frequency of the masking tone and its level. This is illustrated by the frequency-masking curve for the tone at 4 kHz. As seen here, the masking bandwidth is larger for a 4-kHz tone than for a 1-kHz tone. If the masking tone is louder than 60 dB, the masking band will be wider; that is, a wider range of frequencies around 1 kHz or 4 kHz will be masked. Similarly, if the 1-kHz tone is weaker than 60 dB, the masking band will be narrower. Thus, louder tones will mask more neighboring frequencies than softer tones, which makes intuitive sense. So, ignoring (i.e., not storing or not transmitting) the frequency components in the masking band whose levels fall below the masking curve does not cause any perceptual loss.

Figure 8.13 Frequency-masking phenomenon
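The sketch below shows, in a deliberately simplified form, how a coder might raise the threshold around a strong masker and discard components that fall under it. The 20-dB offset and the 60/30 dB-per-octave slopes are illustrative values chosen only so that the 2-kHz example above lands near the 10-dB level mentioned in the text; they are not taken from any standard's psychoacoustic model.

```python
import numpy as np

def masking_threshold_db(f_hz, masker_hz=1000.0, masker_db=60.0):
    """Toy masking curve: a peak about 20 dB below the masker level, falling off
    steeply toward lower frequencies and more gently toward higher ones."""
    octaves = np.log2(np.asarray(f_hz, dtype=float) / masker_hz)
    slope = np.where(octaves < 0, 60.0, 30.0)      # dB per octave, illustrative values only
    return masker_db - 20.0 - slope * np.abs(octaves)

# A component is only worth coding if it rises above the raised threshold
# (a real model would also take the maximum with the threshold in quiet).
for freq, level in [(2000.0, 8.0), (2000.0, 20.0), (800.0, 30.0)]:
    keep = level > masking_threshold_db(freq)
    print(f"{freq:6.0f} Hz at {level:4.1f} dB SPL near a 60-dB 1-kHz masker -> {'code' if keep else 'skip'}")
```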


Figure 8.14 Temporal masking phenomenon

Figure 8.14 shows how long it takes for the auditory system to realize that there is a test tone. This delay time depends on the level of the test tone: the louder the test tone, the sooner the ear detects it. In other words, the ear thinks that the masking tone is still there, even though it has been removed.

8.2.3 Audio Coding Standards

Codec design is influenced by coding quality, application constraints (one-way versus two-way communication, playback, streaming, etc.), signal characteristics, implementation complexity, and resiliency to communication errors. For example, voice applications, such as telephony, are constrained by the requirements of natural two-way communication. This means that the maximum two-way delay should not exceed 150 ms. On the other hand, digital storage, broadcast, and streaming applications do not impose strict requirements on coding delay. This subsection reviews several audio coding standards. Figure 8.15 shows various speech and audio applications, the corresponding quality, and bitrates.

MPEG Audio Coding

The Moving Picture Experts Group (MPEG) has produced international standards for high-quality and high-compression perceptual audio coding. The activities of this standardization body have culminated in a number of successful and popular coding standards. The MPEG-1 audio standard was completed in 1992. MPEG-2 BC is a backward-compatible extension to MPEG-1 and was finalized in 1994. MPEG-2 AAC is a more efficient audio coding standard. MPEG-4 Audio includes tools for general audio coding and was issued in 1999. These standards support audio encoding for a wide range of data rates. MPEG audio standards are used in many applications. Table 8.1 summarizes the applications, sampling frequencies, and bitrates for the various MPEG audio coding standards. The following provides a brief overview of these standards.

Figure 8.15 Applications, data rates, and codecs

Table 8.1 MPEG audio coding standards

Standard | Typical applications | Sampling frequencies | Bitrates
MPEG-1 | Broadcasting, storage, multimedia, and telecommunications | 32, 44.1, 48 kHz | 32–320 kbps
MPEG-2 BC | Multichannel audio | 16, 22.05, 24, 32, 44.1, 48 kHz | 64 kbps/channel
MPEG-2 AAC | Digital television and high-quality audio | 16, 22.05, 24, 32, 44.1, 48 kHz | 48 kbps/channel
MPEG-4 AAC | Higher quality, lower latency | 8–48 kHz | 24–64 kbps/channel

MPEG-1

MPEG-1 Audio (ISO/IEC 1993b) is used in broadcasting, storage, multimedia, and telecommunications. It consists of three different codecs called Layers I, II, and III and supports bitrates from 32 to 320 kbps. The MPEG-1 audio coder takes advantage of the frequency-masking phenomenon described previously, in which parts of a signal are not audible because of the function of the human auditory system. Sampling rates of 32, 44.1, and 48 kHz are supported. Layer III (also known as MP3) is the highest-complexity mode and is optimized for encoding high-quality stereo audio at around 128 kbps. It provides near CD-quality audio and is very popular because of its combination of high quality and high compression ratio. MPEG-1 supports both fixed and variable bitrate coding.

MPEG-2 BC

MPEG-2 was developed for digital television. MPEG-2 BC is a backward-compatible extension to MPEG-1 and consists of two extensions: (1) coding at lower sampling frequencies (16, 22.05, and 24 kHz) and (2) multichannel coding, including 5.1 surround sound and multilingual content of up to seven lingual components.

MPEG-2 AAC

MPEG-2 Advanced Audio Coding (AAC) is a second-generation audio codec suitable for generic stereo and multichannel signals (e.g., 5.1 audio). MPEG-2 AAC is not backward compatible with MPEG-1 and achieves transparent stereo quality (source indistinguishable from output) at 96 kbps. AAC consists of three profiles: AAC Main, AAC Low Complexity (AAC-LC), and AAC Scalable Sample Rate (AAC-SSR).

MPEG-4 Low-Delay AAC

MPEG-4 Low-Delay AAC (AAC-LD) has a maximum algorithmic delay of 20 ms and good quality for all types of audio signals, including speech and music, which makes it suitable for two-way communication. However, unlike speech codecs, the coding quality can be increased with bitrate, because the codec is not designed around a parametric model. The quality of AAC-LD at 32 kbps is reported to be similar to that of AAC at 24 kbps. At a bitrate of 64 kbps, AAC-LD provides better quality than MP3 at the same bitrate and quality comparable to that of AAC at 48 kbps.

MPEG-4 High Efficiency AAC

MPEG-4 High Efficiency AAC (MPEG-4 HE AAC) provides high-quality audio at low bitrates. It uses spectral band replication (SBR) to achieve excellent stereo quality at 48 kbps and high quality at 32 kbps. In SBR, the full-band audio spectrum is divided into a low-band and a complementary high-band section. The low-band section is encoded using the AAC core. The high-band section is not coded directly; instead, a small amount of information about this band is transmitted so that the decoder can reconstruct the full-band audio spectrum. Figure 8.16 illustrates this process.

MPEG-4 HE AAC takes advantage of two facts to achieve this level of quality. First, the psychoacoustic importance of the high frequencies in audio is usually relatively low. Second, there is a very high correlation between the lower and the higher frequencies of an audio spectrum.
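A highly simplified sketch of the band-replication idea follows: the decoder rebuilds the high band by reusing decoded low-band content and shaping it with a transmitted coarse envelope. Real SBR works on a QMF filter-bank representation with much richer side information; the array-level version below only illustrates the principle.

```python
import numpy as np

def sbr_encode_side_info(high_band, num_envelope_bands=4):
    """Transmit only a coarse spectral envelope of the high band."""
    chunks = np.array_split(np.abs(high_band), num_envelope_bands)
    return np.array([c.mean() for c in chunks])

def sbr_reconstruct(low_band, envelope):
    """Decoder: reuse low-band content and shape it with the transmitted envelope."""
    high = np.resize(low_band, low_band.size)          # crude transposition of low-band content
    chunks = np.array_split(high, envelope.size)
    shaped = [c * (e / (np.mean(np.abs(c)) + 1e-12)) for c, e in zip(chunks, envelope)]
    return np.concatenate(shaped)

spectrum = np.abs(np.random.randn(128))                # stand-in for one frame's magnitude spectrum
low, high = spectrum[:64], spectrum[64:]
side_info = sbr_encode_side_info(high)                 # a handful of values instead of 64 bins
reconstructed_high = sbr_reconstruct(low, side_info)
print(side_info.round(2), reconstructed_high.shape)
```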


[Figure 8.16 diagram: amplitude spectrum in which the high band is reconstructed by SBR from the transmitted low band.]

Figure 8.16 Spectral band replication in the MPEG-4 HE AAC audio coder

Enhanced MPEG-4 HE AAC

Enhanced MPEG-4 HE AAC is an extension of MPEG-4 AAC and features a parametric stereo coding tool to further improve coding efficiency. The coding delay is around 130 ms and, therefore, this codec is not suitable for real-time two-way communication applications. In March 2004, the 3GPP agreed to make the Enhanced MPEG-4 HE AAC codec optional for PSS audio services.

8.2.4 Speech and Audio Coding Issues

This subsection discusses the challenges of enabling mobile hi-fi communication over XG wireless networks and the issues that existing codecs face in meeting these challenges. A low-latency hi-fi codec is desirable for high-quality multimedia communication, as shown by the dashed oval in Figure 8.2.

Hi-fi communication consists of music and speech sampled at 44.1 kHz and requires high bitrates. Compression of multimedia content requires a unified codec that can handle both speech and audio signals. None of the speech and audio codecs discussed in the previous sections satisfies the requirements of low-latency hi-fi multimedia communication. The major limitation of most speech codecs is that they are highly optimized for speech signals and therefore lack the flexibility to represent general audio signals. On the other hand, many audio codecs are designed for music distribution and streaming applications, where high delay can be tolerated. Voice communication requires low latency, rendering most audio codecs unsuitable for speech coding. Although today's codecs provide a significant improvement in coding efficiency, their quality is limited at the data rates commonly seen in wireless networks. AMR-WB provides superior speech quality at 16 kbps and has low latency, but it cannot provide high-quality audio, as its performance is optimized for speech sampled at 16 kHz, not 44.1 kHz. MPEG-4 HE AAC provides high-quality audio at 24 kbps/channel, but is suitable for broadcast applications, not low-latency communication. The low-delay version of AAC (AAC-LD) provides transparent quality at 64 kbps/channel. Even with the increases in bandwidth promised by XG, this rate is high, and more efficient codecs will be required for XG networks.

The inherent capriciousness of wireless networks, and the fact that media is often transported over unreliable channels, may result in occasional loss of media packets. This makes resiliency to packet loss a desirable feature. One of the requirements of XG is seamless communication across heterogeneous networks, devices, and access technologies. To accommodate this heterogeneity, media streams have to adapt themselves to the bandwidth and delay constraints imposed by the various technologies. Multimode or scalable codecs can fulfill this requirement. Scalability is a feature that allows the decoder to operate with partial information from the encoder and is advantageous in heterogeneous and packet-based networks, such as the Internet, where variable delay conditions may limit the availability of a portion of the bitstream. The main advantage of scalability is that it eliminates transcoding.

Enhanced multimedia services can benefit from realistic virtual experiences involving 3D sound. Present codecs lack functionality for 3D audio. Finally, high-quality playback over the small loudspeakers used in mobile devices is essential in delivering high-quality content.

8.2.5 Further Research

The following enabling technologies are needed to realize low-latency hi-fi mobile communication over XG networks:

• Unified speech and audio coding at 44.1 kHz

• Improved audio quality from small loudspeakers in mobile devices

• 3D audio functionalities on mobile devices

Generally speaking, the increase in functionality and performance of future mobile generations will come at the cost of higher complexity. The effect of Moore's Law is expected to offset that increase. The following are the specific research directions to enable the technologies mentioned above.

Unified Speech and Audio Coding

Today's mobile devices typically use two codecs: one for speech and one for audio. A unified codec is highly desirable because it greatly simplifies implementation and is more robust under most real-world conditions.

Several approaches have been proposed for unified coding. One approach is to use separate speech and audio codecs and switch between them according to the properties of the signal. The MPEG-4 standard, for example, proposes the use of a signal classification mechanism in which a speech codec and an audio codec are switched according to the properties of the signal. In particular, the HVXC standard can be used to handle speech while the harmonic and individual lines plus noise (HILN) standard is used to handle music (Herre and Purnhagen 2002; Nishiguchi and Edler 2002). Even though this combination provides reasonable quality under certain conditions, it is vulnerable to classification errors. Further research to
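To make the switched-codec approach concrete, the sketch below routes each frame to a speech-style or an audio-style coder based on a crude classification feature. The zero-crossing-rate test and the two toy "coders" are illustrative stand-ins, not the MPEG-4 classification mechanism or the HVXC/HILN tools.

```python
import numpy as np

def zero_crossing_rate(frame):
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0

def encode_speechlike(frame):
    return ("speech-codec", frame[:10])                   # stand-in for an LP-based speech coder

def encode_musiclike(frame):
    return ("audio-codec", np.fft.rfft(frame)[:10])       # stand-in for a transform audio coder

def unified_encode(frame, zcr_threshold=0.15):
    """Toy classifier: noisy or music-like frames tend to have a higher zero-crossing rate."""
    if zero_crossing_rate(frame) < zcr_threshold:
        return encode_speechlike(frame)
    return encode_musiclike(frame)

t = np.arange(160) / 8000.0
voiced_like = np.sin(2 * np.pi * 150 * t)        # strongly periodic, low zero-crossing rate
noise_like = np.random.randn(160)                # broadband, high zero-crossing rate
print(unified_encode(voiced_like)[0], unified_encode(noise_like)[0])
```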
