A Guide to MPEG Fundamentals and Protocol Analysis (Including DVB and ATSC)
Contents

Section 1 Introduction to MPEG
1.1 Convergence
1.2 Why compression is needed
1.3 Applications of compression
1.4 Introduction to video compression
1.5 Introduction to audio compression
1.6 MPEG signals
1.7 Need for monitoring and analysis
1.8 Pitfalls of compression

Section 2 Compression in Video
2.1 Spatial or temporal coding?
2.2 Spatial coding
2.3 Weighting
2.4 Scanning
2.5 Entropy coding
2.6 A spatial coder
2.7 Temporal coding
2.8 Motion compensation
2.9 Bidirectional coding
2.10 I, P, and B pictures
2.11 An MPEG compressor
2.12 Preprocessing
2.13 Profiles and levels
2.14 Wavelets

Section 3 Audio Compression
3.1 The hearing mechanism
3.2 Subband coding
3.3 MPEG Layer 1
3.4 MPEG Layer 2
3.5 Transform coding
3.6 MPEG Layer 3
3.7 AC-3

Section 4 Elementary Streams
4.1 Video elementary stream syntax
4.2 Audio elementary streams

Section 5 Packetized Elementary Streams (PES)
5.1 PES packets
5.2 Time stamps
5.3 PTS/DTS

Section 6 Program Streams
6.1 Recording vs. transmission
6.2 Introduction to program streams

Section 7 Transport Streams
7.1 The job of a transport stream
7.2 Packets
7.3 Program Clock Reference (PCR)
7.4 Packet Identification (PID)
7.5 Program Specific Information (PSI)

Section 8 Introduction to DVB/ATSC
8.1 An overall view
8.2 Remultiplexing
8.3 Service Information (SI)
8.4 Error correction
8.5 Channel coding
8.6 Inner coding
8.7 Transmitting digits

Section 9 MPEG Testing
9.1 Testing requirements
9.2 Analyzing a transport stream
9.3 Hierarchic view
9.4 Interpreted view
9.5 Syntax and CRC analysis
9.6 Filtering
9.7 Timing analysis
9.8 Elementary stream testing
9.9 Sarnoff compliant bit streams
9.10 Elementary stream analysis
9.11 Creating a transport stream
9.12 Jitter generation
9.13 DVB tests

Glossary
SECTION 1
INTRODUCTION TO MPEG

MPEG is one of the most popular audio/video compression techniques because it is not just a single standard. Instead, it is a range of standards suitable for different applications but based on similar principles. MPEG is an acronym for the Moving Picture Experts Group, which was set up by the ISO (International Standards Organization) to work on compression.

MPEG can be described as the interaction of acronyms. As ETSI stated, "The CAT is a pointer to enable the IRD to find the EMMs associated with the CA system(s) that it uses." If you can understand that sentence, you don't need this book.
1.1 Convergence
Digital techniques have made rapid progress in audio and video for a number of reasons. Digital information is more robust and can be coded to substantially eliminate error. This means that generation loss in recording and losses in transmission are eliminated. The Compact Disc was the first consumer product to demonstrate this.

While the CD has an improved sound quality with respect to its vinyl predecessor, comparison of quality alone misses the point. The real point is that digital recording and transmission techniques allow content manipulation to a degree that is impossible with analog. Once audio or video are digitized, they become data. Such data cannot be distinguished from any other kind of data; therefore, digital video and audio become the province of computer technology.

The convergence of computers and audio/video is an inevitable consequence of the key inventions of computing and Pulse Code Modulation. Digital media can store any type of information, so it is easy to utilize a computer storage device for digital video. The nonlinear workstation was the first example of an application of convergent technology that did not have an analog forerunner. Another example, multimedia, mixed the storage of audio, video, graphics, text and data on the same medium. Multimedia is impossible in the analog domain.
1.2 Why compression is needed
The initial success of digital video was in post-production applications, where the high cost of digital video was offset by its limitless layering and effects capability. However, production-standard digital video generates over 200 megabits per second of data, and this bit rate requires extensive capacity for storage and wide bandwidth for transmission. Digital video could only be used in wider applications if the storage and bandwidth requirements could be eased; easing these requirements is the purpose of compression.

Compression is a way of expressing digital audio and video by using less data. Compression has the following advantages:

A smaller amount of storage is needed for a given amount of source material. With high-density recording, such as with tape, compression allows highly miniaturized equipment for consumer and Electronic News Gathering (ENG) use. The access time of tape improves with compression because less tape needs to be shuttled to skip over a given amount of program. With expensive storage media such as RAM, compression makes new applications affordable.

When working in real time, compression reduces the bandwidth needed. Additionally, compression allows faster-than-real-time transfer between media, for example, between tape and disk.

A compressed recording format can afford a lower recording density, and this can make the recorder less sensitive to environmental factors and maintenance.
1.3 Applications of compression
Compression has a long association with television. Interlace is a simple form of compression giving a 2:1 reduction in bandwidth. The use of color-difference signals instead of GBR is another form of compression. Because the eye is less sensitive to color detail, the color-difference signals need less bandwidth. When color broadcasting was introduced, the channel structure of monochrome had to be retained, and composite video was developed. Composite video systems, such as PAL, NTSC and SECAM, are forms of compression because they use the same bandwidth for color as was used for monochrome.

Figure 1.1a shows that in traditional television systems, the GBR camera signal is converted to Y, Pr, Pb components for production and encoded into analog composite for transmission. Figure 1.1b shows the modern equivalent. The Y, Pr, Pb signals are digitized and carried as Y, Cr, Cb signals in SDI form through the production process prior to being encoded with MPEG for transmission. Clearly, MPEG can be considered by the broadcaster as a more efficient replacement for composite video. In addition, MPEG has greater flexibility because the bit rate required can be adjusted to suit the application. At lower bit rates and resolutions, MPEG can be used for video conferencing and video telephones.

DVB and ATSC (the European- and American-originated digital-television broadcasting standards) would not be viable without compression because the bandwidth required would be too great. Compression extends the playing time of DVD (digital video/versatile disc), allowing full-length movies on a standard size compact disc. Compression also reduces the cost of Electronic News Gathering and other contributions to television production.

In tape recording, mild compression eases tolerances and adds reliability in Digital Betacam and Digital-S, whereas in SX, DVC, DVCPRO and DVCAM, the goal is miniaturization. In magnetic disk drives, such as the Tektronix Profile® storage system, that are used in file servers and networks (especially for news purposes), compression lowers storage cost. Compression also lowers bandwidth, which allows more users to access a given server. This characteristic is also important for VOD (Video On Demand) applications.
1.4 Introduction to video compression
In all real program material, there are two types of components of the signal: those which are novel and unpredictable and those which can be anticipated. The novel component is called entropy and is the true information in the signal. The remainder is called redundancy because it is not essential. Redundancy may be spatial, as it is in large plain areas of picture where adjacent pixels have almost the same value. Redundancy can also be temporal, as it is where similarities between successive pictures are used. All compression systems work by separating the entropy from the redundancy in the encoder. Only the entropy is recorded or transmitted, and the decoder computes the redundancy from the transmitted signal. Figure 1.2a shows this concept.

An ideal encoder would extract all the entropy and only this would be transmitted to the decoder. An ideal decoder would then reproduce the original signal. In practice, this ideal cannot be reached. An ideal coder would be complex and cause a very long delay in order to use temporal redundancy. In certain applications, such as recording or broadcasting, some delay is acceptable, but in videoconferencing it is not. In some cases, a very complex coder would be too expensive. It follows that there is no one ideal compression system.

In practice, a range of coders is needed which have a range of processing delays and complexities. The power of MPEG is that it is not a single compression format, but a range of standardized coding tools that can be combined flexibly to suit a range of applications. The way in which coding has been performed is included in the compressed data so that the decoder can automatically handle whatever the coder decided to do.

MPEG coding is divided into several profiles that have different complexity, and each profile can be implemented at a different level depending on the resolution of the input picture. Section 2 considers profiles and levels in detail.

There are many different digital video formats and each has a different bit rate. For example, a high definition system might have six times the bit rate of a standard definition system. Consequently, just knowing the bit rate out of the coder is not very useful. What matters is the compression factor, which is the ratio of the input bit rate to the compressed bit rate, for example 2:1, 5:1, and so on.

Unfortunately, the number of variables involved makes it very difficult to determine a suitable compression factor. Figure 1.2a shows that for an ideal coder, if all of the entropy is sent, the quality is good. However, if the compression factor is increased in order to reduce the bit rate, not all of the entropy is sent and the quality falls. Note that in a compressed system, when quality loss occurs, it is steep (Figure 1.2b). If the available bit rate is inadequate, it is better to avoid this area by reducing the entropy of the input picture. This can be done by filtering. The loss of resolution caused by the filtering is subjectively more acceptable than the artifacts that would otherwise be produced.
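The compression factor itself is simple arithmetic, as the short Python sketch below shows for Rec. 601 standard-definition video; the 4 Mb/s output rate used here is only an assumed example of a typical broadcast target.

# Compression-factor arithmetic for Rec. 601 standard definition.
LUMA_RATE = 13_500_000            # Rec. 601 luminance sampling rate, Hz
BITS_PER_SAMPLE = 8
# 4:2:2 adds two color-difference samples for every two luma samples,
# doubling the total sample rate.
uncompressed = LUMA_RATE * 2 * BITS_PER_SAMPLE    # 216,000,000 bits/s

compressed = 4_000_000            # assumed coder output, bits/s
print("compression factor = %.0f:1" % (uncompressed / compressed))  # 54:1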
Figure 1.1. (a) A traditional system: GBR camera signals are matrixed to Y, Pr, Pb components for production and composite-encoded (PAL, NTSC or SECAM) for analog transmission. (b) The modern equivalent: Y, Pr, Pb is digitized, carried as Y, Cr, Cb in SDI form through production, and MPEG-coded for compressed digital transmission.
To identify the entropy perfectly, an ideal compressor would have to be extremely complex. A practical compressor may be less complex for economic reasons and must send more data to be sure of carrying all of the entropy. Figure 1.2b shows the relationship between coder complexity and performance. The higher the compression factor required, the more complex the encoder has to be.

The entropy in video signals varies. A recording of an announcer delivering the news has much redundancy and is easy to compress. In contrast, it is more difficult to compress a recording with leaves blowing in the wind, or one of a football crowd that is constantly moving and therefore has less redundancy (more information or entropy). In either case, if all the entropy is not sent, there will be quality loss. Thus, we may choose between a constant bit-rate channel with variable quality or a constant quality channel with variable bit rate. Telecommunications network operators tend to prefer a constant bit rate for practical purposes, but a buffer memory can be used to average out entropy variations if the resulting increase in delay is acceptable. In recording, a variable bit rate may be easier to handle, and DVD uses variable bit rate, speeding up the disc where difficult material exists.
Intra-coding (intra = within) is a technique that exploits spatial redundancy, or redundancy within the picture; inter-coding (inter = between) is a technique that exploits temporal redundancy. Intra-coding may be used alone, as in the JPEG standard for still pictures, or combined with inter-coding as in MPEG.

Intra-coding relies on two characteristics of typical images. First, not all spatial frequencies are simultaneously present, and second, the higher the spatial frequency, the lower the amplitude is likely to be. Intra-coding requires analysis of the spatial frequencies in an image. This analysis is the purpose of transforms such as wavelets and DCT (discrete cosine transform). Transforms produce coefficients which describe the magnitude of each spatial frequency. Typically, many coefficients will be zero, or nearly zero, and these coefficients can be omitted, resulting in a reduction in bit rate.

Inter-coding relies on finding similarities between successive pictures. If a given picture is available at the decoder, the next picture can be created by sending only the picture differences. The picture differences will be increased when objects move, but this increase can be offset by using motion compensation, since a moving object does not generally change its appearance very much from one picture to the next. If the motion can be measured, a closer approximation to the current picture can be created by shifting part of the previous picture to a new location. The shifting process is controlled by a vector that is transmitted to the decoder. The vector transmission requires less data than sending the picture-difference data.
MPEG can handle both interlaced and non-interlaced images. An image at some point on the time axis is called a "picture," whether it is a field or a frame. Interlace is not ideal as a source for digital compression because it is in itself a compression technique. Temporal coding is made more complex because pixels in one field are in a different position to those in the next.

Motion compensation minimizes but does not eliminate the differences between successive pictures. The picture difference is itself a spatial image and can be compressed using transform-based intra-coding as previously described. Motion compensation simply reduces the amount of data in the difference image.

The efficiency of a temporal coder rises with the time span over which it can act. Figure 1.2c shows that if a high compression factor is required, a longer time span in the input must be considered and thus a longer coding delay will be experienced. Clearly, temporally coded signals are difficult to edit because the content of a given output picture may be based on image data which was transmitted some time earlier. Production systems will have to limit the degree of temporal coding to allow editing, and this limitation will in turn limit the available compression factor.
Figure 1.2. (a) An ideal coder sends only the entropy of the PCM video; a non-ideal coder has to send more, and a short-delay coder even more. (b) Quality versus coder complexity. (c) Quality versus latency.
1.5 Introduction to audio compression
The bit rate of a PCM digital audio channel is only about one megabit per second, which is about 0.5% of 4:2:2 digital video. With mild video compression schemes, such as Digital Betacam, audio compression is unnecessary. But, as the video compression factor is raised, it becomes necessary to compress the audio as well.

Audio compression takes advantage of two facts. First, in typical audio signals, not all frequencies are simultaneously present. Second, because of the phenomenon of masking, human hearing cannot discern every detail of an audio signal. Audio compression splits the audio spectrum into bands by filtering or transforms, and includes less data when describing bands in which the level is low. Where masking prevents or reduces audibility of a particular band, even less data needs to be sent.

Audio compression is not as easy to achieve as is video compression because of the acuity of hearing. Masking only works properly when the masking and the masked sounds coincide spatially. Spatial coincidence is always the case in mono recordings but not in stereo recordings, where low-level signals can still be heard if they are in a different part of the soundstage. Consequently, in stereo and surround sound systems, a lower compression factor is allowable for a given quality. Another factor complicating audio compression is that delayed resonances in poor loudspeakers actually mask compression artifacts. Testing a compressor with poor speakers gives a false result, and signals which are apparently satisfactory may be disappointing when heard on good equipment.

1.6 MPEG signals
The output of a single MPEG audio or video coder is called an Elementary Stream. An Elementary Stream is an endless near real-time signal. For convenience, it can be broken into convenient-sized data blocks in a Packetized Elementary Stream (PES). These data blocks need header information to identify the start of the packets and must include time stamps because packetizing disrupts the time axis.

Figure 1.3 shows that one video PES and a number of audio PES can be combined to form a Program Stream, provided that all of the coders are locked to a common clock. Time stamps in each PES ensure lip-sync between the video and audio.

Program Streams have variable-length packets with headers. They find use in data transfers to and from optical and hard disks, which are error free and in which files of arbitrary sizes are expected. DVD uses Program Streams.

For transmission and digital broadcasting, several programs and their associated PES can be multiplexed into a single Transport Stream. A Transport Stream differs from a Program Stream in that the PES packets are further subdivided into short fixed-size packets and in that multiple programs encoded with different clocks can be carried. This is possible because a transport stream has a program clock reference (PCR) mechanism which allows transmission of multiple clocks, one of which is selected and regenerated at the decoder. A Single Program Transport Stream (SPTS) is also possible, and this may be found between a coder and a multiplexer. Since a Transport Stream can genlock the decoder clock to the encoder clock, the SPTS is more common than the Program Stream.

A Transport Stream is more than just a multiplex of audio and video PES. In addition to the compressed audio, video and data, a Transport Stream includes a great deal of metadata describing the bit stream. This includes the Program Association Table (PAT) that lists every program in the transport stream. Each entry in the PAT points to a Program Map Table (PMT) that lists the elementary streams making up each program. Some programs will be open, but some programs may be subject to conditional access (encryption), and this information is also carried in the metadata.

The Transport Stream consists of fixed-size data packets, each containing 188 bytes. Each packet carries a packet identifier code (PID). Packets in the same elementary stream all have the same PID, so that the decoder (or a demultiplexer) can select the elementary stream(s) it wants and reject the remainder. Packet-continuity counts ensure that every packet that is needed to decode a stream is received. An effective synchronization system is needed so that decoders can correctly identify the beginning of each packet and deserialize the bit stream into words.

Figure 1.3. Video and audio encoders feed packetizers; the resulting PES (plus data) are combined either by a Program Stream multiplexer (as on DVD) or by a Transport Stream multiplexer (producing, for a single program, an SPTS).
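The fixed 188-byte packet structure makes the demultiplexer's first steps easy to express in code. The following Python sketch parses the standard header fields (sync byte 0x47, 13-bit PID, continuity counter); the packet itself would come from whatever stream source is being analyzed.

# Sketch: parse the 4-byte header of one 188-byte transport stream packet.
def parse_ts_header(packet: bytes) -> dict:
    if len(packet) != 188 or packet[0] != 0x47:       # 0x47 is the sync byte
        raise ValueError("not a valid transport stream packet")
    return {
        "transport_error":    bool(packet[1] & 0x80),
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid":                ((packet[1] & 0x1F) << 8) | packet[2],
        "continuity_counter": packet[3] & 0x0F,
    }

# A demultiplexer keeps packets whose PID matches a wanted elementary
# stream and checks that the continuity counter increments modulo 16.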
1.7 Need for monitoring and analysis
The MPEG transport stream is an extremely complex structure using interlinked tables and coded identifiers to separate the programs and the elementary streams within the programs. Within each elementary stream, there is a complex structure, allowing a decoder to distinguish between, for example, vectors, coefficients and quantization tables.

Failures can be divided into two broad categories. In the first category, the transport system correctly multiplexes and delivers information from an encoder to a decoder with no bit errors or added jitter, but the encoder or the decoder has a fault. In the second category, the encoder and decoder are fine, but the transport of data from one to the other is defective. It is very important to know whether the fault lies in the encoder, the transport, or the decoder if a prompt solution is to be found.

Synchronizing problems, such as loss or corruption of sync patterns, may prevent reception of the entire transport stream. Transport-stream protocol defects may prevent the decoder from finding all of the data for a program, perhaps delivering picture but not sound. Correct delivery of the data but with excessive jitter can cause decoder timing problems.

If a system using an MPEG transport stream fails, the fault could be in the encoder, the multiplexer, or in the decoder. How can this fault be isolated? First, verify that the transport stream is compliant with the MPEG-coding standards. If the stream is not compliant, a decoder can hardly be blamed for having difficulty. If it is, the decoder may need attention.

Traditional video testing tools, the signal generator, the waveform monitor and the vectorscope, are not appropriate in analyzing MPEG systems, except to ensure that the video signals entering and leaving an MPEG system are of suitable quality. Instead, a reliable source of valid MPEG test signals is essential for testing receiving equipment and decoders. With a suitable analyzer, the performance of encoders, transmission systems, multiplexers and remultiplexers can be assessed with a high degree of confidence. As a long-standing supplier of high-grade test equipment to the video industry, Tektronix continues to provide test and measurement solutions as the technology evolves, giving the MPEG user the confidence that complex compressed systems are correctly functioning and allowing rapid diagnosis when they are not.
1.8 Pitfalls of compression
MPEG compression is lossy in that what is decoded is not identical to the original. The entropy of the source varies, and when entropy is high, the compression system may leave visible artifacts when decoded.

In temporal compression, redundancy between successive pictures is assumed. When this is not the case, the system fails. An example is video from a press conference where flashguns are firing. Individual pictures containing the flash are totally different from their neighbors, and coding artifacts become obvious.

Irregular motion or several independently moving objects on screen require a lot of vector bandwidth, and this requirement may only be met by reducing the picture-data bandwidth. Again, visible artifacts may occur whose level varies and depends on the motion. This problem often occurs in sports-coverage video.

Coarse quantizing results in luminance contouring and posterized color. These can be seen as blotchy shadows and blocking on large areas of plain color. Subjectively, compression artifacts are more annoying than the relatively constant impairments resulting from analog television transmission systems.

The only solution to these problems is to reduce the compression factor. Consequently, the compression user has to make a value judgment between the economy of a high compression factor and the level of artifacts.

In addition to extending the encoding and decoding delay, temporal coding also causes difficulty in editing. In fact, an MPEG bit stream cannot be arbitrarily edited at all. This restriction occurs because, in temporal coding, the decoding of one picture may require the contents of an earlier picture, and the contents may not be available following an edit. The fact that pictures may be sent out of sequence also complicates editing.

If suitable coding has been used, edits can take place only at splice points, which are relatively widely spaced. If arbitrary editing is required, the MPEG stream must undergo a read-modify-write process, which will result in generation loss.

The viewer is not interested in editing, but the production user will have to make another value judgment about the edit flexibility required. If greater flexibility is required, the temporal compression has to be reduced and a higher bit rate will be needed.
SECTION 2
COMPRESSION IN VIDEO

This section shows how video compression is based on the perception of the eye. Important enabling techniques, such as transforms and motion compensation, are considered as an introduction to the structure of an MPEG coder.

2.1 Spatial or temporal coding?
As was seen in Section 1, video compression can take advantage of both spatial and temporal redundancy. In MPEG, temporal redundancy is reduced first by using similarities between successive pictures. As much as possible of the current picture is created or "predicted" by using information from pictures already sent. When this technique is used, it is only necessary to send a difference picture, which eliminates the differences between the actual picture and the prediction. The difference picture is then subject to spatial compression. As a practical matter, it is easier to explain spatial compression prior to explaining temporal compression.

Spatial compression relies on similarities between adjacent pixels in plain areas of picture and on dominant spatial frequencies in areas of patterning. The JPEG system uses spatial compression only, since it is designed to transmit individual still pictures. However, JPEG may be used to code a succession of individual pictures for video. In the so-called "Motion JPEG" application, the compression factor will not be as good as if temporal coding was used, but the bit stream will be freely editable on a picture-by-picture basis.

2.2 Spatial coding
The first step in spatial coding is to perform an analysis of spatial frequency using a transform. A transform is simply a way of expressing a waveform in a different domain, in this case, the frequency domain. The output of a transform is a set of coefficients that describe how much of a given frequency is present. An inverse transform reproduces the original waveform. If the coefficients are handled with sufficient accuracy, the output of the inverse transform is identical to the original waveform.

The most well-known transform is the Fourier transform. This transform finds each frequency in the input signal. It finds each frequency by multiplying the input waveform by a sample of a target frequency, called a basis function, and integrating the product. Figure 2.1 shows that when the input waveform does not contain the target frequency, the integral will be zero, but when it does, the integral will be a coefficient describing the amplitude of that component frequency.

Figure 2.1. Correlation with a basis function: high correlation if the frequency is the same; no correlation if the frequency is different.

The results will be as described if the frequency component is in phase with the basis function. However, if the frequency component is in quadrature with the basis function, the integral will still be zero. Therefore, it is necessary to perform two searches for each frequency, with the basis functions in quadrature with one another, so that every phase of the input will be detected.

The Fourier transform has the disadvantage of requiring coefficients for both sine and cosine components of each frequency. In the cosine transform, the input waveform is time-mirrored with itself prior to multiplication by the basis functions. Figure 2.2 shows that this mirroring cancels out all sine components (the sine component inverts at the mirror and cancels) and doubles all of the cosine components. The sine basis function is unnecessary, and only one coefficient is needed for each frequency.

The discrete cosine transform (DCT) is the sampled version of the cosine transform and is used extensively in two-dimensional form in MPEG. A block of 8 x 8 pixels is transformed to become a block of 8 x 8 coefficients. Since the transform requires multiplication by fractions, there is wordlength extension, resulting in coefficients that have longer wordlength than the pixel values.
Typically, an 8-bit pixel block results in an 11-bit coefficient block. Thus, a DCT does not result in any compression; in fact, it results in the opposite. However, the DCT converts the source pixels into a form where compression is easier.

Figure 2.3 shows the results of an inverse transform of each of the individual coefficients of an 8 x 8 DCT. In the case of the luminance signal, the top-left coefficient is the average brightness or DC component of the whole block. Moving across the top row, horizontal spatial frequency increases. Moving down the left column, vertical spatial frequency increases. In real pictures, different vertical and horizontal spatial frequencies may occur simultaneously, and a coefficient at some point within the block will represent all possible horizontal and vertical combinations.
Figure 2.3 also shows the coefficients as one-dimensional horizontal spatial frequency waveforms. Combining these waveforms with various amplitudes and either polarity can reproduce any combination of 8 pixels. Thus, combining the 64 coefficients of the 2-D DCT will result in the original 8 x 8 pixel block. Clearly, for color pictures, the color-difference samples will also need to be handled. Y, Cr, and Cb data are assembled into separate 8 x 8 arrays and are transformed individually.
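A two-dimensional DCT of a single 8 x 8 block can be demonstrated in a few lines of Python using SciPy; the smooth test block here is invented for illustration.

# Sketch: 8 x 8 DCT of one luminance block.
import numpy as np
from scipy.fft import dctn

block = np.tile(np.arange(8, dtype=float), (8, 1))  # smooth horizontal ramp
coeffs = dctn(block, norm="ortho")                  # 8 x 8 coefficients

# The top-left (DC) coefficient is proportional to the block average.
# For this smooth block, all significant energy lies in the first row
# (horizontal frequencies); most of the 64 coefficients are near zero.
print(np.round(coeffs, 1))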
In much real program material, many of the coefficients will have zero or near-zero values and, therefore, will not be transmitted. This fact results in significant compression that is virtually lossless. If a higher compression factor is needed, then the wordlength of the non-zero coefficients must be reduced. This reduction will reduce the accuracy of these coefficients and will introduce losses into the process. With care, the losses can be introduced in a way that is least visible to the viewer.
2.3 Weighting
Figure 2.4 shows that the human perception of noise in pictures is not uniform but is a function of the spatial frequency. More noise can be tolerated at high spatial frequencies. Also, video noise is effectively masked by fine detail in the picture, whereas in plain areas it is highly visible. The reader will be aware that traditional noise measurements are always weighted so that the technical measurement relates to the subjective result.

Figure 2.4. Human vision sensitivity to noise falls as spatial frequency rises.

Compression reduces the accuracy of coefficients and has a similar effect to using shorter-wordlength samples in PCM; that is, the noise level rises. In PCM, the result of shortening the wordlength is that the noise level rises equally at all frequencies. As the DCT splits the signal into different frequencies, it becomes possible to control the spectrum of the noise. Effectively, low-frequency coefficients are rendered more accurately than high-frequency ones.
Figure 2.5 shows that, in the weighting process, the coefficients from the DCT are divided by constants that are a function of two-dimensional frequency. Low-frequency coefficients will be divided by small numbers, and high-frequency coefficients will be divided by large numbers. Following the division, the least-significant bit is discarded or truncated. This truncation is a form of requantizing. In the absence of weighting, this requantizing would have the effect of doubling the size of the quantizing step, but with weighting, it increases the step size according to the division factor.

As a result, coefficients representing low spatial frequencies are requantized with relatively small steps and suffer little increased noise. Coefficients representing higher spatial frequencies are requantized with large steps and suffer more noise. However, fewer steps means that fewer bits are needed to identify the step, and a compression is obtained.

In the decoder, a low-order zero will be added to return the weighted coefficients to their correct magnitude. They will then be multiplied by inverse weighting factors. Clearly, at high frequencies the multiplication factors will be larger, so the requantizing noise will be greater. Following inverse weighting, the coefficients will have their original DCT output values, plus requantizing error, which will be greater at high frequency than at low frequency.

As an alternative to truncation, weighted coefficients may be nonlinearly requantized so that the quantizing step size increases with the magnitude of the coefficient. This technique allows higher compression factors but worse levels of artifacts.

Clearly, the degree of compression obtained and, in turn, the output bit rate obtained, is a function of the severity of the requantizing process. Different bit rates will require different weighting tables. In MPEG, it is possible to use various different weighting tables, and the table in use can be transmitted to the decoder, so that correct decoding automatically occurs.

Figure 2.5. Weighting: input DCT coefficients are divided by a quantizing matrix value (chosen according to the coefficient location) and by a quantizing scale value (one value used for a complete 8 x 8 block) before truncation. The values shown in the original figure are illustrative, not actual results.
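The weighting process can be sketched as follows; the 4 x 4 matrices are invented for brevity, whereas real MPEG uses 8 x 8 quantizing matrices together with a quantizing scale value.

# Sketch: requantize DCT coefficients with frequency-dependent step sizes.
import numpy as np

coeffs = np.array([[620., 40., 8., 2.],
                   [ 35., 12., 3., 1.],
                   [  6.,  3., 1., 0.],
                   [  2.,  1., 0., 0.]])
quant = np.array([[ 8., 16., 24., 32.],       # divisors grow with
                  [16., 24., 32., 48.],       # spatial frequency
                  [24., 32., 48., 64.],
                  [32., 48., 64., 96.]])

levels = np.round(coeffs / quant)     # small integers: cheap to transmit
restored = levels * quant             # decoder applies inverse weighting
print(coeffs - restored)              # requantizing error is largest where
                                      # the steps (high frequencies) are big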
2.4 Scanning
In typical program material, the significant DCT coefficients are generally found in the top-left corner of the matrix. After weighting, low-value coefficients might be truncated to zero. More efficient transmission can be obtained if all of the non-zero coefficients are sent first, followed by a code indicating that the remainder are all zero. Scanning is a technique which increases the probability of achieving this result, because it sends coefficients in descending order of magnitude probability.

Figure 2.6a shows that in a non-interlaced system, the probability of a coefficient having a high value is highest in the top-left corner and lowest in the bottom-right corner. A 45-degree diagonal zig-zag scan is the best sequence to use here.

In Figure 2.6b, the scan for an interlaced source is shown. In an interlaced picture, an 8 x 8 DCT block from one field extends over twice the vertical screen area, so that for a given picture detail, vertical frequencies will appear to be twice as great as horizontal frequencies. Thus, the ideal scan for an interlaced picture will be on a diagonal that is twice as steep. Figure 2.6b shows that a given vertical spatial frequency is scanned before scanning the same horizontal spatial frequency.

Figure 2.6. Scan patterns: (a) the classic zig-zag scan, nominally for frames; (b) the steeper alternate scan for interlaced sources.
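The classic zig-zag sequence can be generated by ordering coefficient positions along the anti-diagonals, as in this Python sketch:

# Sketch: generate the classic 45-degree zig-zag scan for an n x n block.
def zigzag_order(n=8):
    # Positions on one anti-diagonal share row + col; the direction of
    # travel alternates on successive diagonals.
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

print(zigzag_order()[:6])   # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]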
2.5 Entropy coding
In real video, not all spatial frequencies are simultaneously present; therefore, the DCT coefficient matrix will have zero terms in it. Despite the use of scanning, zero coefficients will still appear between the significant values. Run length coding (RLC) allows these coefficients to be handled more efficiently. Where repeating values, such as a string of zeros, are present, run length coding simply transmits the number of zeros rather than each individual bit.

The probability of occurrence of particular coefficient values in real video can be studied. In practice, some values occur very often; others occur less often. This statistical information can be used to achieve further compression using variable length coding (VLC). Frequently occurring values are converted to short code words, and infrequent values are converted to long code words. To aid deserialization, no code word can be the prefix of another.
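Run length coding of a scanned coefficient list can be sketched as follows; real MPEG then maps the resulting (run, value) pairs to variable length codes from standardized tables, a step only hinted at in the comments here.

# Sketch: run length code a zig-zag scanned coefficient list.
def run_length_encode(coeffs):
    trimmed = list(coeffs)
    while trimmed and trimmed[-1] == 0:   # everything after the last
        trimmed.pop()                     # non-zero value becomes EOB
    pairs, run = [], 0
    for value in trimmed:
        if value == 0:
            run += 1                      # count zeros rather than send them
        else:
            pairs.append((run, value))
            run = 0
    return pairs + ["EOB"]

print(run_length_encode([28, 6, 0, 0, 3, 0, 1, 0, 0, 0]))
# [(0, 28), (0, 6), (2, 3), (1, 1), 'EOB']; the most frequent pairs
# would then be given the shortest variable length codes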
2.6 A spatial coder
Figure 2.7 ties together all of the preceding spatial coding concepts. The input signal is assumed to be 4:2:2 SDI (Serial Digital Interface), which may have 8- or 10-bit wordlength. MPEG uses only 8-bit resolution; therefore, a rounding stage will be needed when the SDI signal contains 10-bit words. Most MPEG profiles operate with 4:2:0 sampling; therefore, a vertical low-pass filter/interpolation stage will be needed. Rounding and color subsampling introduce a small irreversible loss of information and a proportional reduction in bit rate. The raster-scanned input format will need to be stored so that it can be converted to 8 x 8 pixel blocks.

The DCT stage transforms the picture information to the frequency domain. The DCT itself does not achieve any compression. Following DCT, the coefficients are weighted and truncated, providing the first significant compression. The coefficients are then zig-zag scanned to increase the probability that the significant coefficients occur early in the scan. After the last non-zero coefficient, an EOB (end of block) code is generated.

Coefficient data are further compressed by run length and variable length coding. In a variable bit-rate system, the quantizing is fixed, but in a fixed bit-rate system, a buffer memory is used to absorb variations in coding difficulty. Highly detailed pictures will tend to fill the buffer, whereas plain pictures will allow it to empty. If the buffer is in danger of overflowing, the requantizing steps will have to be made larger, so that the compression factor is effectively raised.

In the decoder, the bit stream is deserialized and the entropy coding is reversed to reproduce the weighted coefficients. The inverse weighting is applied, and coefficients are placed in the matrix according to the zig-zag scan to recreate the DCT matrix; an inverse transform then recreates the pixel blocks. To obtain a 4:2:2 output from 4:2:0 data, a vertical interpolation process will be needed, as shown in Figure 2.8. The chroma samples in 4:2:0 are positioned halfway between luminance samples in the vertical axis so that they are evenly spaced when an interlaced source is used.

Figure 2.7. A complete spatial coder: 4:2:2 to 4:2:0 conversion (no data reduction), DCT (no loss, no reduction), weighting (fewer bits per coefficient, with preference given to certain coefficients; information lost, data reduced), zig-zag scanning, run length coding (a unique code word instead of strings of zeros) and variable length coding (short words for the most frequent values, like Morse code; data reduced with no loss), with rate control acting on the quantizing.

2.7 Temporal coding
Temporal redundancy can be exploited by inter-coding or transmitting only the differences between pictures. Figure 2.9 shows that a one-picture delay combined with a subtractor can compute the picture differences. The picture difference is an image in its own right and can be further compressed by the spatial coder as was previously described. The decoder reverses the spatial coding and adds the difference picture to the previous picture to obtain the next picture.

There are some disadvantages to this simple system. First, as only differences are sent, it is impossible to begin decoding after the start of the transmission. This limitation makes it difficult for a decoder to provide pictures following a switch from one bit stream to another (as occurs when the viewer changes channels). Second, if any part of the difference data is incorrect, the error in the picture will propagate indefinitely.

The solution to these problems is to use a system that is not completely differential. Figure 2.10 shows that periodically complete pictures are sent. These are called Intra-coded pictures (or I-pictures), and they are obtained by spatial compression only. If an error or a channel switch occurs, it will be possible to resume correct decoding at the next I-picture.

Figure 2.9. A one-picture delay and a subtractor compute picture differences.
Figure 2.10. Periodic complete (I) pictures limit error propagation.
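The purely differential scheme of Figure 2.9 can be sketched directly; the tiny "pictures" below are invented arrays.

# Sketch: differential coding and decoding as in Figure 2.9.
import numpy as np

pictures = [np.full((4, 4), 100), np.full((4, 4), 102)]
pictures[1][1, 1] = 160                    # one changed detail

sent = [pictures[0]]                       # first picture sent complete
for prev, cur in zip(pictures, pictures[1:]):
    sent.append(cur - prev)                # difference: mostly small values

decoded = [sent[0]]
for diff in sent[1:]:
    decoded.append(decoded[-1] + diff)     # decoder adds differences back

assert all((a == b).all() for a, b in zip(pictures, decoded))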
2.8 Motion compensation
Motion reduces the similarities between pictures and increases the data needed to create the difference picture. Motion compensation is used to increase the similarity. Figure 2.11 shows the principle. When an object moves across the TV screen, it may appear in a different place in each picture, but it does not change in appearance very much. The picture difference can be reduced by measuring the motion at the encoder. This is sent to the decoder as a vector. The decoder uses the vector to shift part of the previous picture to a more appropriate place in the new picture.

Figure 2.11. Motion compensation of part of a moving object. Actions: 1. Compute the motion vector. 2. Shift data from picture N using the vector to make a predicted picture N+1. 3. Compare the actual picture with the predicted picture. 4. Send the vector and the prediction error.

One vector controls the shifting of an entire area of the picture that is known as a macroblock. The size of the macroblock is determined by the DCT coding and the color subsampling structure. Figure 2.12a shows that, with a 4:2:0 system, the vertical and horizontal spacing of color samples is exactly twice the spacing of luminance. A single 8 x 8 DCT block of color samples extends over the same area as four 8 x 8 luminance blocks; therefore, this is the minimum picture area which can be shifted by a vector. One 4:2:0 macroblock contains four luminance blocks, one Cr block and one Cb block.

In the 4:2:2 profile, color is only subsampled in the horizontal axis. Figure 2.12b shows that in 4:2:2, a single 8 x 8 DCT block of color samples extends over two luminance blocks. A 4:2:2 macroblock contains four luminance blocks, two Cr blocks and two Cb blocks.

Figure 2.12. (a) 4:2:0 has one quarter as many chroma sampling points as Y. (b) 4:2:2 has twice as much chroma data as 4:2:0.

The motion estimator works by comparing the luminance data from two successive pictures. A macroblock in the first picture is used as a reference. When the input is interlaced, pixels will be at different vertical locations in the two fields, and it will, therefore, be necessary to interpolate one field before it can be compared with the other. The correlation between the reference and the next picture is measured at all possible displacements with a resolution of half a pixel over the entire search range. When the greatest correlation is found, this correlation is assumed to represent the correct motion.

The motion vector has a vertical and a horizontal component. In typical program material, motion continues over a number of pictures. A greater compression factor is obtained if the vectors are transmitted differentially. Consequently, if an object moves at constant speed, the vectors do not change and the vector difference is zero.

Motion vectors are associated with macroblocks, not with real objects in the image, and there will be occasions where part of the macroblock moves and part of it does not. In this case, it is impossible to compensate properly. If the motion of the moving part is compensated by transmitting a vector, the stationary part will be incorrectly shifted, and it will need difference data to be corrected. If no vector is sent, the stationary part will be correct, but difference data will be needed to correct the moving part. A practical compressor might attempt both strategies and select the one which required the least difference data.
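An exhaustive block-matching search of the kind the motion estimator performs can be sketched as follows; only integer-pixel positions are tried here, whereas a real estimator refines the result to half-pixel resolution.

# Sketch: find the motion vector minimizing the sum of absolute
# differences (SAD) between a macroblock and the previous picture.
# Arrays are assumed to be signed (e.g., int32) luminance data.
import numpy as np

def best_vector(prev, block, top, left, search=4):
    h, w = block.shape
    best, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= prev.shape[0] - h and 0 <= x <= prev.shape[1] - w:
                sad = np.abs(prev[y:y+h, x:x+w] - block).sum()
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
    return best, best_sad   # vector and residual cost of the best match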
2.9 Bidirectional coding
When an object moves, it conceals the background at its leading edge and reveals the background at its trailing edge. The revealed background requires new data to be transmitted, because the area of background was previously concealed and no information can be obtained from a previous picture. A similar problem occurs if the camera pans: new areas come into view and nothing is known about them. MPEG helps to minimize this problem by using bidirectional coding, which allows information to be taken from pictures before and after the current picture. If a background is being revealed, it will be present in a later picture, and the information can be moved backwards in time to create part of an earlier picture.

Figure 2.13 shows the concept of bidirectional coding. On an individual macroblock basis, a bidirectionally coded picture can obtain motion-compensated data from an earlier or later picture, or even use an average of earlier and later data. Bidirectional coding significantly reduces the amount of difference data needed by improving the degree of prediction possible. MPEG does not specify how an encoder should be built, only what constitutes a compliant bit stream. However, an intelligent compressor could try all three coding strategies and select the one that results in the least data to be transmitted.

Figure 2.13. Bidirectional coding: an area revealed by a moving object is not in picture N but is present in picture N+2, so picture N+1 can take that data from the later picture.

2.10 I, P and B pictures
In MPEG, three different types of pictures are needed to support differential and bidirectional coding while minimizing error propagation:

I pictures are Intra-coded pictures that need no additional information for decoding. They require a lot of data compared to other picture types, and therefore they are not transmitted any more frequently than necessary. They consist primarily of transform coefficients and have no vectors. I pictures allow the viewer to switch channels, and they arrest error propagation.

P pictures are forward Predicted from an earlier picture, which could be an I picture or a P picture. P-picture data consist of vectors describing where, in the previous picture, each macroblock should be taken from, and of transform coefficients that describe the correction or difference data that must be added to that macroblock. P pictures require roughly half the data of an I picture.

B pictures are Bidirectionally predicted from earlier and later I or P pictures. They require less data than P pictures, and they are never themselves used as the basis for further predictions.
Figure 2.14 introduces the concept of the GOP or Group Of Pictures. The GOP begins with an I picture and then has P pictures spaced throughout. The remaining pictures are B pictures. The GOP is defined as ending at the last picture before the next I picture. The GOP length is flexible, but 12 or 15 pictures is a common value.

Clearly, if data for B pictures are to be taken from a future picture, that data must already be available at the decoder. Consequently, bidirectional coding requires that picture data is sent out of sequence and temporarily stored. Figure 2.14 also shows that the P-picture data are sent before the B-picture data. Note that the last B pictures in the GOP cannot be transmitted until after the I picture of the next GOP, since this data will be needed to bidirectionally decode them. In order to return pictures to their correct sequence, a temporal reference is included with each picture. As the picture rate is also embedded periodically in headers in the bit stream, an MPEG file may be displayed by, for example, a personal computer, in the correct order and timescale.

Figure 2.14. A GOP of Rec. 601 video frames: the elementary stream carries pictures out of display order, and the temporal_reference field restores the correct sequence.

Sending picture data out of sequence requires additional memory at the encoder and decoder and also causes delay. The number of bidirectionally coded pictures used is therefore a trade-off between compression efficiency on one hand and delay, memory and editability on the other. If the ability to edit is important, an IB sequence is a useful compromise.
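The reordering can be sketched as a simple rule: each anchor (I or P) picture is transmitted before the B pictures that depend on it.

# Sketch: display order to transmission order for one GOP.
def transmission_order(display):           # e.g. "IBBPBBPBBPBB"
    out, pending_b = [], []
    for pic in display:
        if pic in "IP":                    # anchor: send it, then release
            out.append(pic)                # the B pictures held back
            out += pending_b
            pending_b = []
        else:
            pending_b.append(pic)
    return "".join(out), "".join(pending_b)

print(transmission_order("IBBPBBPBBPBB"))
# ('IPBBPBBPBB', 'BB'): the final B pictures must wait for the
# next GOP's I picture, as noted above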
2.11 An MPEG compressor
Figures 2.16a, b, and c show a typical bidirectional motion compensator structure. Preprocessed input video enters a series of frame stores that can be bypassed to change the picture order. The data then enter the subtractor and the motion estimator. To create an I picture, see Figure 2.16a, the end of the input delay is selected and the subtractor is turned off, so that the data pass straight through to be spatially coded. Subtractor output data also pass to a frame store that can hold several pictures. The I picture is held in the store.

Figure 2.16a. I pictures: the subtract stage is bypassed and the input is spatially coded directly; shaded areas of the structure (motion estimator, forward and backward predictors) are unused.
Figure 2.16b. P pictures: the motion estimator and forward predictor operate on the past picture held in the output store; shaded areas (the backward predictor) are unused.
To encode a P picture, see Figure 2.16b, the B pictures in the input buffer are bypassed, so that the future picture is selected. The motion estimator compares the I picture in the output store with the P picture in the input store to create forward motion vectors. The I picture is shifted by these vectors to make a predicted P picture. The predicted P picture is subtracted from the actual P picture to produce the prediction error, which is spatially coded and sent along with the vectors. The prediction error is also added to the predicted P picture to create a locally decoded P picture that also enters the output store. This means that the output store contains exactly what the store in the decoder will contain, so that the results of all previous coding errors are present; these will automatically be reduced when the predicted picture is subtracted from the actual picture.
TablesIn
SpatialCoder
RateControl
BackwardPredictionError
Disablefor I,P
SpatialData
VectorsOut
ForwardPredictionError
Norm
Reorder
ForwardVectors
BackwardVectors
GOPControl
CurrentPicture
OutF
MotionEstimator
PastPicture
SpatialDecoder
SubtractPass (I)
Forward
- Backward Decision
BackwardPredictor
FuturePicture
SpatialDecoder
B Pictures
(shaded area is unused)
output is spatially coded andthe vectors are added in a multi-plexer Syntactical data is alsoadded which identifies the type
of picture (I, P or B) and vides other information to help adecoder (see section 4) The out-put data are buffered to allowtemporary variations in bit rate
pro-If the bit rate shows a long termincrease, the buffer will tend tofill up and to prevent overflowthe quantization process willhave to be made more severe.Equally, should the buffer showsigns of underflow, the quantiza-tion will be relaxed to maintainthe average bit rate This meansthat the store contains exactlywhat the store in the decoderwill contain, so that the results
of all previous coding errors arepresent These will automatically
be reduced when the predicted
forward or backward data areselected according to which rep-resent the smallest differences
The picture differences are thenspatially coded and sent withthe vectors
When all of the intermediate
B pictures are coded, the inputmemory will once more bebypassed to create a new P pic-ture based on the previous
P picture
Figure 2.17 shows an MPEGcoder The motion compensator
The output store then contains
an I picture and a P picture A
B picture from the input buffercan now be selected The motioncompensator, see Figure 2.16c,will compare the B picture withthe I picture that precedes it andthe P picture that follows it toobtain bidirectional vectors
Forward and backward motioncompensation is performed toproduce two predicted B pictures
These are subtracted from thecurrent B picture On a macro-block-by-macroblock basis, the
In
OutDemandClock
QuantizingTablesSpatialDataMotionVectors
SyntacticalData
Entropy andRun Length CodingDifferentialCoder
Bidirectional
Coder
(Fig 2.13)
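The buffer-driven rate control described above amounts to a simple feedback loop, sketched here with invented constants:

# Sketch: buffer occupancy feedback adjusting the quantizer scale.
CHANNEL_BITS_PER_PICTURE = 150_000      # constant drain to the channel
BUFFER_SIZE = 1_000_000

occupancy, qscale = BUFFER_SIZE // 2, 8
for coded_bits in (450_000, 200_000, 60_000, 60_000):    # per picture
    occupancy += coded_bits - CHANNEL_BITS_PER_PICTURE
    if occupancy > 0.75 * BUFFER_SIZE:
        qscale = min(31, qscale + 2)    # nearing overflow: quantize harder
    elif occupancy < 0.25 * BUFFER_SIZE:
        qscale = max(1, qscale - 2)     # nearing underflow: relax
    print(occupancy, qscale)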
2.12 Preprocessing
A compressor attempts to eliminate redundancy within the picture and between pictures. Anything which reduces that redundancy is undesirable. Noise and film grain are particularly problematic because they generally occur over the entire picture. After the DCT process, noise results in more non-zero coefficients, which the coder cannot distinguish from genuine picture data. Heavier quantizing will be required to encode all of the coefficients, reducing picture quality. Noise also reduces similarities between successive pictures, increasing the difference data needed.

Residual subcarrier in video decoded from composite video is a serious problem because it results in high spatial frequencies that are normally at a low level in component programs. Subcarrier also alternates from picture to picture, causing an increase in difference data. Naturally, any composite decoding artifact that is visible in the input to the MPEG coder is likely to be reproduced at the decoder.

Any practice that causes unwanted motion is to be avoided. Unstable camera mountings, in addition to giving a shaky picture, increase picture differences and vector transmission requirements. This will also happen with telecine material if film weave or hop due to sprocket-hole damage is present. In general, video that is to be compressed must be of the highest quality possible. If high quality cannot be achieved, then noise reduction and other stabilization techniques will be desirable.

If a high compression factor is required, the level of artifacts can increase, especially if input quality is poor. In this case, it may be better to reduce the entropy entering the coder using prefiltering. The video signal is subject to two-dimensional low-pass filtering, which reduces the number of coefficients needed and reduces the level of artifacts. The picture will be less sharp, but less sharpness is preferable to a high level of artifacts.

In most MPEG-2 applications, 4:2:0 sampling is used, which requires a chroma downsampling process if the source is 4:2:2. In MPEG-1, the luminance and chroma are further downsampled to produce an input picture, or SIF (Source Input Format), that is only 352 pixels wide. This technique reduces the entropy by a further factor. For very high compression, the QSIF (Quarter Source Input Format) picture, which is 176 pixels wide, is used. Downsampling is a process that combines a spatial low-pass filter with an interpolator. Downsampling interlaced signals is problematic because vertical detail is spread over two fields, which may decorrelate due to motion.

When the source material is telecine, the video signal has different characteristics than normal video. In 50 Hz video, pairs of fields represent the same film frame, and there is no motion between them. Thus, the motion between fields alternates between zero and the motion between frames. Since motion vectors are sent differentially, this behavior would result in a serious increase in vector data. In 60 Hz video, 3:2 pulldown is used to obtain 60 Hz from 24 Hz film. One frame is made into two fields, the next is made into three fields, and so on. Consequently, one field in five is completely redundant. MPEG handles film material best by discarding the third field in 3:2 systems. A 24 Hz code in the transmission alerts the decoder to recreate the 3:2 sequence by re-reading a field store. In 50 and 60 Hz telecine, pairs of fields are deinterlaced to create frames, and then motion is measured between frames. The decoder can recreate interlace by reading alternate lines in the frame store.

A cut is a difficult event for a compressor to handle because it results in an almost complete prediction failure, requiring a large amount of correction data. If a coding delay can be tolerated, a coder may detect cuts in advance and modify the GOP structure dynamically, so that the cut is made to coincide with the generation of an I picture. In this case, the cut is handled with very little extra data. The last B pictures before the I frame will almost certainly need to use forward prediction. In some applications that are not real-time, such as DVD mastering, a coder could take two passes at the input video: one pass to identify the difficult or high-entropy areas and create a coding strategy, and a second pass to actually compress the input video.
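The 3:2 cadence is easy to see in code: each film frame alternately yields two or three fields, so every fifth field repeats.

# Sketch: 3:2 pulldown of 24 Hz film frames to 60 Hz fields.
def pulldown_32(frames):
    fields = []
    for i, frame in enumerate(frames):
        fields += [frame] * (2 if i % 2 == 0 else 3)
    return fields

print(pulldown_32(["A", "B", "C", "D"]))
# ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D', 'D']: one field in
# five is a repeat that the coder can discard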
2.13 Profiles and levels

MPEG is applicable to a wide range of applications requiring different performance and complexity. Using all of the encoding tools defined in MPEG, there are millions of possible combinations. For practical purposes, the MPEG-2 standard is divided into profiles, and each profile is subdivided into levels (see Figure 2.18). A profile is basically a subset of the entire coding repertoire requiring a certain complexity. A level is a parameter such as the size of the picture or the bit rate used with that profile. In principle, there are 24 combinations, but not all of these have been defined. An MPEG decoder having a given profile and level must also be able to decode lower profiles and levels.

The simple profile does not support bidirectional coding, and so only I and P pictures will be output. This reduces the coding and decoding delay and allows simpler hardware. The simple profile has only been defined at main level (SP@ML).

The main profile is designed for a large proportion of uses. The low level uses a low-resolution input having only 352 pixels per line. The majority of broadcast applications will require the MP@ML (Main Profile at Main Level) subset of MPEG, which supports SDTV (standard definition TV). The high-1440 level is a high-definition scheme that doubles the definition compared to the main level. The high level not only doubles the resolution but maintains that resolution with the 16:9 format by increasing the number of horizontal samples from 1440 to 1920.

In compression systems using spatial transforms and requantizing, it is possible to produce scaleable signals. A scaleable process is one in which the input results in a main signal and a "helper" signal. The main signal can be decoded alone to give a picture of a certain quality, but, if the information from the helper signal is added, some aspect of the quality can be improved.

For example, a conventional MPEG coder, by heavily requantizing coefficients, encodes a picture with a moderate signal-to-noise ratio. If, however, that picture is locally decoded and subtracted pixel-by-pixel from the original, a quantizing-noise picture results. This picture can be compressed and transmitted as the helper signal. A simple decoder only decodes the main, noisy bit stream, but a more complex decoder can decode both bit streams and combine them to produce a low-noise picture. This is the principle of SNR scaleability.

As an alternative, coding only the lower spatial frequencies in an HDTV picture can produce a main bit stream that an SDTV receiver can decode. If the lower-definition picture is locally decoded and subtracted from the original picture, a definition-enhancing picture results. This picture can be coded into a helper signal. A suitable decoder could combine the main and helper signals to recreate the HDTV picture. This is the principle of spatial scaleability. The high profile supports both SNR and spatial scaleability as well as allowing the option of 4:2:2 sampling.

The 4:2:2 profile has been developed for improved compatibility with digital production equipment. This profile allows 4:2:2 operation without requiring the additional complexity of using the high profile. For example, an HP@ML decoder must support SNR scaleability, which is not a requirement for production. The 4:2:2 profile has the same freedom of GOP structure as other profiles, but in practice it is commonly used with short GOPs, making editing easier. 4:2:2 operation requires a higher bit rate than 4:2:0, and the use of short GOPs requires an even higher bit rate for a given quality.
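The two-layer idea behind SNR scaleability can be made concrete with a small numeric sketch. This is illustrative only and not the MPEG-2 syntax: it assumes NumPy, models coarse requantizing as a large uniform step, and sends the quantization residual as the helper.

```python
# Illustrative SNR-scaleability sketch: the base layer is a coarsely
# requantized picture; the helper carries the residual between the
# locally decoded base and the original. Step sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
original = rng.integers(0, 256, (8, 8)).astype(float)  # stand-in for a picture

def quantize(x, step):
    return np.round(x / step) * step   # encode + local decode in one step

base = quantize(original, 32)          # main bit stream: coarse, fewer bits
helper = quantize(original - base, 4)  # helper: finely coded residual

simple_decoder = base                  # decodes the main stream only
full_decoder = base + helper           # combines both layers

print("base-layer error:", np.abs(original - simple_decoder).max())
print("two-layer error: ", np.abs(original - full_decoder).max())
```

The simple decoder's error is bounded by the coarse step, while adding the helper shrinks the error to that of the fine step, which is the quality improvement the text describes.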
Figure 2.18. MPEG-2 profiles and levels. Each defined profile/level combination specifies a maximum picture size, bit rate, picture types, and chroma sampling:

  SP@ML         720x576     15 Mb/s   I,P     4:2:0
  MP@LL         352x288      4 Mb/s   I,P,B   4:2:0
  MP@ML         720x576     15 Mb/s   I,P,B   4:2:0
  MP@H-14       1440x1152   60 Mb/s   I,P,B   4:2:0
  MP@HL         1920x1152   80 Mb/s   I,P,B   4:2:0
  SNR@LL        352x288      4 Mb/s   I,P,B   4:2:0
  SNR@ML        720x576     15 Mb/s   I,P,B   4:2:0
  Spatial@H-14  1440x1152   60 Mb/s   I,P,B   4:2:0
  HP@ML         720x576     20 Mb/s   I,P,B   4:2:0, 4:2:2
  HP@H-14       1440x1152   80 Mb/s   I,P,B   4:2:0, 4:2:2
  HP@HL         1920x1152  100 Mb/s   I,P,B   4:2:0, 4:2:2
  4:2:2P@ML     720x608     50 Mb/s   I,P,B   4:2:2
2.14 Wavelets

All transforms suffer from uncertainty: the more accurately the frequency domain is known, the less accurately the time domain is known (and vice versa). In most transforms, such as the DFT and DCT, the block length is fixed, so the time and frequency resolution is fixed. The frequency coefficients represent evenly spaced values on a linear scale. Unfortunately, because human senses are logarithmic, the even scale of the DFT and DCT gives inadequate frequency resolution at one end and excess resolution at the other.

The wavelet transform is not affected by this problem because its frequency resolution is a fixed fraction of an octave and therefore has a logarithmic characteristic. This is done by changing the block length as a function of frequency; as frequency goes down, the block becomes longer. Thus, a characteristic of the wavelet transform is that the basis functions all contain the same number of cycles, and these cycles are simply scaled along the time axis to search for different frequencies. Figure 2.19 contrasts the fixed block size of the DFT/DCT with the variable size of the wavelet.

Wavelets are especially useful for audio coding because they automatically adapt to the conflicting requirements of the accurate location of transients in time and the accurate assessment of pitch in steady tones.

For video coding, wavelets have the advantage of producing resolution-scaleable signals with almost no extra effort. In moving video, the advantages of wavelets are offset by the difficulty of assigning motion vectors to a variable-size block, but in still-picture or I-picture coding this difficulty does not arise.

Figure 2.19. The FFT uses constant-size windows, whereas the wavelet transform keeps a constant number of cycles in each basis function.
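The constant-cycles, octave-band behavior can be seen in code with a minimal Haar decomposition (assuming NumPy; Haar is chosen here only for brevity, not because MPEG uses it):

```python
# Minimal Haar-wavelet sketch: each decomposition stage halves the sample
# rate of the low band, so detail coefficients at deeper stages describe
# lower frequencies with longer effective windows -- a constant relative
# (one-octave) bandwidth, as the text describes.
import numpy as np

def haar_decompose(x, levels):
    """Repeatedly split x into averages (low band) and differences (high band)."""
    bands = []
    low = np.asarray(x, dtype=float)
    for _ in range(levels):
        even, odd = low[0::2], low[1::2]
        bands.append((even - odd) / np.sqrt(2))   # detail: top octave of 'low'
        low = (even + odd) / np.sqrt(2)           # average: remaining octaves
    bands.append(low)                             # final low band
    return bands

x = np.sin(2 * np.pi * 4 * np.arange(64) / 64)    # a low-frequency tone
for i, band in enumerate(haar_decompose(x, 3)):
    print(f"band {i}: {len(band)} coefficients")  # 32, 16, 8, 8: octave bands
```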
SECTION 3
AUDIO COMPRESSION

Lossy audio compression is based entirely on the characteristics of human hearing, which must be considered before any description of compression is possible. Surprisingly, human hearing, particularly in stereo, is actually more critically discriminating than human vision, and consequently audio compression should be undertaken with care. As with video compression, audio compression requires a number of different levels of complexity according to the required compression factor.

3.1 The hearing mechanism

Hearing comprises physical processes in the ear and nervous/mental processes that combine to give us an impression of sound. The impression we receive is not identical to the actual acoustic waveform present in the ear canal because some entropy is lost. Audio compression systems that lose only that part of the entropy that will be lost in the hearing mechanism will produce good results.

The physical hearing mechanism consists of the outer, middle, and inner ears. The outer ear comprises the ear canal and the eardrum. The eardrum converts the incident sound into a vibration in much the same way as does a microphone diaphragm. The inner ear works by sensing vibrations transmitted through a fluid. The impedance of fluid is much higher than that of air, and the middle ear acts as an impedance-matching transformer that improves power transfer.

Figure 3.1 shows that vibrations are transferred to the inner ear by the stirrup bone, which acts on the oval window. Vibrations in the fluid travel up the cochlea, a spiral cavity in the skull (shown unrolled in Figure 3.1 for clarity). The basilar membrane is stretched across the cochlea. This membrane varies in mass and stiffness along its length. At the end near the oval window, the membrane is stiff and light, so its resonant frequency is high. At the distant end, the membrane is heavy and soft and resonates at low frequency. The range of resonant frequencies available determines the frequency range of human hearing, which in most people is from 20 Hz to about 15 kHz.

Different frequencies in the input sound cause different areas of the membrane to vibrate. Each area has different nerve endings to allow pitch discrimination. The basilar membrane also has tiny muscles controlled by the nerves that together act as a kind of positive-feedback system that improves the Q factor of the resonance.

The resonant behavior of the basilar membrane is an exact parallel with the behavior of a transform analyzer. According to the uncertainty theory of transforms, the more accurately the frequency domain of a signal is known, the less accurately the time domain is known. Consequently, the better a transform can discriminate between two frequencies, the less able it is to discriminate between the times of two events. Human hearing has evolved with a compromise that balances time discrimination and frequency discrimination; in the balance, neither ability is perfect.

The imperfect frequency discrimination results in the inability to separate closely spaced frequencies. This inability is known as auditory masking, defined as the reduced sensitivity to sound in the presence of another.

Figure 3.1. The hearing mechanism: outer ear, eardrum, middle ear, stirrup bone, and the basilar membrane in the cochlea (shown unrolled) of the inner ear.
Figure 3.2a shows that the threshold of hearing is a function of frequency. The greatest sensitivity is, not surprisingly, in the speech range. In the presence of a single tone, the threshold is modified as in Figure 3.2b. Note that the threshold is raised for tones at higher frequencies and, to some extent, at lower frequencies. In the presence of a complex input spectrum, such as music, the threshold is raised at nearly all frequencies. One consequence of this behavior is that the hiss from an analog audio cassette is only audible during quiet passages in music. Companding makes use of this principle by amplifying low-level audio signals prior to recording or transmission and returning them to their correct level afterwards.

The imperfect time discrimination of the ear is due to its resonant response. The Q factor is such that a given sound has to be present for at least about 1 millisecond before it becomes audible. Because of this slow response, masking can still take place even when the two signals involved are not simultaneous. Forward and backward masking occur when the masking sound continues to mask sounds at lower levels before and after the masking sound's actual duration. Figure 3.3 shows this concept.

Masking raises the threshold of hearing, and compressors take advantage of this effect by raising the noise floor, which allows the audio waveform to be expressed with fewer bits. The noise floor can only be raised at frequencies at which there is effective masking. To maximize effective masking, it is necessary to split the audio spectrum into different frequency bands to allow introduction of different amounts of companding and noise in each band.

3.2 Subband coding

Figure 3.4 shows a band-splitting compandor. The band-splitting filter is a set of narrow-band, linear-phase filters that overlap and all have the same bandwidth. The output in each band consists of samples representing a waveform. In each frequency band, the audio input is amplified up to maximum level prior to transmission. Afterwards, each level is returned to its correct value. Noise picked up in the transmission is reduced in each band. If the noise reduction is compared with the threshold of hearing, it can be seen that greater noise can be tolerated in some bands because of masking. Consequently, in each band after companding, it is possible to reduce the wordlength of samples. This technique achieves compression because the noise introduced by the loss of resolution is masked.

Figure 3.2a. The threshold of hearing as a function of frequency (from 20 Hz).
Figure 3.4. A band-splitting compandor: sub-band filter and level-detect blocks controlled by a masking threshold.
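The core arithmetic can be sketched briefly. This is illustrative only, not the MPEG filter bank or psychoacoustic model: it assumes NumPy, and the band count, block length, and masking headroom figures are invented for the demonstration.

```python
# Illustrative subband companding sketch: per band, normalize a block by
# its scale factor, then requantize to a wordlength permitted by the
# assumed masking in that band. More masking -> fewer bits.
import numpy as np

rng = np.random.default_rng(2)
bands = rng.normal(0, [0.9, 0.2, 0.05, 0.01], (12, 4)).T  # 4 bands x 12 samples

# Hypothetical masking headroom per band, in bits that can be discarded.
discard_bits = np.array([2, 4, 6, 8])
wordlength = 16 - discard_bits

for samples, bits in zip(bands, wordlength):
    scale = np.abs(samples).max() or 1.0       # scale factor for this block
    levels = 2 ** (bits - 1)
    q = np.round(samples / scale * levels)     # requantize normalized samples
    decoded = q / levels * scale               # decoder reverses both steps
    err_db = 20 * np.log10(scale / np.abs(samples - decoded).max())
    print(f"{bits:2d}-bit band: worst-case error {err_db:5.1f} dB below scale")
```

The quantizing noise grows in the heavily masked bands, which is exactly where the ear cannot hear it.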
Figure 3.5 shows a simple band-splitting coder as is used in MPEG Layer 1. The digital audio input is fed to a band-splitting filter that divides the spectrum of the signal into a number of bands; in MPEG this number is 32. The time axis is divided into blocks of equal length. In MPEG Layer 1 a block is 384 input samples, so at the output of the filter there are 12 samples in each of the 32 bands. Within each band, the level is amplified by multiplication to bring it up to maximum. The gain required is constant for the duration of a block, and a single scale factor is transmitted with each block for each band in order to allow the process to be reversed at the decoder.

The filter-bank output is also analyzed to determine the spectrum of the input signal. This analysis drives a masking model that determines the degree of masking that can be expected in each band. The more masking available, the less accurate the samples in each band need to be. The sample accuracy is reduced by requantizing to reduce wordlength. This reduction is likewise constant for every word in a band, but different bands can use different wordlengths. The wordlength needs to be transmitted as a bit-allocation code for each band to allow the decoder to deserialize the bit stream properly.

3.3 MPEG Layer 1

Figure 3.6 shows an MPEG Layer 1 audio bit stream. Following the synchronizing pattern and the header, there are 32 bit-allocation codes of four bits each. These codes describe the wordlength of samples in each subband. Next come the 32 scale factors used in the companding of each band. These scale factors determine the gain needed in the decoder to return the audio to the correct level. The scale factors are followed, in turn, by the audio data in each band.

Figure 3.7 shows the Layer 1 decoder. The synchronization pattern is detected by the timing generator, which deserializes the bit-allocation and scale-factor data. The bit-allocation data then allows deserialization of the variable-length samples. The requantizing is reversed, and the compression is reversed by the scale-factor data to put each band back to the correct level. These 32 separate bands are then combined in a combiner filter, which produces the audio output.

Figure 3.5. A Layer 1 coder: a 32-band input filter feeding a dynamic bit and scale-factor allocator and coder (driven by masking thresholds) and a multiplexer.
Figure 3.6. A Layer 1 frame: 12-bit sync, 20-bit system header, optional CRC, bit allocation, scale factors, and subband samples for 384 PCM input samples (duration 8 ms at 48 kHz).
Figure 3.7. A Layer 1 decoder: demultiplexer, bit-allocation data, 32 sets of samples and scale factors, and a 32-input combiner filter.
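A rough frame-size walk-through follows the layout just described. This is schematic, not the normative Layer 1 syntax: the sync and header widths are taken from Figure 3.6, while the 6-bit scale-factor width and the packing order are assumptions made for the illustration.

```python
# Schematic Layer 1 frame walk (not the normative syntax): after sync and
# header come 32 four-bit allocation codes, then one scale factor per
# active band, then 12 samples per active band at the allocated wordlength.
def frame_bits(allocation):
    """allocation: 32 entries, each the wordlength (0 = band not sent)."""
    assert len(allocation) == 32
    bits = 12 + 20                      # sync + system header (per Figure 3.6)
    bits += 32 * 4                      # one 4-bit allocation code per band
    active = [w for w in allocation if w > 0]
    bits += len(active) * 6             # assumed 6-bit scale factor per band
    bits += sum(12 * w for w in active) # 12 samples per band, w bits each
    return bits

# Example: busy low bands keep long words; masked high bands send little.
alloc = [12] * 8 + [6] * 8 + [2] * 8 + [0] * 8
print(frame_bits(alloc), "bits per 384-sample (8 ms) frame")
```

Scaling the printed figure by 125 frames per second gives the bit rate the allocator must fit, which is how the dynamic bit and scale-factor allocator of Figure 3.5 trades wordlength against the masking model.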