Osama Al-Shaykh, et al. “Video Sequence Compression.”
2000 CRC Press LLC. <http://www.engnetbase.com>.
Video Sequence Compression
Discussion • Quantization • Coding of Quantized Symbols
55.3 Desirable Features
Scalability • Error Resilience
55.4 Standards
H.261 • MPEG-1 • MPEG-2 • H.263 • MPEG-4
Acknowledgment
References
The image and video processing literature is rich with video compression algorithms. This chapter overviews the basic blocks of most video compression systems, discusses some important features required by many applications, e.g., scalability and error resilience, and reviews the existing video compression standards such as H.261, H.263, MPEG-1, MPEG-2, and MPEG-4.
55.1 Introduction
Video sources produce data at very high bit rates. In many applications, the available bandwidth is usually very limited. For example, the bit rate produced by a 30 frame/s color common intermediate format (CIF) (352 × 288) video source is 73 Mbits/s. In order to transmit such a sequence over a 64 Kbits/s channel (e.g., an ISDN line), we need to compress the video sequence by a factor of 1140. A simple approach is to subsample the sequence in time and space. For example, if we subsample both chroma components by 2 in each dimension, i.e., 4:2:0 format, and the whole sequence temporally by 4, the bit rate becomes 9.1 Mbits/s. However, to transmit the video over a 64 Kbits/s channel, it is necessary to compress the subsampled sequence by another factor of 143. To achieve such high compression ratios, we must tolerate some distortion in the subsampled frames.
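The figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes 8 bits per sample and three color components per pixel before chroma subsampling; the function and variable names are illustrative only.

```python
# Raw bit rate of a CIF color source and the compression factors needed
# for a 64 Kbits/s channel. Assumption: 8 bits per sample.
WIDTH, HEIGHT = 352, 288
BITS_PER_SAMPLE = 8
CHANNEL_BPS = 64_000

def raw_bit_rate(frame_rate, chroma_fraction):
    """Bits per second. chroma_fraction is the number of samples kept in
    each chroma plane relative to the luma plane (1.0 for no subsampling,
    0.25 for 4:2:0)."""
    samples_per_frame = WIDTH * HEIGHT * (1 + 2 * chroma_fraction)
    return samples_per_frame * BITS_PER_SAMPLE * frame_rate

full = raw_bit_rate(30, 1.0)       # full chroma resolution, 30 frame/s
sub = raw_bit_rate(30 / 4, 0.25)   # 4:2:0, temporally subsampled by 4

print(round(full / 1e6, 1), round(full / CHANNEL_BPS))
print(round(sub / 1e6, 1), round(sub / CHANNEL_BPS))
```

Under these assumptions the two lines reproduce the 73 Mbits/s and 9.1 Mbits/s rates and the factors of 1140 and 143.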
Compression can be either lossless (reversible) or lossy (irreversible). A compression algorithm is lossless if the signal can be reconstructed exactly from the compressed information; otherwise it is lossy. The compression performance of any lossy algorithm is usually described in terms of its rate-distortion curve, which represents the potential trade-off between the bit rate and the distortion associated with the lossy representation. The primary goal of any lossy compression algorithm is to optimize the rate-distortion curve over some range of rates or levels of distortion. For video applications, rate is usually expressed in terms of bits per second. The distortion is usually expressed in terms of the peak-signal-to-noise ratio (PSNR) per frame or, in some cases, measures that try to quantify the subjective nature of the distortion.
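As an illustration of the distortion measure, the sketch below computes the PSNR of one decoded frame using the common definition 10 log10(peak² / MSE). The peak value of 255 assumes 8-bit samples, and the function name and toy frames are hypothetical.

```python
import math

def psnr(original, decoded, peak=255):
    """PSNR in dB between two equally sized frames given as lists of rows.
    Assumes 8-bit samples unless another peak value is supplied."""
    squared_error = 0
    count = 0
    for row_o, row_d in zip(original, decoded):
        for o, d in zip(row_o, row_d):
            squared_error += (o - d) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical frames: no distortion
    return 10 * math.log10(peak ** 2 / mse)

# Toy 2 x 2 frame in which every sample is off by one gray level (MSE = 1).
frame = [[100, 110], [120, 130]]
lossy = [[101, 109], [121, 129]]
value = psnr(frame, lossy)
```

For a sequence, the per-frame values are typically averaged or plotted against the frame number.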
In addition to good compression performance, many other properties may be important or even critical to the applicability of a given compression algorithm. Such properties include robustness to errors in the compressed bit stream, low complexity encoders and decoders, low latency requirements, and scalability. Developing scalable video compression algorithms has attracted considerable attention in recent years. Generally speaking, scalability refers to the potential to effectively decompress subsets of the compressed bit stream in order to satisfy some practical constraint, e.g., display resolution, decoder computational complexity, and bit rate limitations.
The demand for compatible video encoders and decoders has resulted in the development of different video compression standards. The International Organization for Standardization (ISO) has developed MPEG-1 to store video on compact discs, MPEG-2 for digital television, and MPEG-4 for a wide range of applications including multimedia. The International Telecommunication Union (ITU) has developed H.261 for video conferencing and H.263 for video telephony.
All existing video compression standards are hybrid systems. That is, the compression is achieved in two main stages. The first stage, motion compensation and estimation, predicts each frame from its neighboring frames, compresses the prediction parameters, and produces the prediction error frame. The second stage codes the prediction error. All existing standards use a block-based discrete cosine transform (DCT) to code the residual error. In addition to the DCT, other non-block-based coders, e.g., wavelets and matching pursuit, can be used.
In this chapter, we will provide an overview of hybrid video coding systems. In Section 55.2, we discuss the main parts of a hybrid video coder. This includes motion compensation, signal decompositions and transformations, quantization, and entropy coding. We compare various transformations such as DCT, subband, and matching pursuit. In Section 55.3, we discuss scalability and error resilience in video compression systems. We also describe a non-hybrid video coder that provides scalable bit-streams [28]. Finally, in Section 55.4, we review the key video compression standards: H.261, H.263, MPEG-1, MPEG-2, and MPEG-4.
55.2 Motion Compensated Video Coding
Virtually all video compression systems identify and reduce four basic types of video data redundancy: inter-frame (temporal) redundancy, interpixel redundancy, psychovisual redundancy, and coding redundancy. Figure 55.1 shows a typical diagram of a hybrid video compression system. First, the current frame is predicted from previously decoded frames by estimating the motion of blocks or objects, thus reducing the inter-frame redundancy. Afterwards, to reduce the interpixel redundancy, the residual error after frame prediction is transformed to another format or domain such that the energy of the new signal is concentrated in few components and these components are as uncorrelated as possible. The transformed signal is then quantized according to the desired compression performance (subjective or objective). The quantized transform coefficients are then mapped to codewords that reduce the coding redundancy. The rest of this section will discuss the blocks of the hybrid system in more detail.
55.2.1 Motion Estimation and Compensation
Neighboring frames in typical video sequences are highly correlated. This inter-frame (temporal) redundancy can be significantly reduced to produce a more compressible sequence by predicting each frame from its neighbors. Motion compensation is a nonlinear predictive technique in which the feedback loop contains both the inverse transformation and the inverse quantization blocks, as shown in Fig. 55.1.
1999 by CRC Press LLC
FIGURE 55.1: Motion compensated coding of video.
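The purpose of placing the inverse quantizer inside the encoder's feedback loop can be illustrated with a deliberately simplified sketch. It is not the coder of Fig. 55.1: it assumes zero-motion prediction from the previous decoded frame, omits the transform and entropy coding, and uses a uniform scalar quantizer with a hypothetical step size. All function names are illustrative.

```python
def quantize(value, step):
    return round(value / step)

def dequantize(index, step):
    return index * step

def encode(frames, step):
    """Closed-loop predictive coding: each frame is predicted from the
    previous *decoded* frame (zero motion assumed), and the residual is
    quantized."""
    reference = [0] * len(frames[0])
    stream = []
    for frame in frames:
        indices = [quantize(x - p, step) for x, p in zip(frame, reference)]
        stream.append(indices)
        # Feedback loop: rebuild the frame exactly as the decoder will,
        # so encoder and decoder predictions never drift apart.
        reference = [p + dequantize(q, step) for p, q in zip(reference, indices)]
    return stream

def decode(stream, step):
    reference = [0] * len(stream[0])
    decoded = []
    for indices in stream:
        reference = [p + dequantize(q, step) for p, q in zip(reference, indices)]
        decoded.append(reference)
    return decoded

# Three tiny one-dimensional "frames".
frames = [[10, 20, 30], [12, 19, 33], [15, 18, 37]]
stream = encode(frames, step=4)
decoded = decode(stream, step=4)
max_error = max(abs(x - d) for f, g in zip(frames, decoded) for x, d in zip(f, g))
```

Because the encoder predicts from reconstructed rather than original frames, the error in every decoded frame stays within half a quantizer step instead of accumulating over time.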
Most motion compensation techniques divide the frame into regions, e.g., blocks. Each region is then predicted from the neighboring frames. The displacement of the block or region, $\vec{d}$, is not fixed and must be encoded as side information in the bit stream. In some cases, different prediction models are used to predict regions, e.g., affine transformations. These prediction parameters should also be encoded in the bit stream.
To minimize the amount of side information, which must be included in the bit stream, and to simplify the encoding process, motion estimation is usually block based. That is, every pixel $\vec{i}$ in a given rectangular block is assigned the same motion vector, $\vec{d}$. Block-based motion estimation is an integral part of all existing video compression standards.
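A minimal sketch of block-based motion estimation follows. It assumes an exhaustive (full) search over integer displacements and the sum of absolute differences (SAD) as the matching criterion; neither choice is prescribed by the text, and the function names, block size, and search range are hypothetical. The returned vector points from the block in the current frame to its best match in the reference frame.

```python
def sad(cur, ref, top, left, dy, dx, size):
    """Sum of absolute differences between a block of the current frame
    and the block of the reference frame displaced by (dy, dx)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[top + y][left + x] - ref[top + y + dy][left + x + dx])
    return total

def estimate_motion(cur, ref, top, left, size, search):
    """Full search within +/- search pixels; returns the best (dy, dx)."""
    height, width = len(ref), len(ref[0])
    best, best_cost = (0, 0), sad(cur, ref, top, left, 0, 0, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip displacements that leave the reference frame.
            if not (0 <= top + dy and top + dy + size <= height):
                continue
            if not (0 <= left + dx and left + dx + size <= width):
                continue
            cost = sad(cur, ref, top, left, dy, dx, size)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best

# Toy example: the current frame is the reference shifted right by 2 pixels.
ref = [[(3 * y + 7 * x * x) % 251 for x in range(16)] for y in range(16)]
cur = [[ref[y][max(x - 2, 0)] for x in range(16)] for y in range(16)]
motion_vector = estimate_motion(cur, ref, top=4, left=4, size=4, search=3)
```

Every pixel of the 4 × 4 block shares the single vector found by the search, which is the only side information the block contributes.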
55.2.2 Transformations
Most image and video compression schemes apply a transformation to the raw pixels or to the residual error resulting from motion compensation before quantizing and coding the resulting coefficients. The function of the transformation is to represent the signal in a few uncorrelated components. The most common transformations are linear transformations, i.e., the multi-dimensional sequence of input pixel values, $f[\vec{i}]$, is represented in terms of the transform coefficients, $t[\vec{k}]$, via
$$f[\vec{i}] = \sum_{\vec{k}} t[\vec{k}] \, w_{\vec{k}}[\vec{i}] \qquad (55.1)$$
for some $w_{\vec{k}}[\vec{i}]$. The input image is thus represented as a linear combination of basis vectors, $w_{\vec{k}}$.
It is important to note that the basis vectors need not be orthogonal. They only need to form an over-complete set (matching pursuits), a complete set (DCT and some subband decompositions), or very close to complete (some subband decompositions). This is important since the coder should be able to code a variety of signals. The remainder of the section discusses and compares DCT, subband decompositions, and matching pursuits.
The DCT
There are two properties desirable in a unitary transform for image compression: the energy should be packed into a few transform coefficients, and the coefficients should be as uncorrelated as possible. The optimum transform under these two constraints is the Karhunen-Loève transform (KLT), where the eigenvectors of the covariance matrix of the image are the vectors of the transform [10]. Although the KLT is optimal under these two constraints, it is data-dependent and is expensive to compute. The discrete cosine transform (DCT) performs very close to the KLT, especially when the input is a first order Markov process [10].
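The DCT is a block-based transform. That is, the signal is divided into blocks, which are independently transformed using orthonormal discrete cosines. The DCT coefficients of a one-dimensional signal, $f$, are computed via
$$t[Nb + k] = \begin{cases} \sqrt{\dfrac{1}{N}} \displaystyle\sum_{i=0}^{N-1} f[Nb + i], & k = 0 \\ \sqrt{\dfrac{2}{N}} \displaystyle\sum_{i=0}^{N-1} f[Nb + i] \cos\dfrac{(2i+1)k\pi}{2N}, & 1 \le k < N \end{cases} \qquad (55.2)$$
where $N$ is the size of the block and $b$ denotes the block number.
The orthonormal basis vectors associated with the one-dimensional DCT transformation of Eq. (55.2) are
$$w_{Nb+k}[i] = \begin{cases} \sqrt{\dfrac{1}{N}}, & k = 0,\ Nb \le i < N(b+1) \\ \sqrt{\dfrac{2}{N}} \cos\dfrac{(2(i - Nb)+1)k\pi}{2N}, & 1 \le k < N,\ Nb \le i < N(b+1) \\ 0, & \text{otherwise} \end{cases}$$
Figure 55.2(a) shows these basis vectors for $N = 8$.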
FIGURE 55.2: DCT basis vectors (N = 8): (a) one-dimensional and (b) separable two-dimensional.
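The following sketch builds the orthonormal one-dimensional DCT basis for a single block and uses it for analysis and synthesis. It assumes the standard orthonormal DCT-II normalization (scale factor $\sqrt{1/N}$ for $k = 0$ and $\sqrt{2/N}$ otherwise); the function names and the sample block are illustrative only.

```python
import math

def dct_basis(N):
    """Rows are the N orthonormal DCT-II basis vectors w_k[i], 0 <= i < N."""
    basis = []
    for k in range(N):
        scale = math.sqrt((1 if k == 0 else 2) / N)
        basis.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * N))
                      for i in range(N)])
    return basis

def dct(block, basis):
    """Transform coefficients t[k] = <f, w_k>."""
    return [sum(w * f for w, f in zip(w_k, block)) for w_k in basis]

def idct(coeffs, basis):
    """Reconstruction f[i] = sum_k t[k] w_k[i], as in Eq. (55.1)."""
    N = len(coeffs)
    return [sum(coeffs[k] * basis[k][i] for k in range(N)) for i in range(N)]

W = dct_basis(8)
block = [52, 55, 61, 66, 70, 61, 64, 73]
t = dct(block, W)
restored = idct(t, W)
```

Because the basis is orthonormal, the same vectors serve for both the forward and the inverse transform, and for a smooth block like this one most of the energy falls in the first coefficient.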
The one-dimensional DCT described above is usually separably extended to two dimensions for image compression applications. In this case, the two-dimensional basis vectors are formed by the tensor product of one-dimensional DCT basis vectors and are given by
$$w_{k_1,k_2}[i_1, i_2] = w_{k_1}[i_1] \, w_{k_2}[i_2].$$
Figure 55.2(b) shows the two-dimensional basis vectors for $N = 8$.
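The separable construction can be sketched as an outer product of two one-dimensional basis vectors. As before, the orthonormal DCT-II normalization is an assumption, and the helper names are hypothetical.

```python
import math

def basis_1d(N):
    """Orthonormal one-dimensional DCT-II basis vectors, one per row."""
    return [[math.sqrt((1 if k == 0 else 2) / N)
             * math.cos((2 * i + 1) * k * math.pi / (2 * N))
             for i in range(N)] for k in range(N)]

def basis_2d(N, k1, k2):
    """Two-dimensional basis vector: tensor product of w_k1 and w_k2."""
    w = basis_1d(N)
    return [[w[k1][i1] * w[k2][i2] for i2 in range(N)] for i1 in range(N)]

def inner(a, b):
    """Inner product of two N x N arrays."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

w12 = basis_2d(8, 1, 2)
w21 = basis_2d(8, 2, 1)
```

The two-dimensional vectors inherit orthonormality from the one-dimensional ones, so a two-dimensional transform can be computed by applying the one-dimensional transform along rows and then along columns.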
The DCT is the most common transform in video compression. It is used in the JPEG still image compression standard and in all existing video compression standards. This is because it performs reasonably well at different bit rates. Moreover, there are fast algorithms and special hardware chips to compute the DCT efficiently.
The major objection to the DCT in image or video compression applications is that the non-overlapping blocks of basis vectors, $w_{\vec{k}}$, are responsible for distinctly “blocky” artifacts in the decompressed frames, especially at low bit rates. This is due to the quantization of the transform coefficients of a block independently of neighboring blocks. Overlapped DCT representation addresses this problem [15]; however, the common solution is to post-process the frame by smoothing the block boundaries [18, 22].
Due to bit rate restrictions, some blocks are only represented by one or a small number of coarsely quantized transform coefficients, hence the decompressed block will only consist of these basis vectors. This will cause artifacts commonly known as ringing and mosquito noise.
Figure 55.8(b) shows frame 250 of the 15 frame/s CIF Coast-guard sequence coded at 112 Kbits/s using a DCT hybrid video coder.¹ This figure provides a good illustration of the “blocking” artifacts.
Subband Decomposition
The basic idea of subband decomposition is to split the frequency spectrum of the image into (disjoint) subbands. This is efficient when the image spectrum is not flat and is concentrated in a few subbands, which is usually the case. Moreover, we can quantize the subbands differently according to their visual importance.
As for the DCT, we begin our discussion of subband decomposition by considering only a one-dimensional source sequence, $f[i]$. Figure 55.3 provides a general illustration of an $N$-band one-dimensional subband system. We refer to the subband decomposition itself as analysis and to the inverse transformation as synthesis. The transformation coefficients of bands $1, 2, \ldots, N$ are denoted by the sequences $u_1[k], u_2[k], \ldots, u_N[k]$, respectively. For notational convenience and consistency with the DCT formulation above, we write $t_{SB}[\cdot]$ for the sequence of all subband coefficients, arranged according to $t_{SB}[(\beta - 1) + Nk] = u_\beta[k]$, where $1 \le \beta \le N$ is the subband number. These coefficients are generated by filtering the input sequence with filters $H_1, \ldots, H_N$ and downsampling the filtered sequences by a factor of $N$, as depicted in Fig. 55.3. In subband synthesis, the coefficients for each band are upsampled, interpolated with the synthesis filters, $G_1, \ldots, G_N$, and the results summed to form a reconstructed sequence, $\tilde{f}[i]$, as depicted in Fig. 55.3.
FIGURE 55.3: 1D, $N$-band subband analysis and synthesis block diagrams. (Source: Taubman, D., Chang, E., and Zakhor, A., Directionality and scalability in subband image and video compression, in Image Technology: Advances in Image Processing, Multimedia, and Machine Vision, Jorge L.C. Sanz, Ed., Springer-Verlag, New York, 1996. With permission.)
¹ It is coded using H.263 [3], which is an ITU standard.
If the reconstructed sequence, $\tilde{f}[i]$, and the source sequence, $f[i]$, are identical, then the subband system is referred to as perfect reconstruction (PR) and the corresponding basis set is a complete basis set. Although perfect reconstruction is a desirable property, near perfect reconstruction (NPR), for which subband synthesis is only approximately the inverse of subband analysis, is often sufficient in practice. This is because distortion introduced by quantization of the subband coefficients, $t_{SB}[k]$, usually dwarfs that introduced by an imperfect synthesis system.
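Perfect reconstruction can be demonstrated with the simplest two-band system. The sketch below uses the Haar filter pair, chosen here only because it is short; it is not the filter set of any figure in this chapter, and unlike longer subband filters its basis vectors do not overlap. Function names are illustrative.

```python
import math

S = math.sqrt(0.5)

def analysis(f):
    """Two-band Haar analysis: filter and downsample by 2.
    Assumes an even-length input sequence."""
    low = [S * (f[2 * k] + f[2 * k + 1]) for k in range(len(f) // 2)]
    high = [S * (f[2 * k] - f[2 * k + 1]) for k in range(len(f) // 2)]
    return low, high

def synthesis(low, high):
    """Upsample both bands, apply the synthesis filters, and sum."""
    f = []
    for u1, u2 in zip(low, high):
        f.append(S * (u1 + u2))
        f.append(S * (u1 - u2))
    return f

f = [4.0, 6.0, 10.0, 12.0, 8.0, 2.0]
low, high = analysis(f)
restored = synthesis(low, high)
```

With unquantized coefficients the synthesis output matches the input to within rounding error; in a coder, the low and high bands would be quantized separately before synthesis.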
The filters, $H_1, \ldots, H_N$, are usually designed to have band-pass frequency responses, as indicated in Fig. 55.4, so that the coefficients $u_\beta[k]$ for each subband, $1 \le \beta \le N$, represent different spectral components of the source sequence.
FIGURE 55.4: Typical analysis filter magnitude responses. (Source: Taubman, D., Chang, E., and Zakhor, A., Directionality and scalability in subband image and video compression, in Image Technology: Advances in Image Processing, Multimedia, and Machine Vision, Jorge L.C. Sanz, Ed., Springer-Verlag, New York, 1996. With permission.)
The basis vectors for subband decomposition are the $N$-translates of the impulse responses, $g_1[i], \ldots, g_N[i]$, of synthesis filters $G_1, \ldots, G_N$. Specifically, denoting the $k$th basis vector associated with subband $\beta$ by $w^{SB}_{Nk+\beta-1}$, we have
$$w^{SB}_{Nk+\beta-1}[i] = g_\beta[i - Nk] \qquad (55.4)$$
Figure 55.5 illustrates five of the basis vectors for a particularly simple, yet useful, two-band PR subband decomposition, with symmetric FIR analysis and synthesis impulse responses. As shown in Fig. 55.5 and in contrast with the DCT basis vectors, the subband basis vectors overlap.
As for the DCT, one-dimensional subband decompositions may be separably extended to higher dimensions. By this we mean that a one-dimensional subband decomposition is first applied along one dimension of an image or video sequence. Any or all of the resulting subbands are then further decomposed into subbands along another dimension, and so on. Figure 55.6 depicts a separable two-dimensional subband system. For video compression applications, the prediction error is sometimes decomposed into subbands of equal size.
Two-dimensional subband decompositions have the advantage that they do not suffer from the disturbing blocking artifacts exhibited by the DCT at high compression ratios. Instead, the most noticeable quantization-induced distortion tends to be “ringing” or “rippling” artifacts, which become most bothersome in the vicinity of image edges. Figures 55.11(c) and 55.8(c) clearly show this effect. Figure 55.11 shows frame 210 of the Ping-pong sequence compressed using a scalable, three-dimensional subband coder [28] at 1.5 Mbits/s, 300 Kbits/s, and 60 Kbits/s. As the bit rate decreases, we notice loss of detail and introduction of more ringing noise. Figure 55.8(c) shows frame 250 of the Coast-guard sequence compressed at 112 Kbits/s using a zerotree scalable coder [16]. The edges of the trees and the boat are affected by ringing noise.
FIGURE 55.5: Subband basis vectors with $N = 2$, h1[−2 2] = √2 · 8). $h_i$ and $g_i$ are the impulse responses of the $H_i$ (analysis) and $G_i$ (synthesis) filters, respectively. (Source: Taubman, D., Chang, E., and Zakhor, A., Directionality and scalability in subband image and video compression, in Image Technology: Advances in Image Processing, Multimedia, and Machine Vision, Jorge L.C. Sanz, Ed., Springer-Verlag, New York, 1996. With permission.)
Matching Pursuit
Representing a signal using an over-complete basis set implies that there is more than one representation for the signal. For coding purposes, we are interested in representing the signal with the fewest basis vectors. This is an NP-complete problem [14]. Different approaches have been investigated to find or approximate the solution. Matching pursuits is a multistage algorithm, which in each stage finds the basis vector that minimizes the mean-squared-error [14].
Suppose we want to represent a signal $f[i]$ using basis vectors from an over-complete dictionary (basis set) $G$. Individual dictionary vectors can be denoted as:
$$w_\gamma[i] \in G. \qquad (55.5)$$
Here $\gamma$ is an indexing parameter associated with a particular dictionary element. The decomposition begins by choosing $\gamma$ to maximize the absolute value of the following inner product:
$$t = \langle f[i], w_\gamma[i] \rangle, \qquad (55.6)$$
where $t$ is the transform (expansion) coefficient. A residual signal is computed as:
$$R[i] = f[i] - t \, w_\gamma[i]. \qquad (55.7)$$
This residual signal is then expanded in the same way as the original signal. The procedure continues iteratively until either a set number of expansion coefficients are generated or some energy threshold for the residual is reached. Each stage $k$ yields a dictionary structure specified by $\gamma_k$, an expansion coefficient $t[k]$, and a residual $R_k$, which is passed on to the next stage. After a total of $M$ stages, the signal can be approximated by a linear function of the dictionary elements:
$$\hat{f}[i] = \sum_{k=1}^{M} t[k] \, w_{\gamma_k}[i]. \qquad (55.8)$$
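The iteration of Eqs. (55.6) to (55.8) can be sketched for a one-dimensional signal as follows. The sketch assumes unit-norm dictionary vectors, so that subtracting $t \, w_\gamma$ removes exactly the projection onto the chosen vector, and it stops after a fixed number of stages. The small dictionary and the function names are hypothetical.

```python
import math

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def matching_pursuit(f, dictionary, stages):
    """Greedy expansion of f over unit-norm dictionary vectors.
    Returns ([(gamma_k, t_k), ...], final residual)."""
    residual = list(f)
    atoms = []
    for _ in range(stages):
        products = [inner(residual, w) for w in dictionary]
        # Eq. (55.6): pick the index with the largest |inner product|.
        gamma = max(range(len(dictionary)), key=lambda g: abs(products[g]))
        t = products[gamma]
        # Eq. (55.7): subtract the chosen component from the residual.
        residual = [r - t * w for r, w in zip(residual, dictionary[gamma])]
        atoms.append((gamma, t))
    return atoms, residual

# Hypothetical over-complete dictionary for length-4 signals:
# four unit impulses plus two normalized two-sample "bumps".
r = math.sqrt(0.5)
dictionary = [
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
    [r, r, 0, 0], [0, 0, r, r],
]
f = [3.0, 3.0, 0.0, 1.0]
atoms, residual = matching_pursuit(f, dictionary, stages=2)

# Eq. (55.8): rebuild the approximation from the selected elements.
approx = [sum(t * dictionary[g][i] for g, t in atoms) for i in range(len(f))]
```

In this toy case the first stage selects the two-sample bump that covers the first two samples and the second stage selects the impulse at the last sample, so two coefficients represent the signal that would need three impulses.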
FIGURE 55.6: Separable spatial subband pyramid. Two level analysis system configuration and subband passbands shown. (Source: Taubman, D., Chang, E., and Zakhor, A., Directionality and scalability in subband image and video compression, in Image Technology: Advances in Image Processing, Multimedia, and Machine Vision, Jorge L.C. Sanz, Ed., Springer-Verlag, New York, 1996. With permission.)
The above technique has useful signal representation properties. For example, the dictionary element chosen at each stage is the element that provides the greatest reduction in mean square error between the true signal $f[i]$ and the coded signal $\hat{f}[i]$. In this sense, the signal structures are coded in order of importance, which is desirable in situations where the bit budget is limited. For image and video coding applications, this means that the most visible features tend to be coded first. Weaker image features are coded later, if at all. It is even possible to control which types of image features are coded well by choosing dictionary functions to match the shape, scale, or frequency of the desired features.
An interesting feature of the matching pursuit technique is that it places very few restrictions on the dictionary set. The original Mallat and Zhang paper considers both Gabor and wave-packet function dictionaries, but such structure is not required by the algorithm itself [14]. Mallat and Zhang showed that if the dictionary set is at least complete, then $\hat{f}[i]$ will eventually converge to $f[i]$, though the rate of convergence is not guaranteed [14]. Convergence speed and thus coding efficiency are strongly related to the choice of dictionary set. However, true dictionary optimization can be difficult because there are so few restrictions. Any collection of arbitrarily sized and shaped functions can be used with matching pursuits, as long as completeness is satisfied.
Bergeaud and Mallat used the matching pursuit technique to represent and process images [1]. Neff and Zakhor have used the matching pursuit technique to code the motion prediction error signal [20]. Their coder divides each motion residual into blocks and measures the energy of each block. The center of the block with the largest energy value is adopted as an initial estimate for the inner product search. A dictionary of Gabor basis vectors, shown in Fig. 55.7, is then exhaustively matched to an $S \times S$ window around the initial estimate. The exhaustive search can be thought of as follows. Each $N \times N$ dictionary structure is centered at each location in the search window, and the inner product between the structure and the corresponding $N \times N$ region of image data is computed. The largest inner product is then quantized. The location, basis vector index, and quantized inner product are then coded together.
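The first step of the search described above, locating the highest-energy block of the motion residual, might look like the following sketch. It covers only the initial estimate, not the subsequent exhaustive dictionary search; the block size, the use of non-overlapping tiles, and the function names are assumptions made for illustration.

```python
def block_energies(residual, size):
    """Map (top, left) -> sum of squared samples for each size x size tile."""
    energies = {}
    for top in range(0, len(residual) - size + 1, size):
        for left in range(0, len(residual[0]) - size + 1, size):
            energies[(top, left)] = sum(
                residual[top + y][left + x] ** 2
                for y in range(size) for x in range(size)
            )
    return energies

def initial_estimate(residual, size):
    """Center (row, col) of the tile with the largest energy."""
    energies = block_energies(residual, size)
    top, left = max(energies, key=energies.get)
    return top + size // 2, left + size // 2

# Toy 8 x 8 motion residual with most of its energy in the lower-left tile.
residual = [[0] * 8 for _ in range(8)]
residual[5][2] = 9
residual[6][3] = -7
residual[1][6] = 4
center = initial_estimate(residual, size=4)
```

The returned position would then serve as the center of the $S \times S$ window in which every dictionary structure is tried at every location.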
Video sequences coded using matching pursuit do not suffer from either blocking or ringing artifacts, because the basis vectors are only coded when they are well-matched to the residual signal. As bit rate decreases, the distortion introduced by matching pursuit coding takes the form of a gradually increasing blurriness (or loss of detail). Since matching pursuits involves exhaustive search, it is more complex than DCT approaches, especially at high bit rates.
FIGURE 55.7: Separable two-dimensional 20 × 20 Gabor dictionary.
Figure 55.8(d) shows frame 250 of the 15 frame/s CIF Coast-guard sequence coded at 112 Kbits/s using the matching pursuit video coder described by Neff and Zakhor [20]. This frame does not suffer from the blocky artifacts, which affect the DCT coders as shown in Fig. 55.8(b). Moreover, it does not suffer from the ringing noise, which affects the subband coders as shown in Figs. 55.8(c) and 55.11(c).
55.2.3 Discussion
Figure 55.8 shows frame 250 of the 15 frame/s CIF Coast-guard sequence coded at 112 Kbits/s using DCT, subband, and matching pursuit coders. The DCT coded frame suffers from blocking artifacts. The subband coded frame suffers from ringing artifacts.
Figure 55.9 compares the PSNR performance of the matching pursuit coder [20] to a DCT (H.263) coder [3] and a zerotree subband coder [16] when coding the Coast-guard sequence at 112 Kbits/s. The matching pursuit coder [20] in this example has consistently higher PSNR than the H.263 [3] and the zerotree subband [16] coders. Table 55.1 shows the average luminance PSNRs for different sequences at different bit rates. In all examples mentioned in Table 55.1, the matching pursuit coder has higher average PSNR than the DCT coder. The subband coder has the lowest average PSNR.
TABLE 55.1 The Average Luminance PSNR of Different Sequences at Different Bit Rates When Coding Using a DCT Coder (H.263) [3], Zero-Tree Subband Coder (ZTS) [16], and Matching Pursuit Coder (MP) [20]
Sequence          Format   Bit rate   Frame rate   DCT     ZTS     MP
Container-ship    QCIF     10 K       7.5          29.43   28.01   31.10
Hall-Monitor      QCIF     10 K       7.5          30.04   28.44   31.27
Mother-Daughter   QCIF     10 K       7.5          32.50   31.07   32.78
Container-ship    QCIF     24 K       10.0         32.77   30.44   34.26
Silent-Voice      QCIF     24 K       10.0         30.89   29.41   31.71
Mother-Daughter   QCIF     24 K       10.0         35.17   33.77   35.55
Coast-Guard       QCIF     48 K       10.0         29.00   27.65   29.82