In this chapter we give an overview of practical issues related to the design of software or hardware implementations of the coding standards. The design of each of the main functional blocks of a CODEC (such as motion estimation, transform and entropy coding) can have a significant impact on computational efficiency and compression performance. We discuss the interfaces to a video encoder and decoder and the value of video pre-processing to reduce input noise and post-processing to minimise coding artefacts.
Comparing the performance of video coding algorithms is a difficult task, not least because decoded video quality is dependent on the input video material and is inherently subjective. We compare the subjective and objective (PSNR) coding performance of MPEG-4 Visual and H.264 reference model encoders using selected test video sequences. Compression performance often comes at a computational cost and we discuss the computational performance requirements of the two standards.

The compressed video data produced by an encoder is typically stored or transmitted across a network. In many practical applications, it is necessary to control the bitrate of the encoded data stream in order to match the available bitrate of a delivery mechanism. We discuss practical bitrate control and network transport issues.
7.2 FUNCTIONAL DESIGN
Figures 3.51 and 3.52 show typical structures for a motion-compensated transform-based video encoder and decoder. A practical MPEG-4 Visual or H.264 CODEC is required to implement some or all of the functions shown in these figures (even if the CODEC structure is different
from that shown). Conforming to the MPEG-4/H.264 standards, whilst maintaining good compression and computational performance, requires careful design of the CODEC functional blocks. The goal of a functional block design is to achieve good rate/distortion performance (see Section 7.4.3) whilst keeping computational overheads to an acceptable level.
Functions such as motion estimation, transforms and entropy coding can be highly computationally intensive. Many practical platforms for video compression are power-limited or computation-limited and so it is important to design the functional blocks with these limitations in mind. In this section we discuss practical approaches and tradeoffs in the design of the main functional blocks of a video CODEC.
7.2.1 Segmentation
The object-based functionalities of MPEG-4 (Core, Main and related profiles) require a video
scene to be segmented into objects. Segmentation methods usually fall into three categories:
1. Manual segmentation: this requires a human operator to identify manually the borders of each object in each source video frame, a very time-consuming process that is obviously only suitable for 'offline' video content (video data captured in advance of coding and transmission). This approach may be appropriate, for example, for segmentation of an important visual object that may be viewed by many users and/or re-used many times in different composed video sequences.
2. Semi-automatic segmentation: a human operator identifies objects and perhaps object boundaries in one frame; a segmentation algorithm refines the object boundaries (if necessary) and tracks the video objects through successive frames of the sequence.
3. Fully-automatic segmentation: an algorithm attempts to carry out a complete segmentation of a visual scene without any user input, based on (for example) spatial characteristics such as edges and temporal characteristics such as object motion between frames.
Semi-automatic segmentation [1, 2] has the potential to give better results than fully-automatic segmentation but still requires user input. Many algorithms have been proposed for automatic segmentation [3, 4]. In general, better segmentation performance can be achieved at the expense of greater computational complexity. Some of the more sophisticated segmentation algorithms require significantly more computation than the video encoding process itself. Reasonably accurate segmentation performance can be achieved by spatio-temporal approaches (e.g. [3]) in which a coarse approximate segmentation is formed based on spatial information and is then refined as objects move. Excellent segmentation results can be obtained in controlled environments (for example, if a TV presenter stands in front of a blue background) but the results for practical scenarios are less robust.
The output of a segmentation process is a sequence of mask frames for each VO, each frame containing a binary mask for one VOP (e.g. Figure 5.30) that determines the processing of MBs and blocks and is coded as a BAB in each boundary MB position.
7.2.2 Motion Estimation
Motion estimation is the process of selecting an offset to a suitable reference area in a previously coded frame (see Chapter 3). Motion estimation is carried out in a video encoder (not in a decoder) and has a significant effect on CODEC performance.
Figure 7.1 Current block (white border): a 32 × 32 block in the current frame
A good choice of prediction reference minimises the energy in the motion-compensated residual, which in turn maximises compression performance. However, finding the 'best' offset can be a very computationally intensive procedure.
The offset between the current region or block and the reference area (the motion vector) may be constrained by the semantics of the coding standard. Typically, the reference area is constrained to lie within a rectangle centred upon the position of the current block or region. Figure 7.1 shows a 32 × 32-sample block (outlined in white) that is to be motion-compensated. Figure 7.2 shows the same block position in the previous frame (outlined in white) and a larger square extending ±7 samples around the block position in each direction. The motion vector may 'point' to any reference area within the larger square (the search area). The goal of a practical motion estimation algorithm is to find a vector that minimises the residual energy after motion compensation, whilst keeping the computational complexity within acceptable limits. The choice of algorithm depends on the platform (e.g. software or hardware) and on whether motion estimation is block-based or region-based.
7.2.2.1 Block Based Motion Estimation
Energy Measures
Motion compensation aims to minimise the energy of the residual transform coefficients after quantisation. The energy in a transformed block depends on the energy in the residual block (prior to the transform). Motion estimation therefore aims to find a 'match' to the current block or region that minimises the energy in the motion-compensated residual (the difference between the current block and the reference area). This usually involves evaluating the residual energy at a number of different offsets. The choice of measure for 'energy' affects computational complexity and the accuracy of the motion estimation process. Equations 7.1, 7.2 and 7.3 describe three energy measures, MSE, MAE and SAE.
Figure 7.2 Search region in the previous (reference) frame
The motion compensation block size is N × N samples; $C_{ij}$ and $R_{ij}$ are the current and reference area samples respectively:
1. Mean Squared Error:

$$\mathrm{MSE} = \frac{1}{N^2}\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\left(C_{ij}-R_{ij}\right)^2 \qquad (7.1)$$

2. Mean Absolute Error:

$$\mathrm{MAE} = \frac{1}{N^2}\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\left|C_{ij}-R_{ij}\right| \qquad (7.2)$$

3. Sum of Absolute Errors:

$$\mathrm{SAE} = \sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\left|C_{ij}-R_{ij}\right| \qquad (7.3)$$
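As a concrete illustration (not taken from the standards or the reference software), the following C sketch evaluates SAE (equation 7.3) for an N × N block at a candidate offset; the frame layout, a simple array of 8-bit samples with a given stride, is an assumption made for this example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of Absolute Errors (equation 7.3) between the current N x N block
 * and a candidate reference area. 'stride' is the frame width in samples;
 * (dx, dy) is the candidate motion vector (integer samples). */
int sae_block(const uint8_t *cur, const uint8_t *ref,
              int stride, int n, int dx, int dy)
{
    int sae = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sae += abs(cur[i * stride + j] -
                       ref[(i + dy) * stride + (j + dx)]);
    return sae;
}
```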
SAE is probably the most widely-used measure of residual energy for reasons of computational simplicity. The H.264 reference model software [5] uses SA(T)D, the sum of absolute differences of the transformed residual data, as its prediction energy measure (for both intra and inter prediction). Transforming the residual at each search location increases computation but improves the accuracy of the energy measure. A simple multiply-free transform is used and so the extra computational cost is not excessive.
The results of the above example indicate that the best choice of motion vector is (+2, 0). The minimum of the MSE or SAE map indicates the offset that produces a minimal residual energy and this is likely to produce the smallest energy of quantised transform coefficients.
Figure 7.6 Full search (raster scan)
The motion vector itself must be transmitted to the decoder, however, and as larger vectors are coded using more bits than small-magnitude vectors (see Chapter 3) it may be useful to 'bias' the choice of vector towards (0,0). This can be achieved simply by subtracting a constant from the MSE or SAE at position (0,0). A more sophisticated approach is to treat the choice of vector as a constrained optimisation problem [6]. The H.264 reference model encoder [5] adds a cost parameter for each coded element (MVD, prediction mode, etc.) before choosing the smallest total cost of motion prediction.
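As a hedged sketch of this idea (the actual cost function and weighting used by the reference model are more involved), the total cost can be formed as distortion plus a weighted estimate of the bits needed to code the motion vector difference; mvd_bits() here is a rough, illustrative estimate rather than the real VLC table.

```c
#include <stdlib.h>

/* Rough, illustrative estimate of the bits needed to code one MVD
 * component (loosely modelled on an Exp-Golomb-style signed code). */
static int mvd_bits(int mvd)
{
    int bits = 1, v = abs(mvd);
    while (v > 0) {
        bits += 2;
        v >>= 1;
    }
    return bits;
}

/* Rate-constrained motion cost: distortion (e.g. SAE) plus lambda times
 * the estimated rate of the motion vector difference. The candidate with
 * the smallest total cost is chosen. */
int motion_cost(int sae, int mvd_x, int mvd_y, int lambda)
{
    return sae + lambda * (mvd_bits(mvd_x) + mvd_bits(mvd_y));
}
```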
It may not always be necessary to calculate SAE (or MAE or MSE) completely at each offset location. A popular shortcut is to terminate the calculation early once the previous minimum SAE has been exceeded. For example, after calculating each inner sum of equation (7.3) ($\sum_{j=0}^{N-1}\left|C_{ij}-R_{ij}\right|$), the encoder compares the total SAE so far with the previous minimum. If the total so far exceeds the previous minimum, the calculation is terminated (since there is no point in finishing the calculation if the outcome is already higher than the previous minimum SAE).
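A minimal sketch of the early-termination shortcut, using the same hypothetical frame layout as the earlier SAE function: the running total is checked against the best SAE found so far after each row of the block.

```c
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* SAE with early termination: abandon the calculation as soon as the
 * running total exceeds the smallest SAE found so far. */
int sae_early_exit(const uint8_t *cur, const uint8_t *ref,
                   int stride, int n, int dx, int dy, int best_so_far)
{
    int sae = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            sae += abs(cur[i * stride + j] -
                       ref[(i + dy) * stride + (j + dx)]);
        if (sae > best_so_far)
            return INT_MAX;   /* cannot beat the current best; stop early */
    }
    return sae;
}
```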
Full Search
Full Search motion estimation involves evaluating equation 7.3 (SAE) at each point in the search window (±S samples about position (0,0), the position of the current macroblock). Full Search estimation is guaranteed to find the minimum SAE (or MAE or MSE) in the search window but it is computationally intensive since the energy measure (e.g. equation (7.3)) must be calculated at every one of (2S + 1)^2 locations.
Figure 7.6 shows an example of a Full Search strategy. The first search location is at the top-left of the window (position [−S, −S]) and the search proceeds in raster order until all positions have been evaluated.
Figure 7.7 Full search (spiral scan)
In a typical video sequence, most motion vectors are concentrated around (0,0) and so it is likely that a minimum will be found in this region. The computation of the full search algorithm can be simplified by starting the search at (0,0) and proceeding to test points in a spiral pattern around this location (Figure 7.7). If early termination is used (see above), the SAE calculation is increasingly likely to be terminated early (thereby saving computation) as the search pattern widens outwards.
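The raster-order Full Search can be sketched as below, reusing the early-exit SAE function above; a spiral scan would visit the same (2S + 1)^2 offsets but starting from (0,0), which tends to make the early exit trigger sooner.

```c
#include <stdint.h>
#include <limits.h>

int sae_early_exit(const uint8_t *cur, const uint8_t *ref,
                   int stride, int n, int dx, int dy, int best_so_far);

/* Full Search over +/-S samples around (0,0) in raster order,
 * returning the offset with the smallest SAE. */
void full_search(const uint8_t *cur, const uint8_t *ref, int stride,
                 int n, int S, int *best_dx, int *best_dy)
{
    int best = INT_MAX;
    *best_dx = 0;
    *best_dy = 0;
    for (int dy = -S; dy <= S; dy++) {
        for (int dx = -S; dx <= S; dx++) {
            int sae = sae_early_exit(cur, ref, stride, n, dx, dy, best);
            if (sae < best) {
                best = sae;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
    }
}
```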
‘Fast’ Search Algorithms
Even with the use of early termination, Full Search motion estimation is too computationally intensive for many practical applications. In computation- or power-limited applications, so-called 'fast search' algorithms are preferable. These algorithms operate by calculating the energy measure (e.g. SAE) at a subset of locations within the search window.
The popular Three Step Search (TSS, sometimes described as N-Step Search) is illustrated in Figure 7.8. SAE is calculated at position (0,0) (the centre of the figure) and at eight locations ±2^(N−1) (for a search window of ±(2^N − 1) samples). In the figure, S is 7 and the first nine search locations are numbered '1'. The search location that gives the smallest SAE is chosen as the new search centre and a further eight locations are searched, this time at half the previous distance from the search centre (numbered '2' in the figure). Once again, the 'best' location is chosen as the new search origin and the algorithm is repeated until the search distance cannot be subdivided further. The TSS is considerably simpler than Full Search (8N + 1 searches compared with (2^(N+1) − 1)^2 searches for Full Search) but the TSS (and other fast search algorithms) do not usually perform as well as Full Search.
Figure 7.8 Three Step Search
Figure 7.9 SAE map showing several local minima
The SAE map shown in Figure 7.5 has a single minimum point and the TSS is likely to find this minimum correctly, but the SAE map for a block containing complex detail and/or different moving components may have several local minima (e.g. see Figure 7.9). Whilst the Full Search will always identify the global minimum, a fast search algorithm may become 'trapped' in a local minimum, giving a suboptimal result.
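A sketch of the Three Step Search for a ±7 window (so the initial step size is 4), built on the earlier sae_block() function; clipping of candidate positions to the search window is omitted for brevity.

```c
#include <stdint.h>

int sae_block(const uint8_t *cur, const uint8_t *ref,
              int stride, int n, int dx, int dy);

/* Three Step Search: evaluate the centre and eight surrounding points at
 * the current step size, recentre on the best, halve the step and repeat. */
void three_step_search(const uint8_t *cur, const uint8_t *ref, int stride,
                       int n, int *best_dx, int *best_dy)
{
    int cx = 0, cy = 0;
    int best = sae_block(cur, ref, stride, n, 0, 0);

    for (int step = 4; step >= 1; step /= 2) {
        int bx = cx, by = cy;
        for (int dy = -step; dy <= step; dy += step) {
            for (int dx = -step; dx <= step; dx += step) {
                if (dx == 0 && dy == 0)
                    continue;                  /* centre already evaluated */
                int sae = sae_block(cur, ref, stride, n, cx + dx, cy + dy);
                if (sae < best) {
                    best = sae;
                    bx = cx + dx;
                    by = cy + dy;
                }
            }
        }
        cx = bx;                               /* recentre on the best point */
        cy = by;
    }
    *best_dx = cx;
    *best_dy = cy;
}
```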
Figure 7.10 Nearest Neighbours Search
Many fast search algorithms have been proposed, such as Logarithmic Search, Hierarchical Search, Cross Search and One at a Time Search [7–9]. In each case, the performance of the algorithm can be evaluated by comparison with Full Search. Suitable comparison criteria are compression performance (how effective is the algorithm at minimising the motion-compensated residual?) and computational performance (how much computation is saved compared with Full Search?). Other criteria may be helpful; for example, some 'fast' algorithms such as Hierarchical Search are better suited to hardware implementation than others.
Nearest Neighbours Search [10] is a fast motion estimation algorithm that has low computational complexity but closely approaches the performance of Full Search within the framework of MPEG-4 Simple Profile. In MPEG-4 Visual, each block or macroblock motion vector is differentially encoded. A predicted vector is calculated (based on previously-coded vectors from neighbouring blocks) and the difference (MVD) between the current vector and the predicted vector is transmitted. NNS exploits this property by giving preference to vectors that are close to the predicted vector (and hence minimise MVD). First, SAE is evaluated at location (0,0). Then, the search origin is set to the predicted vector location and surrounding points in a diamond shape are evaluated (labelled '1' in Figure 7.10). The next step depends on which of the points has the lowest SAE. If the (0,0) point or the centre of the diamond has the lowest SAE, the search terminates. If a point on the edge of the diamond has the lowest SAE (the highlighted point in this example), that becomes the centre of a new diamond-shaped search pattern and the search continues. In the figure, the search terminates after the points marked '3' are searched. The inherent bias towards the predicted vector gives excellent compression performance (close to the performance achieved by full search) with low computational complexity.
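The core of the NNS idea can be sketched as follows (a simplification: the cost bias, window clipping and the exact stopping rules of [10] are omitted): evaluate (0,0), then repeatedly test a small diamond around the predicted vector, moving the diamond whenever an edge point improves on its centre.

```c
#include <stdint.h>

int sae_block(const uint8_t *cur, const uint8_t *ref,
              int stride, int n, int dx, int dy);

/* Simplified Nearest Neighbours Search: check (0,0), then a diamond
 * centred on the predicted vector, recentring while an edge point wins. */
void nns_search(const uint8_t *cur, const uint8_t *ref, int stride, int n,
                int pred_dx, int pred_dy, int *best_dx, int *best_dy)
{
    static const int diamond[4][2] = { {1, 0}, {-1, 0}, {0, 1}, {0, -1} };

    int zero_sae = sae_block(cur, ref, stride, n, 0, 0);
    int cx = pred_dx, cy = pred_dy;
    int best = sae_block(cur, ref, stride, n, cx, cy);

    for (;;) {
        int bx = cx, by = cy, improved = 0;
        for (int k = 0; k < 4; k++) {
            int dx = cx + diamond[k][0], dy = cy + diamond[k][1];
            int sae = sae_block(cur, ref, stride, n, dx, dy);
            if (sae < best) {
                best = sae;
                bx = dx;
                by = dy;
                improved = 1;
            }
        }
        if (!improved)
            break;              /* the centre of the diamond is the minimum */
        cx = bx;
        cy = by;
    }

    if (zero_sae <= best) {     /* keep the (0,0) candidate if it is at least as good */
        *best_dx = 0;
        *best_dy = 0;
    } else {
        *best_dx = cx;
        *best_dy = cy;
    }
}
```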
Sub-pixel Motion Estimation
Chapter 3 demonstrated that better motion compensation can be achieved by allowing the offset into the reference frame (the motion vector) to take fractional values rather than just integer values. For example, the woman's head will not necessarily move by an integer number of pixels from the previous frame (Figure 7.2) to the current frame (Figure 7.1).
Increased fractional accuracy (half-pixel vectors in MPEG-4 Simple Profile, quarter-pixel vectors in Advanced Simple Profile and H.264) can provide a better match and reduce the energy in the motion-compensated residual. This gain is offset against the need to transmit fractional motion vectors (which increases the number of bits required to represent motion vectors) and the increased complexity of sub-pixel motion estimation and compensation.
Sub-pixel motion estimation requires the encoder to interpolate between integer sample positions in the reference frame, as discussed in Chapter 3. Interpolation is computationally intensive, especially so for quarter-pixel interpolation because a high-order interpolation filter is required for good compression performance (see Chapter 6). Calculating sub-pixel samples for the entire search window is not usually necessary. Instead, it is sufficient to find the best integer-pixel match (using Full Search or one of the fast search algorithms discussed above) and then to search interpolated positions adjacent to this position. In the case of quarter-pixel motion estimation, first the best integer match is found; then the best half-pixel match in the immediate neighbourhood is calculated; finally the best quarter-pixel match around this half-pixel position is found.
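The integer-then-half-then-quarter refinement might be organised as in the sketch below; sae_subpel() is a hypothetical helper that interpolates the reference area at a fractional offset (expressed in quarter-sample units) and returns its SAE against the current block.

```c
#include <stdint.h>

/* Hypothetical helper: SAE at an offset given in quarter-sample units,
 * interpolating the reference frame as required. */
int sae_subpel(const uint8_t *cur, const uint8_t *ref,
               int stride, int n, int qx, int qy);

/* Refine an integer-pixel vector (ix, iy) to quarter-pixel accuracy:
 * test the eight half-pel neighbours of the best integer position, then
 * the eight quarter-pel neighbours of the best half-pel position. */
void refine_quarter_pel(const uint8_t *cur, const uint8_t *ref, int stride,
                        int n, int ix, int iy, int *qx_out, int *qy_out)
{
    int cx = ix * 4, cy = iy * 4;                /* quarter-sample units */
    int best = sae_subpel(cur, ref, stride, n, cx, cy);

    for (int step = 2; step >= 1; step /= 2) {   /* 2 = half-pel, 1 = quarter-pel */
        int bx = cx, by = cy;
        for (int dy = -step; dy <= step; dy += step) {
            for (int dx = -step; dx <= step; dx += step) {
                if (dx == 0 && dy == 0)
                    continue;
                int sae = sae_subpel(cur, ref, stride, n, cx + dx, cy + dy);
                if (sae < best) {
                    best = sae;
                    bx = cx + dx;
                    by = cy + dy;
                }
            }
        }
        cx = bx;
        cy = by;
    }
    *qx_out = cx;
    *qy_out = cy;
}
```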
7.2.2.2 Object Based Motion Estimation
Chapter 5 described the process of motion compensated prediction and reconstruction (MC/MR) of boundary MBs in an MPEG-4 Core Profile VOP. During MC/MR, transparent pixels in boundary and transparent MBs are padded prior to forming a motion compensated prediction. In order to find the optimum prediction for each MB, motion estimation should be carried out using the padded reference frame. Object-based motion estimation consists of the following steps:
1. Pad transparent pixel positions in the reference VOP, as described in Chapter 5.
2. Carry out block-based motion estimation to find the best match for the current MB in the padded reference VOP. If the current MB is a boundary MB, the energy measure should only be calculated for opaque pixel positions in the current MB (a masked-SAE sketch is shown below).
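A minimal sketch of step 2's masked energy measure, assuming the same frame layout as the earlier SAE examples and a binary alpha mask for the current block (the reference VOP is assumed to have been padded already):

```c
#include <stdint.h>
#include <stdlib.h>

/* SAE restricted to the opaque pixels of the current boundary MB.
 * 'alpha' is the binary transparency mask for the current block
 * (nonzero = opaque). */
int sae_masked(const uint8_t *cur, const uint8_t *alpha, const uint8_t *ref,
               int stride, int n, int dx, int dy)
{
    int sae = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (alpha[i * stride + j])
                sae += abs(cur[i * stride + j] -
                           ref[(i + dy) * stride + (j + dx)]);
    return sae;
}
```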
Motion estimation for arbitrary-shaped VOs is more complex than for rectangular frames (or slices/VOs). In [11] the computation and compression performance of a number of popular motion estimation algorithms are compared for the rectangular and object-based cases. Methods of padding boundary MBs using graphics co-processor functions are described in [12] and a hardware architecture for Motion Estimation, Motion Compensation and CAE shape coding is presented in [13].
7.2.3 DCT/IDCT
The Discrete Cosine Transform is widely used in image and video compression algorithms
in order to decorrelate image or residual data prior to quantisation and compression (see Chapter 3). The basic FDCT and IDCT equations (equations (3.4) and (3.5)), if implemented directly, require a large number of multiplications and additions. It is possible to exploit the
structure of the transform matrix A in order to significantly reduce computational complexity
and this is one of the reasons for the popularity of the DCT.
Direct evaluation of equation (3.4) for an 8 × 8 FDCT (where N = 8) requires 64 × 64 = 4096 multiplications and accumulations. From the matrix form (equation (3.1)) it is clear that the 2D transform can be evaluated in two stages (i.e. calculate AX and then multiply by matrix A^T, or vice versa). The 1D FDCT is given by equation (7.4), where f_i are the N input samples and F_x are the N output coefficients. Rearranging the 2D FDCT equation (equation (3.4)) shows that the 2D FDCT can be constructed from two 1D transforms (equation (7.5)). The 2D FDCT may be calculated by evaluating a 1D FDCT of each column of the input matrix (the inner transform), then evaluating a 1D FDCT of each row of the result of the first set of transforms (the outer transform). The 2D IDCT can be manipulated in a similar way (equation (7.6)). Each eight-point 1D transform takes 64 multiply/accumulate operations, giving a total of 64 × 8 × 2 = 1024 multiply/accumulate operations for an 8 × 8 FDCT or IDCT.
$$F_x = C_x\sum_{i=0}^{N-1} f_i \cos\frac{(2i+1)x\pi}{2N}, \qquad C_0 = \sqrt{1/N},\quad C_x = \sqrt{2/N}\ (x>0) \qquad (7.4)$$

$$F_{xy} = C_x\sum_{i=0}^{N-1}\left[C_y\sum_{j=0}^{N-1} f_{ij}\cos\frac{(2j+1)y\pi}{2N}\right]\cos\frac{(2i+1)x\pi}{2N} \qquad (7.5)$$

$$f_{ij} = \sum_{x=0}^{N-1} C_x\left[\sum_{y=0}^{N-1} C_y F_{xy}\cos\frac{(2j+1)y\pi}{2N}\right]\cos\frac{(2i+1)x\pi}{2N} \qquad (7.6)$$
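The separable property of equations (7.5) and (7.6) maps directly onto code. The sketch below builds an 8 × 8 FDCT from sixteen direct 1-D transforms (equation (7.4)); in a real CODEC the inner routine would be replaced by a 'fast' algorithm such as the one in Figure 7.11.

```c
#include <math.h>

#define N 8
static const double PI = 3.14159265358979323846;

/* Direct 8-point 1-D FDCT (equation 7.4): 64 multiply/accumulates. */
static void fdct_1d(const double in[N], double out[N])
{
    for (int x = 0; x < N; x++) {
        double cx = (x == 0) ? sqrt(1.0 / N) : sqrt(2.0 / N);
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += in[i] * cos((2 * i + 1) * x * PI / (2.0 * N));
        out[x] = cx * sum;
    }
}

/* 2-D FDCT as two passes of 1-D transforms (equation 7.5):
 * columns first (the inner transform), then rows (the outer transform). */
void fdct_8x8(const double block[N][N], double coef[N][N])
{
    double tmp[N][N], col_in[N], col_out[N];

    for (int j = 0; j < N; j++) {               /* transform each column */
        for (int i = 0; i < N; i++)
            col_in[i] = block[i][j];
        fdct_1d(col_in, col_out);
        for (int i = 0; i < N; i++)
            tmp[i][j] = col_out[i];
    }
    for (int i = 0; i < N; i++)                 /* then transform each row */
        fdct_1d(tmp[i], coef[i]);
}
```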
At first glance, calculating an eight-point 1-D FDCT (equation (7.4)) requires the evaluation of eight different cosine factors ($\cos\frac{(2i+1)x\pi}{2N}$ with eight values of i) for each of eight coefficient indices (x = 0 ... 7). However, the symmetries of the cosine function make it possible to combine many of these calculations into a reduced number of steps. For example, consider the calculation of F2 (from equation (7.4)):

$$F_2 = \tfrac{1}{2}\left[f_0\cos\tfrac{\pi}{8} + f_1\cos\tfrac{3\pi}{8} + f_2\cos\tfrac{5\pi}{8} + f_3\cos\tfrac{7\pi}{8} + f_4\cos\tfrac{9\pi}{8} + f_5\cos\tfrac{11\pi}{8} + f_6\cos\tfrac{13\pi}{8} + f_7\cos\tfrac{15\pi}{8}\right] \qquad (7.7)$$

Evaluating equation (7.7) would seem to require eight multiplications and seven additions (plus a scaling by a half). However, by making use of the symmetrical properties of the cosine function this can be simplified to:

$$F_2 = \tfrac{1}{2}\left[(f_0 - f_3 - f_4 + f_7)\cos\tfrac{\pi}{8} + (f_1 - f_2 - f_5 + f_6)\cos\tfrac{3\pi}{8}\right] \qquad (7.8)$$
In a similar way, F6 may be simplified to:

$$F_6 = \tfrac{1}{2}\left[(f_0 - f_3 - f_4 + f_7)\cos\tfrac{3\pi}{8} - (f_1 - f_2 - f_5 + f_6)\cos\tfrac{\pi}{8}\right]$$

The bracketed sums need only be calculated once so that F2 and F6 can be calculated using a total of eight additions and four multiplications (plus a final scaling by a half). Extending this approach to the complete 8 × 8 FDCT leads to a number of alternative 'fast' implementations such as the popular algorithm due to
Chen, Smith and Fralick [14]. The data flow through this 1D algorithm can be represented as a 'flowgraph' (Figure 7.11). In this figure, a circle indicates addition of two inputs, a square indicates multiplication by a constant and cX indicates the constant cos(Xπ/16). This algorithm requires only 26 additions or subtractions and 20 multiplications (in comparison with the 64 multiplications and 64 additions required to evaluate equation (7.4)).
Figure 7.11 is just one possible simplification of the 1D DCT algorithm. Many flowgraph-type algorithms have been developed over the years, optimised for a range of implementation requirements (e.g. minimal multiplications, minimal subtractions, etc.). Further computational gains can be obtained by direct optimisation of the 2D DCT (usually at the expense of increased implementation complexity).

Flowgraph algorithms are very popular for software CODECs where (in many cases) the best performance is achieved by minimising the number of computationally-expensive multiply operations. For a hardware implementation, regular data flow may be more important than the number of operations and so a different approach may be required. Popular hardware architectures for the FDCT/IDCT include those based on parallel multiplier arrays and distributed arithmetic [15–18].
The integer IDCT approximations specified in the H.264 standard have been designed to be suitable for fast, efficient software and hardware implementation. The original proposal for the forward and inverse transforms [19] describes alternative implementations using (i) a series of shifts and additions ('shift and add'), (ii) a flowgraph algorithm and (iii) matrix multiplications. Some platforms (for example DSPs) are better suited to 'multiply-accumulate' calculations than to 'shift and add' operations and so the matrix implementation (described in C code in [20]) may be more appropriate for these platforms.
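As an illustration of the 'shift and add' style, the sketch below applies one row of the well-known H.264 4 × 4 forward core transform (Chapter 6); quantisation and the remaining scaling are handled separately, and the same butterfly is applied to the columns of the result.

```c
#include <stdint.h>

/* One row of the H.264 4x4 forward core transform. The only 'multiply'
 * is by 2, implemented as a left shift. The full 2-D transform applies
 * this butterfly to each row and then to each column of the result;
 * the remaining scaling is folded into the quantisation stage. */
static void core_transform_row(const int x[4], int y[4])
{
    int a = x[0] + x[3];
    int b = x[1] + x[2];
    int c = x[1] - x[2];
    int d = x[0] - x[3];

    y[0] = a + b;
    y[2] = a - b;
    y[1] = (d << 1) + c;    /* 2d + c */
    y[3] = d - (c << 1);    /* d - 2c */
}
```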
7.2.3.3 Object Boundaries
In a Core or Main Profile MPEG-4 CODEC, residual coefficients in a boundary MB are coded using the 8 × 8 DCT. Figure 7.12 shows one block from a boundary MB (with the transparent pixels set to 0 and displayed as black). The entire block (including the transparent pixels) is transformed with an 8 × 8 DCT and quantised, and the reconstructed block after rescaling and inverse DCT is shown in Figure 7.13. Note that some of the formerly transparent pixels are now nonzero due to quantisation distortion (e.g. the pixel marked with a white 'cross'). The decoder discards the transparent pixels (according to the BAB transparency map) and retains the opaque pixels.

Figure 7.12 8 × 8 block in a boundary MB
Using an 8 × 8 DCT and IDCT for an irregular-shaped region of opaque pixels is not ideal because the transparent pixels contribute to the energy in the DCT coefficients and so more data is coded than is absolutely necessary. Because the transparent pixel positions are discarded by the decoder, the encoder may place any data at all in these positions prior to the DCT. Various strategies have been proposed for filling (padding) the transparent positions prior to applying the 8 × 8 DCT, for example by padding with values selected to minimise the energy in the DCT coefficients [21, 22], but choosing the optimal padding values is a computationally expensive process. A simple alternative is to pad the transparent positions in an inter-coded MB with zeros (since the motion-compensated residual is usually close to zero anyway) and to pad the transparent positions in an intra-coded MB with the value 2^(N−1), where N is the number of bits per pixel (since this is mid-way between the minimum and maximum pixel value). The Shape-Adaptive DCT (see Chapter 5) provides a more efficient solution for transforming irregular-shaped blocks but is computationally intensive and is only available in the Advanced Coding Efficiency Profile of MPEG-4 Visual.
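The simple padding alternative might look like the following sketch (the array layout and function name are illustrative): transparent positions are overwritten with zero for an inter-coded MB, or with the mid-grey value 2^(N−1) for an intra-coded MB, before the 8 × 8 FDCT is applied.

```c
#include <stdint.h>

/* Fill transparent positions of an 8x8 block before the FDCT.
 * 'alpha' is the block's transparency map (nonzero = opaque).
 * Inter-coded MBs are padded with 0 (residual data); intra-coded MBs
 * are padded with 2^(N-1), e.g. 128 for 8-bit samples. */
void pad_transparent(int16_t block[8][8], const uint8_t alpha[8][8],
                     int intra, int bits_per_pixel)
{
    int16_t fill = intra ? (int16_t)(1 << (bits_per_pixel - 1)) : 0;

    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            if (!alpha[i][j])
                block[i][j] = fill;
}
```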
[…] a relatively small block of samples)
7.2.5 Quantise/Rescale
Scalar quantisation and rescaling (Chapter 3) can be implemented by division and/or multiplication by constant parameters (controlled by a quantisation parameter or quantiser step size). In general, multiplication is an expensive computation and some gains may be achieved by integrating the quantisation and rescaling multiplications with the forward and inverse transforms respectively. In H.264, the specification of the quantiser is combined with that of the transform in order to facilitate this combination (see Chapter 6).
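A basic scalar quantiser and rescaler of the kind described in Chapter 3 is sketched below; in H.264 the division is avoided entirely by folding scaling factors into integer multiplications and shifts that are combined with the transform (Chapter 6).

```c
#include <stdlib.h>

/* Basic scalar quantisation: divide by the quantiser step size with
 * rounding to the nearest integer (for coefficients of either sign). */
int quantise(int coeff, int qstep)
{
    int sign = (coeff < 0) ? -1 : 1;
    return sign * ((abs(coeff) + qstep / 2) / qstep);
}

/* Rescaling ('inverse quantisation'): multiply back by the step size. */
int rescale(int level, int qstep)
{
    return level * qstep;
}
```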
7.2.6 Entropy Coding
7.2.6.1 Variable-Length Encoding
In Chapter 3 we introduced the concept of entropy coding using variable-length codes (VLCs).
In MPEG-4 Visual and H.264, the VLC required to encode each data symbol is defined by the standard. During encoding, each data symbol is replaced by the appropriate VLC, determined by (a) the context (e.g. whether the data symbol is a header value, transform coefficient,