In this chapter we give an overview of practical issues related to the design of software or hardware implementations of the coding standards. The design of each of the main functional blocks of a CODEC (such as motion estimation, transform and entropy coding) can have a significant impact on computational efficiency and compression performance. We discuss the interfaces to a video encoder and decoder and the value of video pre-processing to reduce input noise and post-processing to minimise coding artefacts.
Comparing the performance of video coding algorithms is a difficult task, not least because decoded video quality is dependent on the input video material and is inherently subjective. We compare the subjective and objective (PSNR) coding performance of MPEG-4 Visual and H.264 reference model encoders using selected test video sequences. Compression performance often comes at a computational cost and we discuss the computational performance requirements of the two standards.

The compressed video data produced by an encoder is typically stored or transmitted across a network. In many practical applications, it is necessary to control the bitrate of the encoded data stream in order to match the available bitrate of a delivery mechanism. We discuss practical bitrate control and network transport issues.
7.2 FUNCTIONAL DESIGN
Figures 3.51 and 3.52 show typical structures for a motion-compensated transform-based video encoder and decoder. A practical MPEG-4 Visual or H.264 CODEC is required to implement some or all of the functions shown in these figures (even if the CODEC structure is different
from that shown). Conforming to the MPEG-4/H.264 standards, whilst maintaining good compression and computational performance, requires careful design of the CODEC functional blocks. The goal of a functional block design is to achieve good rate/distortion performance (see Section 7.4.3) whilst keeping computational overheads to an acceptable level.
Functions such as motion estimation, transforms and entropy coding can be highly computationally intensive. Many practical platforms for video compression are power-limited or computation-limited and so it is important to design the functional blocks with these limitations in mind. In this section we discuss practical approaches and tradeoffs in the design of the main functional blocks of a video CODEC.
7.2.1 Segmentation
The object-based functionalities of MPEG-4 (Core, Main and related profiles) require a video
scene to be segmented into objects. Segmentation methods usually fall into three categories:
1. Manual segmentation: this requires a human operator to identify manually the borders of each object in each source video frame, a very time-consuming process that is obviously only suitable for 'offline' video content (video data captured in advance of coding and transmission). This approach may be appropriate, for example, for segmentation of an important visual object that may be viewed by many users and/or re-used many times in different composed video sequences.
2. Semi-automatic segmentation: a human operator identifies objects and perhaps object boundaries in one frame; a segmentation algorithm refines the object boundaries (if necessary) and tracks the video objects through successive frames of the sequence.
3. Fully-automatic segmentation: an algorithm attempts to carry out a complete segmentation of a visual scene without any user input, based on (for example) spatial characteristics such as edges and temporal characteristics such as object motion between frames.
Semi-automatic segmentation [1, 2] has the potential to give better results than fully-automatic segmentation but still requires user input. Many algorithms have been proposed for automatic segmentation [3, 4]. In general, better segmentation performance can be achieved at the expense of greater computational complexity. Some of the more sophisticated segmentation algorithms require significantly more computation than the video encoding process itself. Reasonably accurate segmentation performance can be achieved by spatio-temporal approaches (e.g. [3]) in which a coarse approximate segmentation is formed based on spatial information and is then refined as objects move. Excellent segmentation results can be obtained in controlled environments (for example, if a TV presenter stands in front of a blue background) but the results for practical scenarios are less robust.
The output of a segmentation process is a sequence of mask frames for each VO, each frame containing a binary mask for one VOP (e.g. Figure 5.30) that determines the processing of MBs and blocks and is coded as a BAB in each boundary MB position.
7.2.2 Motion Estimation
Motion estimation is the process of selecting an offset to a suitable reference area in a previously coded frame (see Chapter 3). Motion estimation is carried out in a video encoder (not in a decoder) and has a significant effect on CODEC performance.
Figure 7.1 Current block (white border): a 32 × 32 block in the current frame
A good choice of prediction reference minimises the energy in the motion-compensated residual, which in turn maximises compression performance. However, finding the 'best' offset can be a very computationally intensive procedure.
The offset between the current region or block and the reference area (the motion vector) may be constrained by the semantics of the coding standard. Typically, the reference area is constrained to lie within a rectangle centred upon the position of the current block or region. Figure 7.1 shows a 32 × 32-sample block (outlined in white) that is to be motion-compensated. Figure 7.2 shows the same block position in the previous frame (outlined in white) and a larger square extending ±7 samples around the block position in each direction. The motion vector may 'point' to any reference area within the larger square (the search area). The goal of a practical motion estimation algorithm is to find a vector that minimises the residual energy after motion compensation, whilst keeping the computational complexity within acceptable limits. The choice of algorithm depends on the platform (e.g. software or hardware) and on whether motion estimation is block-based or region-based.
7.2.2.1 Block Based Motion Estimation
Energy Measures
Motion compensation aims to minimise the energy of the residual transform coefficients after quantisation. The energy in a transformed block depends on the energy in the residual block (prior to the transform). Motion estimation therefore aims to find a 'match' to the current block or region that minimises the energy in the motion-compensated residual (the difference between the current block and the reference area). This usually involves evaluating the residual energy at a number of different offsets. The choice of measure for 'energy' affects computational complexity and the accuracy of the motion estimation process. Equations 7.1, 7.2 and 7.3 describe three energy measures, MSE, MAE and SAE.
Figure 7.2 Search region in the previous (reference) frame
The motion compensation block size is N × N samples; $C_{ij}$ and $R_{ij}$ are the current and reference area samples respectively:
1. Mean Squared Error:

$$\mathrm{MSE} = \frac{1}{N^2}\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\left(C_{ij}-R_{ij}\right)^2 \qquad (7.1)$$

2. Mean Absolute Error:

$$\mathrm{MAE} = \frac{1}{N^2}\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\left|C_{ij}-R_{ij}\right| \qquad (7.2)$$

3. Sum of Absolute Errors:

$$\mathrm{SAE} = \sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\left|C_{ij}-R_{ij}\right| \qquad (7.3)$$
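As a concrete illustration (not taken from the standards or the reference software), the following C sketch evaluates SAE (equation 7.3) for an N × N block at a candidate offset; the frame layout, a simple array of 8-bit samples with a given stride, is an assumption made for this example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of Absolute Errors (equation 7.3) between the current N x N block
 * and a candidate reference area. 'stride' is the frame width in samples;
 * (dx, dy) is the candidate motion vector (integer samples). */
int sae_block(const uint8_t *cur, const uint8_t *ref,
              int stride, int n, int dx, int dy)
{
    int sae = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sae += abs(cur[i * stride + j] -
                       ref[(i + dy) * stride + (j + dx)]);
    return sae;
}
```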
SAE is probably the most widely-used measure of residual energy for reasons of computational simplicity. The H.264 reference model software [5] uses SA(T)D, the sum of absolute differences of the transformed residual data, as its prediction energy measure (for both intra and inter prediction). Transforming the residual at each search location increases computation but improves the accuracy of the energy measure. A simple multiply-free transform is used and so the extra computational cost is not excessive.
The results of the above example indicate that the best choice of motion vector is (+2, 0). The minimum of the MSE or SAE map indicates the offset that produces a minimal residual energy and this is likely to produce the smallest energy of quantised transform coefficients.
Figure 7.6 Full search (raster scan)
The motion vector itself must be transmitted to the decoder, however, and as larger vectors are coded using more bits than small-magnitude vectors (see Chapter 3) it may be useful to 'bias' the choice of vector towards (0,0). This can be achieved simply by subtracting a constant from the MSE or SAE at position (0,0). A more sophisticated approach is to treat the choice of vector as a constrained optimisation problem [6]. The H.264 reference model encoder [5] adds a cost parameter for each coded element (MVD, prediction mode, etc.) before choosing the smallest total cost of motion prediction.
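As a hedged sketch of this idea (the actual cost function and weighting used by the reference model are more involved), the total cost can be formed as distortion plus a weighted estimate of the bits needed to code the motion vector difference; mvd_bits() here is a rough, illustrative estimate rather than the real VLC table.

```c
#include <stdlib.h>

/* Rough, illustrative estimate of the bits needed to code one MVD
 * component (loosely modelled on an Exp-Golomb-style signed code). */
static int mvd_bits(int mvd)
{
    int bits = 1, v = abs(mvd);
    while (v > 0) {
        bits += 2;
        v >>= 1;
    }
    return bits;
}

/* Rate-constrained motion cost: distortion (e.g. SAE) plus lambda times
 * the estimated rate of the motion vector difference. The candidate with
 * the smallest total cost is chosen. */
int motion_cost(int sae, int mvd_x, int mvd_y, int lambda)
{
    return sae + lambda * (mvd_bits(mvd_x) + mvd_bits(mvd_y));
}
```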
It may not always be necessary to calculate SAE (or MAE or MSE) completely at each offset location. A popular shortcut is to terminate the calculation early once the previous minimum SAE has been exceeded. For example, after calculating each inner sum of equation (7.3) ($\sum_{j=0}^{N-1}\left|C_{ij}-R_{ij}\right|$), the encoder compares the total SAE so far with the previous minimum. If the total so far exceeds the previous minimum, the calculation is terminated (since there is no point in finishing the calculation if the outcome is already higher than the previous minimum SAE).
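A minimal sketch of the early-termination shortcut, using the same hypothetical frame layout as the earlier SAE function: the running total is checked against the best SAE found so far after each row of the block.

```c
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* SAE with early termination: abandon the calculation as soon as the
 * running total exceeds the smallest SAE found so far. */
int sae_early_exit(const uint8_t *cur, const uint8_t *ref,
                   int stride, int n, int dx, int dy, int best_so_far)
{
    int sae = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            sae += abs(cur[i * stride + j] -
                       ref[(i + dy) * stride + (j + dx)]);
        if (sae > best_so_far)
            return INT_MAX;   /* cannot beat the current best; stop early */
    }
    return sae;
}
```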
Full Search
Full Search motion estimation involves evaluating equation 7.3 (SAE) at each point in the search window (±S samples about position (0,0), the position of the current macroblock). Full Search estimation is guaranteed to find the minimum SAE (or MAE or MSE) in the search window but it is computationally intensive since the energy measure (e.g. equation (7.3)) must be calculated at every one of (2S + 1)^2 locations.
Figure 7.6 shows an example of a Full Search strategy. The first search location is at the top-left of the window (position [−S, −S]) and the search proceeds in raster order until all positions have been evaluated.
Figure 7.7 Full search (spiral scan)
In a typical video sequence, most motion vectors are concentrated around (0,0) and so it is likely that a minimum will be found in this region. The computation of the full search algorithm can be simplified by starting the search at (0,0) and proceeding to test points in a spiral pattern around this location (Figure 7.7). If early termination is used (see above), the SAE calculation is increasingly likely to be terminated early (thereby saving computation) as the search pattern widens outwards.
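The raster-order Full Search can be sketched as below, reusing the early-exit SAE function above; a spiral scan would visit the same (2S + 1)^2 offsets but starting from (0,0), which tends to make the early exit trigger sooner.

```c
#include <stdint.h>
#include <limits.h>

int sae_early_exit(const uint8_t *cur, const uint8_t *ref,
                   int stride, int n, int dx, int dy, int best_so_far);

/* Full Search over +/-S samples around (0,0) in raster order,
 * returning the offset with the smallest SAE. */
void full_search(const uint8_t *cur, const uint8_t *ref, int stride,
                 int n, int S, int *best_dx, int *best_dy)
{
    int best = INT_MAX;
    *best_dx = 0;
    *best_dy = 0;
    for (int dy = -S; dy <= S; dy++) {
        for (int dx = -S; dx <= S; dx++) {
            int sae = sae_early_exit(cur, ref, stride, n, dx, dy, best);
            if (sae < best) {
                best = sae;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
    }
}
```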
‘Fast’ Search Algorithms
Even with the use of early termination, Full Search motion estimation is too computationally intensive for many practical applications. In computation- or power-limited applications, so-called 'fast search' algorithms are preferable. These algorithms operate by calculating the energy measure (e.g. SAE) at a subset of locations within the search window.
The popular Three Step Search (TSS, sometimes described as N-Step Search) is illustrated in Figure 7.8. SAE is calculated at position (0,0) (the centre of the figure) and at eight locations ±2^(N−1) (for a search window of ±(2^N − 1) samples). In the figure, S is 7 and the first nine search locations are numbered '1'. The search location that gives the smallest SAE is chosen as the new search centre and a further eight locations are searched, this time at half the previous distance from the search centre (numbered '2' in the figure). Once again, the 'best' location is chosen as the new search origin and the algorithm is repeated until the search distance cannot be subdivided further. The TSS is considerably simpler than Full Search (8N + 1 searches compared with (2^(N+1) − 1)^2 searches for Full Search) but the TSS (and other fast search algorithms) do not usually perform as well as Full Search.
Figure 7.8 Three Step Search
Figure 7.9 SAE map showing several local minima
The SAE map shown in Figure 7.5 has a single minimum point and the TSS is likely to find this minimum correctly, but the SAE map for a block containing complex detail and/or different moving components may have several local minima (e.g. see Figure 7.9). Whilst the Full Search will always identify the global minimum, a fast search algorithm may become 'trapped' in a local minimum, giving a suboptimal result.
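A sketch of the Three Step Search for a ±7 window (so the initial step size is 4), built on the earlier sae_block() function; clipping of candidate positions to the search window is omitted for brevity.

```c
#include <stdint.h>

int sae_block(const uint8_t *cur, const uint8_t *ref,
              int stride, int n, int dx, int dy);

/* Three Step Search: evaluate the centre and eight surrounding points at
 * the current step size, recentre on the best, halve the step and repeat. */
void three_step_search(const uint8_t *cur, const uint8_t *ref, int stride,
                       int n, int *best_dx, int *best_dy)
{
    int cx = 0, cy = 0;
    int best = sae_block(cur, ref, stride, n, 0, 0);

    for (int step = 4; step >= 1; step /= 2) {
        int bx = cx, by = cy;
        for (int dy = -step; dy <= step; dy += step) {
            for (int dx = -step; dx <= step; dx += step) {
                if (dx == 0 && dy == 0)
                    continue;                  /* centre already evaluated */
                int sae = sae_block(cur, ref, stride, n, cx + dx, cy + dy);
                if (sae < best) {
                    best = sae;
                    bx = cx + dx;
                    by = cy + dy;
                }
            }
        }
        cx = bx;                               /* recentre on the best point */
        cy = by;
    }
    *best_dx = cx;
    *best_dy = cy;
}
```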
Figure 7.10 Nearest Neighbours Search
Many fast search algorithms have been proposed, such as Logarithmic Search, Hierarchical Search, Cross Search and One at a Time Search [7–9]. In each case, the performance of the algorithm can be evaluated by comparison with Full Search. Suitable comparison criteria are compression performance (how effective is the algorithm at minimising the motion-compensated residual?) and computational performance (how much computation is saved compared with Full Search?). Other criteria may be helpful; for example, some 'fast' algorithms such as Hierarchical Search are better suited to hardware implementation than others.
Nearest Neighbours Search [10] is a fast motion estimation algorithm that has low computational complexity but closely approaches the performance of Full Search within the framework of MPEG-4 Simple Profile. In MPEG-4 Visual, each block or macroblock motion vector is differentially encoded. A predicted vector is calculated (based on previously-coded vectors from neighbouring blocks) and the difference (MVD) between the current vector and the predicted vector is transmitted. NNS exploits this property by giving preference to vectors that are close to the predicted vector (and hence minimise MVD). First, SAE is evaluated at location (0,0). Then, the search origin is set to the predicted vector location and surrounding points in a diamond shape are evaluated (labelled '1' in Figure 7.10). The next step depends on which of the points has the lowest SAE. If the (0,0) point or the centre of the diamond has the lowest SAE, the search terminates. If a point on the edge of the diamond has the lowest SAE (the highlighted point in this example), that becomes the centre of a new diamond-shaped search pattern and the search continues. In the figure, the search terminates after the points marked '3' are searched. The inherent bias towards the predicted vector gives excellent compression performance (close to the performance achieved by full search) with low computational complexity.
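The core of the NNS idea can be sketched as follows (a simplification: the cost bias, window clipping and the exact stopping rules of [10] are omitted): evaluate (0,0), then repeatedly test a small diamond around the predicted vector, moving the diamond whenever an edge point improves on its centre.

```c
#include <stdint.h>

int sae_block(const uint8_t *cur, const uint8_t *ref,
              int stride, int n, int dx, int dy);

/* Simplified Nearest Neighbours Search: check (0,0), then a diamond
 * centred on the predicted vector, recentring while an edge point wins. */
void nns_search(const uint8_t *cur, const uint8_t *ref, int stride, int n,
                int pred_dx, int pred_dy, int *best_dx, int *best_dy)
{
    static const int diamond[4][2] = { {1, 0}, {-1, 0}, {0, 1}, {0, -1} };

    int zero_sae = sae_block(cur, ref, stride, n, 0, 0);
    int cx = pred_dx, cy = pred_dy;
    int best = sae_block(cur, ref, stride, n, cx, cy);

    for (;;) {
        int bx = cx, by = cy, improved = 0;
        for (int k = 0; k < 4; k++) {
            int dx = cx + diamond[k][0], dy = cy + diamond[k][1];
            int sae = sae_block(cur, ref, stride, n, dx, dy);
            if (sae < best) {
                best = sae;
                bx = dx;
                by = dy;
                improved = 1;
            }
        }
        if (!improved)
            break;              /* the centre of the diamond is the minimum */
        cx = bx;
        cy = by;
    }

    if (zero_sae <= best) {     /* keep the (0,0) candidate if it is at least as good */
        *best_dx = 0;
        *best_dy = 0;
    } else {
        *best_dx = cx;
        *best_dy = cy;
    }
}
```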
Sub-pixel Motion Estimation
Chapter 3 demonstrated that better motion compensation can be achieved by allowing the offset into the reference frame (the motion vector) to take fractional values rather than just integer values. For example, the woman's head will not necessarily move by an integer number of pixels from the previous frame (Figure 7.2) to the current frame (Figure 7.1).
Increased fractional accuracy (half-pixel vectors in MPEG-4 Simple Profile, quarter-pixel vectors in Advanced Simple Profile and H.264) can provide a better match and reduce the energy in the motion-compensated residual. This gain is offset against the need to transmit fractional motion vectors (which increases the number of bits required to represent motion vectors) and the increased complexity of sub-pixel motion estimation and compensation.
Sub-pixel motion estimation requires the encoder to interpolate between integer sample positions in the reference frame, as discussed in Chapter 3. Interpolation is computationally intensive, especially so for quarter-pixel interpolation because a high-order interpolation filter is required for good compression performance (see Chapter 6). Calculating sub-pixel samples for the entire search window is not usually necessary. Instead, it is sufficient to find the best integer-pixel match (using Full Search or one of the fast search algorithms discussed above) and then to search interpolated positions adjacent to this position. In the case of quarter-pixel motion estimation, first the best integer match is found; then the best half-pixel match in the immediate neighbourhood is calculated; finally the best quarter-pixel match around this half-pixel position is found.
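The integer-then-half-then-quarter refinement might be organised as in the sketch below; sae_subpel() is a hypothetical helper that interpolates the reference area at a fractional offset (expressed in quarter-sample units) and returns its SAE against the current block.

```c
#include <stdint.h>

/* Hypothetical helper: SAE at an offset given in quarter-sample units,
 * interpolating the reference frame as required. */
int sae_subpel(const uint8_t *cur, const uint8_t *ref,
               int stride, int n, int qx, int qy);

/* Refine an integer-pixel vector (ix, iy) to quarter-pixel accuracy:
 * test the eight half-pel neighbours of the best integer position, then
 * the eight quarter-pel neighbours of the best half-pel position. */
void refine_quarter_pel(const uint8_t *cur, const uint8_t *ref, int stride,
                        int n, int ix, int iy, int *qx_out, int *qy_out)
{
    int cx = ix * 4, cy = iy * 4;                /* quarter-sample units */
    int best = sae_subpel(cur, ref, stride, n, cx, cy);

    for (int step = 2; step >= 1; step /= 2) {   /* 2 = half-pel, 1 = quarter-pel */
        int bx = cx, by = cy;
        for (int dy = -step; dy <= step; dy += step) {
            for (int dx = -step; dx <= step; dx += step) {
                if (dx == 0 && dy == 0)
                    continue;
                int sae = sae_subpel(cur, ref, stride, n, cx + dx, cy + dy);
                if (sae < best) {
                    best = sae;
                    bx = cx + dx;
                    by = cy + dy;
                }
            }
        }
        cx = bx;
        cy = by;
    }
    *qx_out = cx;
    *qy_out = cy;
}
```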
7.2.2.2 Object Based Motion Estimation
Chapter 5 described the process of motion compensated prediction and reconstruction (MC/MR) of boundary MBs in an MPEG-4 Core Profile VOP. During MC/MR, transparent pixels in boundary and transparent MBs are padded prior to forming a motion compensated prediction. In order to find the optimum prediction for each MB, motion estimation should be carried out using the padded reference frame. Object-based motion estimation consists of the following steps:
1. Pad transparent pixel positions in the reference VOP, as described in Chapter 5.
2. Carry out block-based motion estimation to find the best match for the current MB in the padded reference VOP. If the current MB is a boundary MB, the energy measure should only be calculated for opaque pixel positions in the current MB (a masked-SAE sketch is shown below).
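A minimal sketch of step 2's masked energy measure, assuming the same frame layout as the earlier SAE examples and a binary alpha mask for the current block (the reference VOP is assumed to have been padded already):

```c
#include <stdint.h>
#include <stdlib.h>

/* SAE restricted to the opaque pixels of the current boundary MB.
 * 'alpha' is the binary transparency mask for the current block
 * (nonzero = opaque). */
int sae_masked(const uint8_t *cur, const uint8_t *alpha, const uint8_t *ref,
               int stride, int n, int dx, int dy)
{
    int sae = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (alpha[i * stride + j])
                sae += abs(cur[i * stride + j] -
                           ref[(i + dy) * stride + (j + dx)]);
    return sae;
}
```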
Motion estimation for arbitrary-shaped VOs is more complex than for rectangular frames (or slices/VOs). In [11] the computation and compression performance of a number of popular motion estimation algorithms are compared for the rectangular and object-based cases. Methods of padding boundary MBs using graphics co-processor functions are described in [12] and a hardware architecture for Motion Estimation, Motion Compensation and CAE shape coding is presented in [13].
7.2.3 DCT/IDCT
The Discrete Cosine Transform is widely used in image and video compression algorithms
in order to decorrelate image or residual data prior to quantisation and compression (see Chapter 3). The basic FDCT and IDCT equations (equations (3.4) and (3.5)), if implemented directly, require a large number of multiplications and additions. It is possible to exploit the
structure of the transform matrix A in order to significantly reduce computational complexity
and this is one of the reasons for the popularity of the DCT.
Direct evaluation of equation (3.4) for an 8 × 8 FDCT (where N = 8) requires 64 × 64 = 4096 multiplications and accumulations. From the matrix form (equation (3.1)) it is clear that the 2D transform can be evaluated in two stages (i.e. calculate AX and then multiply by matrix A^T, or vice versa). The 1D FDCT is given by equation (7.4), where f_i are the N input samples and F_x are the N output coefficients. Rearranging the 2D FDCT equation (equation (3.4)) shows that the 2D FDCT can be constructed from two 1D transforms (equation (7.5)). The 2D FDCT may be calculated by evaluating a 1D FDCT of each column of the input matrix (the inner transform), then evaluating a 1D FDCT of each row of the result of the first set of transforms (the outer transform). The 2D IDCT can be manipulated in a similar way (equation (7.6)). Each eight-point 1D transform takes 64 multiply/accumulate operations, giving a total of 64 × 8 × 2 = 1024 multiply/accumulate operations for an 8 × 8 FDCT or IDCT.
$$F_x = C_x\sum_{i=0}^{N-1} f_i \cos\frac{(2i+1)x\pi}{2N}, \qquad C_0 = \sqrt{1/N},\quad C_x = \sqrt{2/N}\ (x>0) \qquad (7.4)$$

$$F_{xy} = C_x\sum_{i=0}^{N-1}\left[C_y\sum_{j=0}^{N-1} f_{ij}\cos\frac{(2j+1)y\pi}{2N}\right]\cos\frac{(2i+1)x\pi}{2N} \qquad (7.5)$$

$$f_{ij} = \sum_{x=0}^{N-1} C_x\left[\sum_{y=0}^{N-1} C_y F_{xy}\cos\frac{(2j+1)y\pi}{2N}\right]\cos\frac{(2i+1)x\pi}{2N} \qquad (7.6)$$
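The separable property of equations (7.5) and (7.6) maps directly onto code. The sketch below builds an 8 × 8 FDCT from sixteen direct 1-D transforms (equation (7.4)); in a real CODEC the inner routine would be replaced by a 'fast' algorithm such as the one in Figure 7.11.

```c
#include <math.h>

#define N 8
static const double PI = 3.14159265358979323846;

/* Direct 8-point 1-D FDCT (equation 7.4): 64 multiply/accumulates. */
static void fdct_1d(const double in[N], double out[N])
{
    for (int x = 0; x < N; x++) {
        double cx = (x == 0) ? sqrt(1.0 / N) : sqrt(2.0 / N);
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += in[i] * cos((2 * i + 1) * x * PI / (2.0 * N));
        out[x] = cx * sum;
    }
}

/* 2-D FDCT as two passes of 1-D transforms (equation 7.5):
 * columns first (the inner transform), then rows (the outer transform). */
void fdct_8x8(const double block[N][N], double coef[N][N])
{
    double tmp[N][N], col_in[N], col_out[N];

    for (int j = 0; j < N; j++) {               /* transform each column */
        for (int i = 0; i < N; i++)
            col_in[i] = block[i][j];
        fdct_1d(col_in, col_out);
        for (int i = 0; i < N; i++)
            tmp[i][j] = col_out[i];
    }
    for (int i = 0; i < N; i++)                 /* then transform each row */
        fdct_1d(tmp[i], coef[i]);
}
```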
At first glance, calculating an eight-point 1-D FDCT (equation (7.4)) requires the evaluation of eight different cosine factors ($\cos\frac{(2i+1)x\pi}{2N}$ with eight values of i) for each of eight coefficient indices (x = 0 ... 7). However, the symmetries of the cosine function make it possible to combine many of these calculations into a reduced number of steps. For example, consider the calculation of F2 (from equation (7.4)):

$$F_2 = \tfrac{1}{2}\left[f_0\cos\tfrac{\pi}{8} + f_1\cos\tfrac{3\pi}{8} + f_2\cos\tfrac{5\pi}{8} + f_3\cos\tfrac{7\pi}{8} + f_4\cos\tfrac{9\pi}{8} + f_5\cos\tfrac{11\pi}{8} + f_6\cos\tfrac{13\pi}{8} + f_7\cos\tfrac{15\pi}{8}\right] \qquad (7.7)$$

Evaluating equation (7.7) would seem to require eight multiplications and seven additions (plus a scaling by a half). However, by making use of the symmetrical properties of the cosine function this can be simplified to:

$$F_2 = \tfrac{1}{2}\left[(f_0 - f_3 - f_4 + f_7)\cos\tfrac{\pi}{8} + (f_1 - f_2 - f_5 + f_6)\cos\tfrac{3\pi}{8}\right] \qquad (7.8)$$
In a similar way, F6 may be simplified to:

$$F_6 = \tfrac{1}{2}\left[(f_0 - f_3 - f_4 + f_7)\cos\tfrac{3\pi}{8} - (f_1 - f_2 - f_5 + f_6)\cos\tfrac{\pi}{8}\right]$$

The bracketed sums need only be calculated once so that F2 and F6 can be calculated using a total of eight additions and four multiplications (plus a final scaling by a half). Extending this approach to the complete 8 × 8 FDCT leads to a number of alternative 'fast' implementations such as the popular algorithm due to
Chen, Smith and Fralick [14]. The data flow through this 1D algorithm can be represented as a 'flowgraph' (Figure 7.11). In this figure, a circle indicates addition of two inputs, a square indicates multiplication by a constant and cX indicates the constant cos(Xπ/16). This algorithm requires only 26 additions or subtractions and 20 multiplications (in comparison with the 64 multiplications and 64 additions required to evaluate equation (7.4)).
Figure 7.11 is just one possible simplification of the 1D DCT algorithm. Many flowgraph-type algorithms have been developed over the years, optimised for a range of implementation requirements (e.g. minimal multiplications, minimal subtractions, etc.). Further computational gains can be obtained by direct optimisation of the 2D DCT (usually at the expense of increased implementation complexity).

Flowgraph algorithms are very popular for software CODECs where (in many cases) the best performance is achieved by minimising the number of computationally-expensive multiply operations. For a hardware implementation, regular data flow may be more important than the number of operations and so a different approach may be required. Popular hardware architectures for the FDCT/IDCT include those based on parallel multiplier arrays and distributed arithmetic [15–18].
The integer IDCT approximations specified in the H.264 standard have been designed to be suitable for fast, efficient software and hardware implementation. The original proposal for the forward and inverse transforms [19] describes alternative implementations using (i) a series of shifts and additions ('shift and add'), (ii) a flowgraph algorithm and (iii) matrix multiplications. Some platforms (for example DSPs) are better suited to 'multiply-accumulate' calculations than to 'shift and add' operations and so the matrix implementation (described in C code in [20]) may be more appropriate for these platforms.
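As an illustration of the 'shift and add' style, the sketch below applies one row of the well-known H.264 4 × 4 forward core transform (Chapter 6); quantisation and the remaining scaling are handled separately, and the same butterfly is applied to the columns of the result.

```c
#include <stdint.h>

/* One row of the H.264 4x4 forward core transform. The only 'multiply'
 * is by 2, implemented as a left shift. The full 2-D transform applies
 * this butterfly to each row and then to each column of the result;
 * the remaining scaling is folded into the quantisation stage. */
static void core_transform_row(const int x[4], int y[4])
{
    int a = x[0] + x[3];
    int b = x[1] + x[2];
    int c = x[1] - x[2];
    int d = x[0] - x[3];

    y[0] = a + b;
    y[2] = a - b;
    y[1] = (d << 1) + c;    /* 2d + c */
    y[3] = d - (c << 1);    /* d - 2c */
}
```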
7.2.3.3 Object Boundaries
In a Core or Main Profile MPEG-4 CODEC, residual coefficients in a boundary MB are coded using the 8 × 8 DCT. Figure 7.12 shows one block from a boundary MB (with the transparent pixels set to 0 and displayed as black). The entire block (including the transparent pixels) is transformed with an 8 × 8 DCT and quantised, and the reconstructed block after rescaling and inverse DCT is shown in Figure 7.13. Note that some of the formerly transparent pixels are now nonzero due to quantisation distortion (e.g. the pixel marked with a white 'cross'). The decoder discards the transparent pixels (according to the BAB transparency map) and retains the opaque pixels.

Figure 7.12 8 × 8 block in a boundary MB
Using an 8 × 8 DCT and IDCT for an irregular-shaped region of opaque pixels is not ideal because the transparent pixels contribute to the energy in the DCT coefficients and so more data is coded than is absolutely necessary. Because the transparent pixel positions are discarded by the decoder, the encoder may place any data at all in these positions prior to the DCT. Various strategies have been proposed for filling (padding) the transparent positions prior to applying the 8 × 8 DCT, for example by padding with values selected to minimise the energy in the DCT coefficients [21, 22], but choosing the optimal padding values is a computationally expensive process. A simple alternative is to pad the transparent positions in an inter-coded MB with zeros (since the motion-compensated residual is usually close to zero anyway) and to pad the transparent positions in an intra-coded MB with the value 2^(N−1), where N is the number of bits per pixel (since this is mid-way between the minimum and maximum pixel value). The Shape-Adaptive DCT (see Chapter 5) provides a more efficient solution for transforming irregular-shaped blocks but is computationally intensive and is only available in the Advanced Coding Efficiency Profile of MPEG-4 Visual.
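The simple padding alternative might look like the following sketch (the array layout and function name are illustrative): transparent positions are overwritten with zero for an inter-coded MB, or with the mid-grey value 2^(N−1) for an intra-coded MB, before the 8 × 8 FDCT is applied.

```c
#include <stdint.h>

/* Fill transparent positions of an 8x8 block before the FDCT.
 * 'alpha' is the block's transparency map (nonzero = opaque).
 * Inter-coded MBs are padded with 0 (residual data); intra-coded MBs
 * are padded with 2^(N-1), e.g. 128 for 8-bit samples. */
void pad_transparent(int16_t block[8][8], const uint8_t alpha[8][8],
                     int intra, int bits_per_pixel)
{
    int16_t fill = intra ? (int16_t)(1 << (bits_per_pixel - 1)) : 0;

    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            if (!alpha[i][j])
                block[i][j] = fill;
}
```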
[…] a relatively small block of samples)
7.2.5 Quantise/Rescale
Scalar quantisation and rescaling (Chapter 3) can be implemented by division and/or multiplication by constant parameters (controlled by a quantisation parameter or quantiser step size). In general, multiplication is an expensive computation and some gains may be achieved by integrating the quantisation and rescaling multiplications with the forward and inverse transforms respectively. In H.264, the specification of the quantiser is combined with that of the transform in order to facilitate this combination (see Chapter 6).
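A basic scalar quantiser and rescaler of the kind described in Chapter 3 is sketched below; in H.264 the division is avoided entirely by folding scaling factors into integer multiplications and shifts that are combined with the transform (Chapter 6).

```c
#include <stdlib.h>

/* Basic scalar quantisation: divide by the quantiser step size with
 * rounding to the nearest integer (for coefficients of either sign). */
int quantise(int coeff, int qstep)
{
    int sign = (coeff < 0) ? -1 : 1;
    return sign * ((abs(coeff) + qstep / 2) / qstep);
}

/* Rescaling ('inverse quantisation'): multiply back by the step size. */
int rescale(int level, int qstep)
{
    return level * qstep;
}
```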
7.2.6 Entropy Coding
7.2.6.1 Variable-Length Encoding
In Chapter 3 we introduced the concept of entropy coding using variable-length codes (VLCs).
In MPEG-4 Visual and H.264, the VLC required to encode each data symbol is defined by the standard. During encoding, each data symbol is replaced by the appropriate VLC, determined by (a) the context (e.g. whether the data symbol is a header value, transform coefficient,