To achieve the highest coding efficiency, H.264/AVC uses the rate distortion optimization (RDO) technique to get the best coding result in terms of maximizing coding quality and minimizing the resulting data bits. The idea of RDO can be briefly explained as follows: the encoder examines all possible modes of blocks, such as the different directions in intra spatial prediction, and the different block sizes and multiple reference frames in the case of inter prediction, and chooses the mode with the least RDO cost. This brute-force RDO achieves much better performance, but at the expense of very high computational complexity. Even with state-of-the-art hardware technology, real-time video coding using H.264/AVC is still a prohibitive task. Therefore, algorithms for reducing the time complexity of H.264/AVC, while maintaining the coded bit rate and reconstructed video quality, are indispensable for real-time implementation of H.264/AVC.
Since the early 1990s, video coding technology has evolved continuously, generating international video coding standards such as H.261, H.263, H.263++, MPEG-1, MPEG-2, and MPEG-4 Visual. They have contributed tremendously to the successful commercialization of digital video coding. Like these previous video coding standards, H.264 will continue to provide technical solutions in many targeted application fields such as mobile video communication, digital media production and telemedicine.
1.2 Objectives
Because of the computational complexity of H.264, this thesis aims at developing fast algorithms that improve the encoding speed of H.264 without much loss of visual quality. In detail, the objectives are:
(1) Develop new fast and efficient intra mode coding methods for H.264: these methods should adaptively select the most probable candidates for direction prediction. The approaches are capable of achieving low complexity performance of existing
(2) Explore a new scheme for inter mode coding in H.264: the scheme should effectively reduce the time spent on selecting among different block sizes in inter coding, with minimum sacrifice in visual quality.
(3) Explore new interpolation approaches for H.264: the approaches should be lossless and greatly reduce the time spent on the interpolation process in the encoder.
1.3 Thesis Contributions
This thesis proposes algorithms that achieve the set objectives. They are contributions not only to academia but to industry as well. To be specific, the contributions are:
(1) A fast mode decision algorithm is presented for intra prediction in H.264 video coding. By making use of the edge direction histogram, the number of mode combinations for luminance and chrominance blocks in a macroblock (MB) that take part in the RDO calculation is reduced significantly, from 592 to as low as 132. This results in a great reduction in the complexity and computation load of the encoder. Experimental results show that the fast algorithm has a negligible loss of PSNR compared to the original scheme.
(2) A fast inter mode decision algorithm is proposed to decide the best mode in inter coding of H.264. It makes use of the spatial homogeneity and the temporal stationarity characteristics of the textures of video objects. Specifically, the homogeneity decision for a block is based on edge information inside the block, and the co-sited MB difference is used to decide whether the MB is temporally stationary. Based on the homogeneity and stationarity of the video objects, only a small number of inter modes are used in RDO. The experimental results show that the fast algorithm reduces encoding time by 30% on average, with negligible PSNR loss.
(3) Two fast intra 4x4 mode elimination approaches are put forward for H.264. The lossless approach checks the cost after each 4x4 block intra mode decision, and terminates if the cost is higher than the minimum cost of inter mode coding. The lossy approach, which uses some low cost preprocessing to make a prediction, terminates if the cost is higher than some fraction of the minimum cost of the inter mode. Experimental results show that the lossless approach can reduce the encoding time without any sacrifice of visual quality. The lossy approach can further reduce encoding time with negligible PSNR loss or bit rate increment.
(4) Two adaptive interpolation methods are also presented that significantly reduce the interpolation operations required in H.264 video coding. By making use of a flag matrix data structure and interpolation on demand, the proposed methods are able to increase encoder speed greatly without any PSNR loss or bit rate increase.
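The termination rule described in contribution (3) can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the implementation used in the thesis: the function name `intra4x4_with_early_exit`, the list of per-block costs and the `fraction` parameter are hypothetical. With `fraction = 1.0` the check mirrors the lossless rule (since block costs are non-negative, intra 4x4 can no longer win once its accumulated cost exceeds the minimum inter cost); a value below 1.0 mirrors the lossy rule.

```python
def intra4x4_with_early_exit(block_costs, min_inter_cost, fraction=1.0):
    """Accumulate per-4x4-block intra costs and stop once intra cannot win.

    block_costs    -- best intra cost of each 4x4 block in coding order (hypothetical input)
    min_inter_cost -- minimum cost already found among the inter modes
    fraction       -- 1.0 for the lossless rule, below 1.0 for the lossy rule (assumed parameter)
    Returns (total_cost, blocks_evaluated); total_cost is None on early termination.
    """
    limit = fraction * min_inter_cost
    total = 0
    for evaluated, cost in enumerate(block_costs, start=1):
        total += cost
        if total > limit:
            # Remaining blocks are skipped: intra 4x4 is eliminated.
            return None, evaluated
    return total, len(block_costs)


costs = [100] * 16  # sixteen 4x4 blocks in one macroblock
print(intra4x4_with_early_exit(costs, 450))       # stops after the 5th block
print(intra4x4_with_early_exit(costs, 450, 0.5))  # lossy rule stops earlier
print(intra4x4_with_early_exit(costs, 2000))      # all 16 blocks evaluated
```

The number of blocks evaluated before termination is what determines the time saved in this sketch.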
1.4 Organization of the Thesis
The rest of this thesis is structured as follows. In Chapter 2, a brief introduction to H.264 is given, together with a literature survey. In Chapter 3, a fast intra mode decision method is proposed. A fast inter mode decision approach is given in Chapter 4. Intra 4x4 mode elimination approaches are presented in Chapter 5. Adaptive interpolation methods are described in Chapter 6. In Chapter 7, the contributions of this thesis are summarized and future work is outlined.
CHAPTER 2
H.264 AND LITERATURE SURVEY
In this chapter, an overview of the H.264 standard is presented. Some important aspects of the standard are briefly introduced, including the architectures of the encoder and the decoder, inter mode decision, intra mode decision and motion estimation. A literature survey is presented in the later part of the chapter.
2.1 H.264
The well-known international standards on video coding such as H.261, H.263, MPEG-1, MPEG-2 and MPEG-4 Visual have been developed in the past one or two decades. A few years ago, the ITU-T Video Coding Experts Group (VCEG) began a long-term effort to develop a new standard for low bit rate video coding and communication applications. This long-term effort resulted in the standardization of H.26L, which demonstrates much higher compression efficiency than existing ITU-T standards such as H.261 and H.263.
In 2001, ISO/MPEG joined ITU-T/VCEG for further development of H.26L. A joint team called the Joint Video Team (JVT) was formed, whose main goal was to develop the draft H.26L model into a final, complete international standard. This new standard is known as Advanced Video Coding (AVC), which is also called MPEG-4 Part
recommendation by ITU-T, and in the same year AVC was accepted as an international standard by ISO. H.264 and AVC are used interchangeably in this thesis.
H.264 was proposed under the MPEG requirement for advanced video coding tools. Compared with MPEG-4 Visual, it has a narrower scope and mainly targets more efficient and robust coding and transmission of video frames instead of segmenting different objects inside the frames. Its original aim was to provide functionality similar to existing video coding standards such as H.263 and the Simple Profile of MPEG-4 Visual, but with significantly better compression efficiency and more robust and reliable transport over transmission channels. It targets applications including duplex video communication (also known as video conferencing or video telephony), digital television broadcasting, digital video streaming, telemedicine, digital video storage and digital cinema. Support for robust transmission over various network architectures is built in. In addition, the standard is designed to facilitate implementation on a wide range of processor platforms such as Intel, AMD and Sun Solaris. One aspect in which the standard differs from other existing video coding standards is its attempt to make interoperation among different developers easy and to avoid misinterpretation [4, 5].
The elements common to all video coding standards are present in the current H.264 recommendation. Specifically, macroblocks are 16x16 in size. Luminance is represented with higher resolution than chrominance, using 4:2:0 sub-sampling. Motion compensation and block transforms are followed by scalar quantization and entropy coding. Motion vectors are predicted from the median of the motion vectors of neighboring blocks. Bi-directional B-pictures are supported that may be motion compensated from both temporally previous and subsequent pictures. A direct mode exists for B-pictures in which both forward and backward motion vectors are derived from the motion vector of a co-located macroblock in a reference picture. While sharing these common features with other existing standards, H.264 has many advantages that distinguish it from them.
Some of the key advantages of H.264 are:
(1) Up to 50% in bit rate savings compared to MPEG-4 Visual;
(2) Much better visual quality and PSNR value;
(3) Better error resilience technology;
(4) Network adaptation friendliness.
Experimental results [6, 7, 8, 9] have demonstrated that H.264 achieves substantially better video quality than H.261, H.263, MPEG-2, and MPEG-4 Visual. The JVT reference model software is able to achieve up to 50% bit rate savings compared with the existing H.263 or MPEG-4 Visual codecs. In other words, H.264 provides significantly better visual quality at the same bit rate. In addition to new coding features such as inter and intra prediction, H.264 utilizes some error resilience techniques to cope with different channel environments. These characteristics make H.264 an ideal codec for applications with very limited channel capacity, limited storage and extremely error prone channels, such as mobile communication and video telephony.
Because of its high compression ratio, the H.264 codec can be utilized to generate
previous video coding standards such as H.263 and MPEG-4 Visual. Therefore, there is no doubt that H.264 will be a strong competitor in the deployment of next generation multimedia applications. Besides, one important feature of H.264 is that it is an open standard. This will bring the codec price down, making the technology affordable to everyone. Furthermore, the bit stream format of H.264 is non-proprietary, which is crucial in today’s multimedia application environment.
Figure 2.1 H.264 encoder architecture (blocks shown: Predictive Coding (Intra / Inter), Integer Transform, Entropy Encode, Inverse Quantization, Inverse Transform)
Figure 2.1 illustrates the architecture of the H.264 encoder. The current frame, denoted by fn, is one frame of the original video sequence input to the encoder. The frame is divided into macroblocks, each 16 by 16 pixels in size. These macroblocks are encoded one by one in raster scan order until the whole frame has been processed.
Predictive coding is applied to each macroblock, in the sense that it is encoded in either intra mode or inter mode. A prediction block P of the input source macroblock is generated and subtracted from the input macroblock to form the residue, Dn. An integer transform (an approximation to the Discrete Cosine Transform) is performed on Dn, and the transform coefficients are quantized. It should be noted that the transform itself does not compress information, whereas quantization compresses data in a lossy manner by discarding irrelevant information. The quantized transform coefficients are re-ordered and entropy coded. Entropy encoding compresses information losslessly by exploiting information redundancy. The entropy-coded coefficients, together with the side information required for decoding the macroblock (such as the quantizer step size, the prediction direction in intra mode or the block size decision in inter mode, motion vectors and the reference frame number), form the compressed bit stream. The bit stream is then passed to a Network Abstraction Layer (NAL) for transmission or storage.
In addition, there is a reconstruction path inside the encoder, which essentially serves as the decoder. Its purpose is to reconstruct the encoded block for future predictive coding of other macroblocks that will reference it. In detail, the quantized coefficients are inversely quantized and inversely transformed, resulting in a decoded prediction residual, Dn'. Dn' is added to the prediction, and the resultant frame is fed into an in-loop deblocking filter to generate the final reconstructed frame.
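The forward path for a single 4x4 block (residue, integer transform, quantization) can be sketched as follows. The matrix `CF` is the 4x4 forward core transform of H.264; everything else is an assumption made for illustration. In particular, `quantize` is a plain dead-zone uniform quantizer with a hypothetical `step`, whereas the standard folds the transform scaling into a QP-dependent quantizer, and the inverse path is not modeled here.

```python
# 4x4 forward core transform matrix of H.264 (integer approximation of the DCT).
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def residual(block, prediction):
    """Dn = input block minus prediction block P."""
    return [[block[i][j] - prediction[i][j] for j in range(4)] for i in range(4)]


def core_transform(d):
    """Y = CF * d * CF^T, computed with integer arithmetic only."""
    tmp = [[sum(CF[i][k] * d[k][j] for k in range(4)) for j in range(4)] for i in range(4)]
    return [[sum(tmp[i][k] * CF[j][k] for k in range(4)) for j in range(4)] for i in range(4)]


def quantize(coeffs, step):
    """Simplified dead-zone quantizer (assumption): truncate |c| / step toward zero."""
    return [[(abs(c) // step) * (1 if c >= 0 else -1) for c in row] for row in coeffs]


def encode_block(block, prediction, step):
    """Residue, transform, quantize: the lossy part of the forward path."""
    return quantize(core_transform(residual(block, prediction)), step)


block = [[10] * 4 for _ in range(4)]
prediction = [[7] * 4 for _ in range(4)]
print(core_transform(residual(block, prediction)))  # flat residue: only the DC term is non-zero
print(encode_block(block, prediction, 16))
```

For a flat residue, all the energy lands in the top-left (DC) coefficient, which is why smooth regions are cheap to code after entropy coding.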
2.3 H.264 Decoder
Figure 2.2 H.264 decoder architecture (blocks shown: NAL, Entropy Decode, Reorder, Inverse Quantization, Inverse Transform, Predictive Coding, Filter; signals: P, uF’n, F’n-1 (reference), F’n (reconstructed))
As illustrated in Figure 2.2, the H.264 decoder architecture resembles the reconstruction path in the encoder. The difference lies in the initial processing: the decoder receives a compressed bit stream from the NAL, and the data elements are entropy decoded and reordered to produce a set of quantized coefficients.
2.4 Predictive Coding
Predictive coding in H.264 consists of two categories of prediction techniques. One is intra coding, which makes predictions using information inside the same frame. The other is inter coding, which predicts using information from other frames. Both approaches attempt to find the best prediction for each input macroblock, thus leading to the best coding gain. The predictions are generally of high complexity. The details of the operation of these two modes are described in the following sections.
Intra 4x4 mode separates each macroblock into sixteen 4x4 blocks. For each 4x4 block, Intra 4x4 mode supports nine prediction patterns. As illustrated in Figure 2.3, these are vertical, horizontal, DC (mean), diagonal down-left, diagonal down-right, vertical-right, horizontal-down, vertical-left and horizontal-up. In contrast, intra 16x16 mode does not divide the macroblock and uses only four prediction patterns: vertical, horizontal, DC and plane, as shown in Figure 2.4. In addition, the 8x8 chrominance samples of the current macroblock are predicted in a manner similar to the intra 16x16 mode. As illustrated in Figures 2.3 and 2.4, each prediction block is obtained by extrapolating or interpolating the neighboring pixels in a specific pattern. The prediction mode that minimizes the residue error between the original input macroblock and its prediction block is chosen as the final coding mode by the encoder.
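As a small illustration of how a prediction block is formed and selected, the sketch below implements three of the nine Intra 4x4 patterns (vertical, horizontal and DC) and picks the one with the smallest sum of absolute differences (SAD). The function names are hypothetical, the six directional diagonal modes are omitted, and SAD is used here as a stand-in for the residue error; an encoder using RDO would compare RD costs instead.

```python
def predict_vertical(top, left):
    """Each column repeats the sample directly above the block (A..D)."""
    return [list(top) for _ in range(4)]


def predict_horizontal(top, left):
    """Each row repeats the sample directly to the left of the block (I..L)."""
    return [[left[r]] * 4 for r in range(4)]


def predict_dc(top, left):
    """All samples take the rounded mean of the eight neighbors A..D and I..L."""
    dc = (sum(top) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]


def sad(a, b):
    return sum(abs(a[r][c] - b[r][c]) for r in range(4) for c in range(4))


def best_intra4x4_mode(block, top, left):
    """Return (best_mode_name, {mode_name: SAD}) over the three sketched modes."""
    modes = {"vertical": predict_vertical,
             "horizontal": predict_horizontal,
             "dc": predict_dc}
    costs = {name: sad(block, fn(top, left)) for name, fn in modes.items()}
    return min(costs, key=costs.get), costs


# A block with vertical stripes is predicted perfectly by the vertical mode.
block = [[10, 20, 30, 40] for _ in range(4)]
print(best_intra4x4_mode(block, top=[10, 20, 30, 40], left=[25, 25, 25, 25]))
```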
In general, intra coding is necessary for intra-coded frames (I frames), since the blocks inside such a frame have no prediction information other than neighboring blocks. For non-I frames it is also very useful, particularly for coding regions with scene changes and regions with homogeneous characteristics. Otherwise, it provides only moderate compression efficiency because of the relatively large number of bits used to represent the block information. In order to further exploit the information redundancy between neighboring frames inside video sequences, a more efficient and sophisticated predictive coding mode, known as inter coding, is used in H.264.
Figure 2.3 Intra 4x4 prediction modes (surviving labels: 1 (horizontal), 4 (diagonal down-right), 7 (vertical-left), 8 (horizontal-up); neighboring samples labeled A to M; DC prediction given as Mean (A-D, I-L))
2.4.2 Inter Coding
By exploiting the temporal block similarities inherent in video sequences, inter coding generally has much higher coding efficiency than intra coding. However, this comes at the price of significantly higher complexity. Inter coding employs temporal prediction (known as motion estimation/compensation) from previously reconstructed pictures, in an attempt to exploit the temporal correlation existing among neighboring frames.
Figure 2.5 Variable partition sizes employed in inter coding (surviving labels: 8x16, P8x8)
Various partition sizes are supported in inter coding, as illustrated in Figure 2.5. Partitions of size 16x16, 16x8, 8x16 and 8x8 are supported. Moreover, when the P8x8 inter mode is chosen, each of the four 8x8 blocks can be further partitioned into 8x8, 8x4, 4x8 or 4x4 partitions. These different partition sizes are input to the motion estimation process in order to get the best match.
Furthermore, there is also a 16x16 SKIP mode, which can be employed to maximize the coding efficiency for a particular macroblock. This mode is suitable when the present macroblock can be fully predicted by directly copying the corresponding macroblock from a previously reconstructed frame. Thus, no data is encoded except one single bit whose only purpose is to inform the decoder that the macroblock is encoded as a SKIP macroblock. Therefore, the SKIP mode offers the highest coding efficiency relative to all other inter modes and the intra modes.
Considering the diverse options for partition sizes and modes, each macroblock has to be encoded using a mode carefully chosen from this potentially large number of candidates. Normally, a large partition size is sufficient for encoding stationary and homogeneous regions inside one frame, whilst a small partition size is better suited to regions with more detail.
2.5 Motion Estimation
As in other conventional video coding standards, motion estimation/motion compensation as a block matching technique plays a crucial role in the H.264 encoding process. In the full search process of motion estimation, an exhaustive matching within some search window is conducted in order to find the best prediction for the current input source block. The search is done with reference to a reconstructed picture, commonly called the reference frame. For each block, the search generally results in a motion vector pointing to the location in the corresponding reference frame where the best prediction block is obtained.
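The full search described above can be sketched in a few lines. This is a minimal integer-pixel illustration, assuming frames stored as lists of rows and SAD as the matching criterion; the names `sad_block` and `full_search` and the choice to skip candidates that fall outside the reference frame are assumptions, and fractional-pixel refinement is not modeled.

```python
def sad_block(cur, ref, bx, by, dx, dy, size):
    """SAD between the current block at (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def full_search(cur, ref, bx, by, size, search_range):
    """Test every displacement in the search window; return (best_mv, best_sad)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates whose block would leave the reference frame (assumption).
            if bx + dx < 0 or by + dy < 0:
                continue
            if bx + dx + size > width or by + dy + size > height:
                continue
            cost = sad_block(cur, ref, bx, by, dx, dy, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Synthetic test: the current frame is the reference shifted by (2, 1).
ref = [[x + 10 * y for x in range(12)] for y in range(12)]
cur = [[ref[min(y + 1, 11)][min(x + 2, 11)] for x in range(12)] for y in range(12)]
print(full_search(cur, ref, bx=4, by=4, size=4, search_range=2))
```

With a search range of r, the window contains (2r + 1)^2 candidates per block, which is the cost that fast motion estimation methods try to reduce.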
Motion estimation using full search is highly computationally intensive, and thus a lot of research on fast motion estimation (FME) methods has been conducted in order to reduce the number of search points. In addition to the motion information, the sizes of the blocks in terms of specific inter modes must also be encoded in the bit stream. The choice of block sizes has a significant impact on the bit rate of the encoded bit stream. Generally, smaller partition sizes offer much better matches to the input macroblock than larger ones. However, the motion data related to the additional blocks leads to more bits in the output bit stream. Therefore, a trade-off has to be made between the motion information and the residue in order to achieve optimum coding efficiency.
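At the motion estimation stage, the reference picture is searched for candidate motion vectors, and the motion vector that results in the best prediction is chosen. The motion estimation implemented in the H.264 test model chooses the motion vector that minimizes the following cost function:

$$J(\mathbf{m}, \lambda_{motion}) = SAD(s, c(\mathbf{m})) + \lambda_{motion} \cdot R(\mathbf{m} - \mathbf{p})$$

$$SAD(s, c(\mathbf{m})) = \sum_{x=1,\,y=1}^{B,\,B} \left| s[x, y] - c[x - m_x,\ y - m_y] \right|$$

with s being the original video signal and c being the coded video signal. Here m = (m_x, m_y) is the candidate motion vector, p is the motion vector prediction, \lambda_{motion} is the Lagrangian multiplier, R(m - p) is the number of bits needed to code the motion vector difference, and B is the block size.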
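A motion cost of the form "SAD plus a Lagrangian multiplier times the motion vector bits" can be sketched as follows. This is an illustration under assumptions, not the test model code: the rate is taken to be the length of the signed Exp-Golomb code of each motion vector difference component, and the function names and the example value of `lam` are hypothetical.

```python
def se_golomb_bits(v):
    """Length in bits of the signed Exp-Golomb code of the integer v."""
    code_num = 2 * v - 1 if v > 0 else -2 * v
    # An unsigned Exp-Golomb code has 2 * floor(log2(code_num + 1)) + 1 bits.
    return 2 * (code_num + 1).bit_length() - 1


def motion_cost(sad, mv, pred_mv, lam):
    """J = SAD + lam * R(mv - pred_mv), with R counted in Exp-Golomb bits (assumption)."""
    rate = (se_golomb_bits(mv[0] - pred_mv[0]) +
            se_golomb_bits(mv[1] - pred_mv[1]))
    return sad + lam * rate


# A candidate with a slightly larger SAD can still win if its vector is cheaper to code.
print(motion_cost(sad=100, mv=(2, 1), pred_mv=(0, 0), lam=4))  # 100 + 4 * (5 + 3)
print(motion_cost(sad=110, mv=(0, 0), pred_mv=(0, 0), lam=4))  # 110 + 4 * (1 + 1)
```

The example shows why the rate term matters: the second candidate has the higher SAD but the lower total cost.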
One outstanding characteristic of H.264 is that it allows multi-frame motion estimation/compensation. Unlike previous coding standards, H.264 allows more than one previously coded frame to be used as a reference for motion compensated prediction. Figure 2.6 illustrates the scenario where the current frame may refer to five previously decoded frames. The multi-frame ME/MC scheme is one of the major contributors to the high compression efficiency of H.264, since it captures the temporally repetitive motions prevalent in natural video sequences, such as birds flapping their wings up and down or people walking to and fro.
Figure 2.6 Multiple frame motion estimation/motion compensation (labels: current frame, five reference frames)
2.6 Mode Decision
Besides the features introduced above, in order to achieve the highest coding efficiency, H.264/AVC uses the rate distortion optimization (RDO) technique to get the best coding result in terms of maximizing coding quality and minimizing the resulting data bits. The idea of RDO can be explained as follows: the encoder tries all possible modes of each block, such as the different directions of intra spatial prediction, and the different block sizes, inter-frame motion estimation and multiple reference frames in the case of inter modes, and chooses the mode with the least rate distortion (RD) cost. The RD cost of a certain mode is calculated from the distortion, the rate and a Lagrangian multiplier λ. Computing λ is not very complex, since it is merely a function of the quantization parameter (QP). The rate, however, is obtained only after a sequence of operations such as motion estimation/motion compensation (in the case of inter coding), integer transform, quantization, inverse quantization, inverse integer transform and entropy coding. During this procedure, the distortion can be acquired after reconstructing the macroblock.
The calculation of rate and distortion allows RDO to achieve much better coding efficiency, but at the expense of very high computational complexity. It makes real-time video coding using H.264/AVC a difficult problem. Therefore, algorithms that reduce the computational complexity of H.264/AVC while maintaining the coded bit rate and reconstructed video quality are indispensable for real-time implementation of H.264/AVC.
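The mode selection step itself is simple once distortion and rate are known, as the sketch below shows. It assumes the common Lagrangian form J = D + λR, and it assumes λ = 0.85 · 2^((QP − 12) / 3), a form widely used with the reference software but not stated in the text above. The candidate distortion and rate values are invented purely for illustration.

```python
def lagrange_multiplier(qp):
    """Assumed mapping from QP to the mode-decision multiplier: 0.85 * 2^((QP - 12) / 3)."""
    return 0.85 * 2 ** ((qp - 12) / 3.0)


def choose_mode(candidates, qp):
    """candidates maps mode name -> (distortion, rate_in_bits); returns (best_mode, costs)."""
    lam = lagrange_multiplier(qp)
    costs = {mode: d + lam * r for mode, (d, r) in candidates.items()}
    return min(costs, key=costs.get), costs


# Hypothetical measurements for one macroblock: finer partitions lower the
# distortion but need more bits for motion vectors and mode information.
candidates = {"16x16": (1000, 20), "8x8": (700, 60)}
print(choose_mode(candidates, qp=12)[0])  # small lambda: distortion dominates
print(choose_mode(candidates, qp=24)[0])  # larger lambda: rate dominates
```

The expensive part in a real encoder is producing each (distortion, rate) pair, since that requires the full transform, quantization, reconstruction and entropy coding chain for every candidate mode.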
2.7 Literature Survey
In the literature, quite a number of approaches have been proposed to improve the encoding speed of H.264. They can be classified under the following categories: fast full pixel motion estimation, fast fractional pixel motion estimation, adaptive adjustment of the search window size used in motion estimation, fast decision of the reference frame, reduction of the SAD calculation, and detection of all-zero integer transformed blocks.
In addition, some fast mode decision algorithms have also been proposed. These
The fast motion estimation (FME) method proposed in [10] sets a search pattern in the initial step. It also uses the motion vectors of previous block shapes as the prediction for the following block shapes within a macroblock. In the final search stage, seven adaptive preferential search ranges are used for the seven block shapes. These algorithms achieve significant time savings with negligible loss of coding efficiency.
In the proposal [11], a hierarchical FME algorithm consisting of four main steps is proposed. Firstly, the prediction of the initial search point considers the motion vector relationship in the spatial domain and between different prediction modes. Secondly, an unsymmetrical-cross search is performed to give a good starting search point, after which an uneven multi-hexagon-grid search is used to keep the search from being trapped in a local minimum. In the last step, an extended hexagon based search is performed for refinement. For fractional pixel ME, a Center Biased Fractional Pixel Search (CBFPS) strategy with a diamond search pattern is also proposed.
In the H.264 proposal document [12], an approach called Enhanced Predictive Zonal Search (EPZS) is proposed for motion estimation. It comprises three features: initial predictor selection, adaptive early termination and final prediction refinement. In the first feature, only a small set of highly reliable predictors is examined, such as the median predictor, the temporal predictor and spatial predictors. With regard to the second feature, an early termination process is adopted, since the distortion of adjacent blocks tends to be highly correlated. The third feature refines motion vectors by using several types of search patterns.
Reducing the number of reference frames has also been proposed. In the paper [13], the authors present a method to reduce the computational cost of multiple frame motion estimation without significant quality degradation. Instead of checking all the blocks in each reference frame, the search is done only on a center-biased path, so that an ultimate frame can be selected for the final search.
A search window size decision algorithm for the motion estimation process, based on a motion detection algorithm, is proposed in [14]. The motion detection algorithm is based on coloring over sub-sampled binary images generated by block-wise thresholding of the image difference.
Recently, Ates et al. [15] proposed reusing a limited set of sum of absolute difference (SAD) values to approximate the SAD values of different block sizes. An inevitable consequence of this approximation is a quality drop in terms of PSNR and/or a bit rate increase. For instance, the bit rate increment can be as high as 3.9% for the “Mobile” sequence.
In the paper [16], an early detection algorithm for all-zero integer-transformed blocks in H.264 is proposed. The idea can be briefly explained as follows: if the minimum SAD obtained in the motion estimation stage is low enough according to some derived conditions, there is no need to perform the integer transform and quantization. In this way, computational savings can be obtained. Similarly, [17, 18] propose an adaptive mode decision algorithm that uses the property of all-zero coefficient blocks.
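The idea behind all-zero block detection can be shown with a simplified bound. The sketch below is not the condition derived in [16], which is not reproduced in the text; it assumes a plain dead-zone quantizer with a hypothetical step size. Because every product of two entries of the core transform matrix has magnitude at most 4, each transform coefficient satisfies |Y[i][j]| ≤ 4 · SAD, so a block with 4 · SAD < step is guaranteed to quantize to all zeros under this quantizer.

```python
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def core_transform(d):
    """Y = CF * d * CF^T for a 4x4 residual block d."""
    tmp = [[sum(CF[i][k] * d[k][j] for k in range(4)) for j in range(4)] for i in range(4)]
    return [[sum(tmp[i][k] * CF[j][k] for k in range(4)) for j in range(4)] for i in range(4)]


def quantizes_to_zero(d, step):
    """Reference check: run the transform and test every coefficient against the step."""
    return all(abs(c) < step for row in core_transform(d) for c in row)


def all_zero_by_sad(d, step):
    """Early test on the SAD alone (sufficient, not necessary): 4 * SAD < step."""
    sad = sum(abs(v) for row in d for v in row)
    return 4 * sad < step


small = [[1, 0, 0, 0], [0, -1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
print(all_zero_by_sad(small, 16), quantizes_to_zero(small, 16))
```

Since the SAD is already available from motion estimation, the early test costs one comparison, whereas the reference check needs the full transform.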
Some skip mode decision algorithms have been proposed. In work that originated in [19] and was improved in [20], the authors propose two techniques for H.264 mode decision: making an early decision of SKIP mode in P and B frames, and performing intra coding selectively. They propose to check the SKIP conditions first. If the macroblock satisfies the conditions used in P and B slices, SKIP is selected as its coding mode. If the spatial correlation of the current block is higher than its temporal correlation, the block has a higher probability of being an intra block. The average boundary error between the pixels at the boundary of the current block and those of its adjacent encoded blocks, under the best inter mode, is proposed as an indicator of the degree of spatial correlation. The approach is applied to the main profile of H.264. Later, similar approaches were extended to the high profile of H.264 [21, 22].
Other researchers have also proposed mode decision algorithms. In [23], a mode decision method is proposed as a preprocessing step before motion estimation. The algorithm is based on whether the error surface versus block size is monotonic, that is, whether the current macroblock shows a consistent tendency towards a smaller or a larger block size. If the error surface is not monotonic, all other modes need to be tested. If the error surface is monotonic, only the modes between the best two modes are tested. Because of the coarse relationship to the rate distortion cost, the result is not very promising. The bit rate can increase by as much as 2.83% for the Foreman sequence.
In [24], the proposed approaches are based on both the cost of the motion vector and the information of the previous frame. 1) An SNR based approach is developed to avoid using all the modes to encode each frame. If the average SNR of the previous frame does not fall below the threshold value, the same set of modes is applied to the current frame. Otherwise, the modes are considered unsuitable for encoding and a new set is decided for the current frame. 2) In order to decide the best mode for the current block, a selection criterion based on an adaptive threshold cost is applied. The motion vector cost of the same block in the previous frame is used as the threshold cost for the current block. If the cost falls within a range of the previous cost, the previous set of modes is used. Otherwise, all the possible modes are checked to re-calculate the modes. Nevertheless, because of the inaccurate prediction inherent in this method, the algorithms result in an unacceptable bit rate increase. The bit stream sizes are 5% to 18% larger than those of the reference implementation in all tested sequences.
The paper [25] proposes to categorize the modes into different subsets that depend on the characteristics of the video and the quantization parameter. In particular, the motion compensability of every frame modulates the intra modes, whereas the texture difficulty addresses the inter modes. The approach has the following features: 1) With regard to the selection of the intra mode, the algorithm evaluates the uncompensability of each frame. 2) The inter modes are evaluated by motion estimation and depend on the quantizer level and the texture difficulty of each frame. Since it uses only variance, it cannot differentiate among the different directions within one intra mode. For instance, it can only tell when to use the 4x4 mode, but not the direction inside a 4x4 intra block. Thus, the bit rate increment can be as high as 10% and the PSNR drop can be as high as 0.30 dB.
A recent JVT proposal [26], which originated from [27], provides an approach that combines fast intra mode decision and fast inter mode decision. The approach skips intra mode decision if the selected inter mode is good enough according to some criteria.
Dai et al. [28] propose a fast inter prediction mode decision for H.264. Firstly, it down-samples the original image to a smaller image of half the original resolution. Then it pre-encodes the small image with the prediction mode selection used in H.264 encoding and obtains the prediction mode of each 8x8 block. The problem with this approach is that down-sampling is a time consuming process. Thus, it will be difficult to apply the approach to real-time applications.
The paper [29] presents an inter mode decision method that depends on the absolute differences between consecutive frames. By comparing the SAD with a threshold, the mode classes are decided. However, instead of SAD, the cost function, which is a weighted sum of both SAD and motion vectors, is used as the measurement.
In [30], information from previously coded MBs, such as distortion, mode and residue, is used to determine which modes can be eliminated with little loss in coding efficiency. Its emphasis is more on transcoding from MPEG-2 bit streams to H.264 bit streams.
The paper [31] classifies MBs into two classes: high probability MBs (SKIP, INTER 16x16, INTER 8x8) and low probability MBs (others). The probability of the current MB is predicted from its neighbors (left, upper left, upper) and the co-located MB in the previous frame. If the predicted probability is lower than one minus the sum probability of SKIP, INTER 16x16 and INTER 8x8, the MB is determined to be a low probability MB. If the predicted probability is higher than the minimum of the sum probability of SKIP, INTER 16x16 and INTER 8x8, the MB is determined to be a high probability MB. Otherwise, further classification is done. However, the approach does not consider the dynamics existing in video sequences, and thus it is not suitable for scenarios where the motion changes between slow and fast within one video sequence.
The work in [32] proposes an inter mode decision scheme for P slices. It initially exploits neighborhood information jointly with a set of SKIP mode conditions for enhanced skip mode decision. It subsequently performs inter mode decision for the remaining macroblocks by using a gentle set of smoothness constraints.
In [33], the proposed method consists of two steps: decision I and decision II. Decision I makes an early SKIP mode decision among the inter mode classes. Decision II decides whether or not to try encoding in intra mode. If the Rdcost (rate distortion cost) of the best inter mode is less than a threshold, the routine of trying the intra mode can be omitted.
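The control flow of such a two-step scheme can be sketched as follows. This is only a reading of the description above under assumptions: the early SKIP test is passed in as a boolean because its conditions are not given in the text, `threshold` is a hypothetical value, and the intra cost is supplied as a callable so that the expensive evaluation happens only when Decision II requires it.

```python
def two_step_decision(skip_ok, inter_costs, intra_cost_fn, threshold):
    """Return (chosen_mode, intra_was_evaluated).

    skip_ok       -- outcome of the early SKIP test (Decision I); the test itself is not modeled
    inter_costs   -- dict mapping inter mode name -> RD cost
    intra_cost_fn -- callable returning the best intra RD cost, evaluated lazily
    threshold     -- hypothetical RD cost threshold used in Decision II
    """
    if skip_ok:
        return "SKIP", False
    best_inter = min(inter_costs, key=inter_costs.get)
    if inter_costs[best_inter] < threshold:
        # Best inter mode is already good enough: intra is not tried at all.
        return best_inter, False
    intra_cost = intra_cost_fn()
    if intra_cost < inter_costs[best_inter]:
        return "INTRA", True
    return best_inter, True


print(two_step_decision(False, {"16x16": 900, "8x8": 950}, lambda: 500, threshold=1000))
print(two_step_decision(False, {"16x16": 1200, "8x8": 1500}, lambda: 500, threshold=1000))
```

The first call illustrates the trade-off of the threshold: intra would have been cheaper (500), but it is never evaluated, which is where both the time saving and the possible bit rate increase come from.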
The authors of [34] propose a pruned mode decision method consisting of three steps. Decision I checks whether the best mode is SKIP. Decision II selects one class between Class16={SKIP, 16x16, 16x8, 8x16} and Class8={P8x8}. Decision III decides
Trang 25Paper [35] proposes a fast multi-block selection scheme focusing on 16x16, 16x8, 8x16 and 8x8 modes It is a bottom-up approach in the sense that firstly motion estimation is applied to mode 8x8 block type, secondly to mode 16x16 and finally to mode 16x8 and mode 8x16 The determination is made based on the motion vectors However, the approach is compared with full search method of H.264 From the algorithm point of view, the fast block selection method is not compatible with other FME methods
Cheng and Chang [36] present a fast three-step algorithm for 4x4 intra prediction in H.264. Motivated by the strong correlation of RD cost among different prediction directions, the algorithm systematically explores the directions adjacent to the one with the minimum cost and skips other unlikely directions. Instead of the 9 modes required by the full search method, 6 modes are examined to determine the prediction mode. Because of the inaccurate information it uses, the time saving is not very high.
Choi et al. [37] propose a fast mode decision scheme in which an early decision is possible for the inter macroblock mode and the routine of computing the RD cost is omitted for the intra modes whenever possible. In order to decide the inter macroblock mode early, the inter modes are grouped into two classes: Class16 and Class8. If Rdcost16 (the rate distortion cost of the 16x16 block) is less than Rdcost8 (the rate distortion cost of the 8x8 block), the mode has a very high probability of belonging to Class16. With regard to intra coding, the Rdcost computation for the intra modes is omitted if the minimum Rdcost of an inter mode is below a threshold. Due to the rough relationship among Rdcosts under high complexity mode, the results show a relatively high bit rate increase.
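The two rules attributed to [37] can be sketched in a few lines of Python. This is only an illustration of the description above, not the authors' implementation: the function names, the example costs and the threshold value are all hypothetical.

```python
def choose_inter_class(rdcost16: float, rdcost8: float) -> str:
    """Select the class of inter modes to search, as described for [37]."""
    # A lower 16x16 cost suggests the best mode lies in Class16.
    return "Class16" if rdcost16 < rdcost8 else "Class8"


def should_try_intra(min_inter_rdcost: float, threshold: float) -> bool:
    """Intra RD evaluation is skipped when the best inter cost is below the threshold."""
    return min_inter_rdcost >= threshold


# Hypothetical costs, for illustration only.
print(choose_inter_class(1200.0, 1500.0))
print(should_try_intra(900.0, threshold=1000.0))
```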
The method proposed in [38] only needs to analyze a subset of the seven modes by using spatiotemporal predictions from neighboring blocks. The coding modes of five neighboring blocks are used for prediction: the left, upper, upper left and upper right blocks of the current block, and the block at the same location in the previous frame.
The method further analyzes the reliability of each predicted mode of each inter-block before using the predicted mode for encoding. The MV variance within an MB and the magnitude of the MV difference are used to evaluate the reliability of the neighboring prediction information. The results are achieved by jointly using low level programming techniques such as multi-media extension (MMX).
Kim and Altunbasak propose an algorithm [39] for fast coding mode selection in H.264 that reduces the number of candidate modes. An RD optimal MB mode can be selected with high probability by examining only some of the most probable candidate modes. The problem with the approach is that the most probable modes are fixed and do not adapt to the input sequences.
Garg et al. [40] propose an approach which finds the optimal intra prediction mode for a 4x4 luminance block by investigating six possible modes in the worst case and two modes in the best case. The algorithm is based on the statistical behavior of the best-selected modes over a variety of sequences. The criterion using the RD cost is simple, but has the side effect that the improvement achieved is not very significant.
Han and Lee [41] propose a block matching order for fast mode decision. The algorithm skips variable block size motion estimation/compensation and spatial-predictive coding in the H.264 video coding standard. The RD cost is compared with its mean value, which results in relatively high inaccuracy.
[42] proposes an algorithm adopting a multi-stage sequential mode decision process that uses joint spatial and transform domain features to filter out unlikely candidate modes. Based on the multi-stage mode decision concept, the algorithm computes low cost features and checks, at each stage, whether the decision process should proceed to the next step or can be terminated early with the most probable mode.
The authors of [43] propose a two-step fast intra prediction mode selection algorithm. In the first step, a coarse-level decision is made to split all possible candidate modes into two groups: the group to be examined further and the group to be ignored. In the second step, the proposed algorithm focuses on the group of interest, and considers an RD model for the final decision.
A fast H.264 intra-prediction mode selection scheme is proposed in [44]. The proposed method uses spatial and transform domain features of the target block jointly to filter out the majority of candidate modes. This is justified by examining the posterior error probability and the average rate-distortion loss. For the final mode selection, either the feature-based or the RDO-based method is applied to 2-3 candidate modes. The scheme consists of the following steps: 1) Feature selection: in the spatial domain, the SAD between the true and the predicted pixel values is chosen as the feature; in the transform domain, SATD is chosen. 2) Rank-ordered joint features. 3) Final mode selection.
In [45], a feature-based fast intra/inter mode decision method is proposed to reduce the encoder complexity of the H.264 video coding standard. The main idea is to decide the mode using the expected risk of choosing the wrong mode in a multidimensional feature space. The proposed algorithm calculates three features and maps them into one of three regions, namely the risk-free, risk-tolerable and risk-intolerable regions. Depending on the mapped region, algorithms of different complexities can be applied for the final mode decision.
2.8 Summary
An overview of the H.264 standard has been given in this chapter and some of the important parts have been emphasized. Specifically, these include the architectures of both the encoder and the decoder, the inter and intra mode decision schemes, and motion estimation. The relevant research efforts in the literature on reducing encoder complexity have been summarized and evaluated. Different from others, this thesis provides novel directions and approaches for reducing the H.264 encoding complexity, and they successfully achieve the set targets. The approaches will be described and analyzed in detail in the subsequent chapters.
CHAPTER 3
FAST INTRA MODE DECISION FOR H.264
In this chapter, a fast mode decision algorithm based on local edge information is presented for H.264 intra prediction to reduce the amount of calculation in intra prediction. This method is based on the observation that the pixels along the direction of a local edge are normally of similar values (this is true for both luminance and chrominance components). Therefore, a good prediction can be achieved if the pixels are predicted from those neighboring pixels lying in the same direction as the edge.
To this end, an edge map that represents the local edge orientation and strength is created, and a local edge direction histogram is then established for each sub-block. Based on the distribution of the edge direction histogram and the concept of majority voting, only a small number of prediction modes are chosen for RDO calculation during intra prediction. Experimental results show that the fast mode decision algorithms increase the speed of intra coding significantly.
The rest of the chapter is organized as follows. Section 3.1 gives an overview of intra coding in H.264. Section 3.2 presents in detail the fast intra prediction algorithm based on the edge direction histogram. Experimental results are presented in Section 3.3 and a summary in Section 3.4.
3.1 Overview of Intra Coding in H.264
Intra coding refers to the case where only spatial redundancies within a video picture [1] are exploited. The resulting picture is referred to as an I-picture. Traditionally, I-pictures are encoded by directly applying the transform to all macroblocks in the picture, which generates a much larger number of data bits than inter coding. In order to increase the efficiency of intra coding, the spatial correlation between adjacent blocks/macroblocks in a given picture is exploited in H.264, i.e., the block/macroblock of interest is predicted from the surrounding blocks/macroblocks according to their directional information. The difference between the actual block/macroblock and its prediction is then coded. With these advanced prediction modes, the performance of intra-frame compression in H.264 is similar to that of the recent still image compression standard, JPEG-2000. H.263 and MPEG-4 Visual also provide intra prediction, but only in the frequency domain at the macroblock level.
Intra coded macroblocks may use either 16×16 or 4×4 spatial prediction modes for the luminance (luma) components. Four sub-modes are available with 16×16 prediction. A 16×16 macroblock can be predicted from the previously decoded adjacent pixels that are available due to the raster order (from the top-left with left-to-right swaths) decoding of macroblocks: vertical prediction from the pixels above, horizontal prediction from the pixels to the left, DC prediction from the mean of these pixels, and plane prediction by spatial interpolation between these two sets of pixels. Nine sub-modes are available with 4×4 prediction, as shown in Figure 3.1.
Similarly, a 4×4 sub-block can be predicted from the previously decoded adjacent pixels that are available due to the raster order decoding of each 8×8 block within a macroblock, and the nested raster order decoding of each 4×4 sub-block within each 8×8 block. Due to this decoding order, not all of the 4×4 prediction modes have the decoded pixels available in their desired prediction direction. In this case, the closest available decoded pixel is used. For the chrominance (chroma) components, 4 prediction modes, similar to those of 16×16 luma prediction, are applied to the two 8×8 chroma blocks (U and V). Note that the resulting prediction mode for the U and V components should be the same.
Figure 3.1 illustrates the intra prediction for a 4×4 luma block. Note that in this figure, a to p are the pixels to be predicted, and A to I are the neighboring pixels that are available at the time of prediction. If we choose the prediction mode to be 0, then the pixels a, e, i and m are predicted based on the neighboring pixel A; pixels b, f, j and n are predicted based on pixel B, and so on. Besides the 8 directional prediction modes shown in the figure, there is a 9th mode, i.e., the DC prediction mode, or Mode 2 in H.264.
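As a minimal sketch of what Mode 0 and Mode 2 compute for a 4×4 block, the Python below assumes that all neighboring pixels are available. The function names are hypothetical; `above` holds the four pixels directly above the block and `left` the four pixels to its left. The DC rounding, (sum + 4) >> 3, is the one used by H.264 when both neighbor sets are available.

```python
def predict_vertical(above):
    """Mode 0: every row of the 4x4 block copies the four pixels above it."""
    return [list(above[:4]) for _ in range(4)]


def predict_dc(above, left):
    """Mode 2: every pixel takes the rounded mean of the 4 above and 4 left pixels."""
    dc = (sum(above[:4]) + sum(left[:4]) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]


above = [10, 20, 30, 40]   # hypothetical neighbor values
left = [12, 12, 12, 12]
print(predict_vertical(above))
print(predict_dc(above, left))
```

In the vertical case the first column of the result is filled with the first above-pixel, which matches the statement that pixels a, e, i and m are predicted from pixel A.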
H.264 video coding is based on the concept of rate distortion optimization, which means that the encoder has to encode the intra block using all the mode combinations and choose the one that gives the best RDO performance. According to the structure of intra prediction in H.264, the number of mode combinations for luma and chroma blocks in an MB is M8 × (M4 × 16 + M16), where M8, M4 and M16 represent the number of modes for 8×8 chroma blocks, 4×4 luma blocks and 16×16 luma blocks respectively. This means that, for a macroblock, the encoder has to perform 4 × (9 × 16 + 4) = 592 different RDO calculations before the best RDO mode is determined. As a result, the complexity of the encoder is extremely high.
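The count can be checked with a one-line helper. The function name and its parameter names are hypothetical; the default values are the mode counts quoted above.

```python
def rdo_combinations(m8: int = 4, m4: int = 9, m16: int = 4, blocks_4x4: int = 16) -> int:
    """Number of RDO calculations per macroblock: M8 * (M4 * 16 + M16)."""
    return m8 * (m4 * blocks_4x4 + m16)


print(rdo_combinations())  # 592
```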
Figure 3.1 An example of intra prediction
3.2 Determining the Primary Edge Direction in the Image Block
It is observed that the pixels along the direction of a local edge are normally of similar values (this is true for both luma and chroma components). Therefore, a good prediction can be achieved if we predict the pixels using those neighboring pixels that lie in the same direction as the edge. Figure 3.2 shows a few 4×4 edge patterns and their preferred intra prediction directions.
There are a number of ways to obtain the local edge directional information, such as the edge direction histogram, directional fields [46], etc. The algorithm described in this chapter is based on edge detection due to its simplicity in terms of computational complexity. The rest of this section explains in detail the fast intra prediction algorithm that uses an edge direction histogram.
Figure 3.2 Examples of 4×4 edge patterns and their preferred intra prediction directions
3.2.1 Edge Map
In order to obtain the edge information in the neighborhood of the intra block to be predicted, the Sobel edge operator [47] is first applied to the video picture to generate the edge map. Each pixel in the video picture is then associated with an element in the edge map, which is the edge vector containing its edge direction and amplitude.
The Sobel operator has two convolution kernels. One responds to the degree of difference in the vertical direction and the other in the horizontal direction. For a pixel $p_{i,j}$ in a luminance (or chrominance) picture, the corresponding edge vector, $\vec{D}_{i,j} = \{dx_{i,j}, dy_{i,j}\}$, can be defined as

$$
\begin{aligned}
dx_{i,j} &= p_{i-1,j+1} + 2 \times p_{i,j+1} + p_{i+1,j+1} - p_{i-1,j-1} - 2 \times p_{i,j-1} - p_{i+1,j-1} \\
dy_{i,j} &= p_{i+1,j-1} + 2 \times p_{i+1,j} + p_{i+1,j+1} - p_{i-1,j-1} - 2 \times p_{i-1,j} - p_{i-1,j+1}
\end{aligned}
\tag{3.1}
$$
where $dx_{i,j}$ and $dy_{i,j}$ represent the degree of difference in the vertical and horizontal directions respectively. Therefore, the amplitude of the edge vector can be decided by

$$
Amp(\vec{D}_{i,j}) = |dx_{i,j}| + |dy_{i,j}| \tag{3.2}
$$
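In fact, the amplitude could be obtained more accurately using the square root of the sum of the squares of $dx_{i,j}$ and $dy_{i,j}$. However, Equation (3.2) is computationally much more attractive. The direction of the edge (in degrees) is decided by the hyper-function

$$
Ang(\vec{D}_{i,j}) = \frac{180^\circ}{\pi} \times \arctan\left(\frac{dy_{i,j}}{dx_{i,j}}\right), \quad |Ang(\vec{D}_{i,j})| < 90^\circ \tag{3.3}
$$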
In the actual implementation of the algorithm, Equation (3.3) is not necessary, as in H.264 there are only a limited number of directions along which intra prediction can be applied. In fact, a simple thresholding technique is used to build up the edge direction histogram instead.
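The edge vector, its amplitude and its direction can be sketched in Python as follows. This is an illustration under stated assumptions rather than the thesis implementation: it uses the standard 3×3 Sobel kernels, treats `p[i][j]` as row `i`, column `j`, takes `dx` as the difference between columns j+1 and j-1 and `dy` as the difference between rows i+1 and i-1, and returns `None` for the direction when `dx` is zero because the arctangent is then undefined. All function names are hypothetical.

```python
import math


def edge_vector(p, i, j):
    """Sobel responses (dx, dy) at interior pixel (i, j) of picture p."""
    dx = (p[i - 1][j + 1] + 2 * p[i][j + 1] + p[i + 1][j + 1]
          - p[i - 1][j - 1] - 2 * p[i][j - 1] - p[i + 1][j - 1])
    dy = (p[i + 1][j - 1] + 2 * p[i + 1][j] + p[i + 1][j + 1]
          - p[i - 1][j - 1] - 2 * p[i - 1][j] - p[i - 1][j + 1])
    return dx, dy


def amplitude(dx, dy):
    """Cheap amplitude: sum of absolute values instead of a square root."""
    return abs(dx) + abs(dy)


def angle_degrees(dx, dy):
    """Direction in degrees via arctan(dy / dx); None when dx == 0."""
    if dx == 0:
        return None
    return math.degrees(math.atan(dy / dx))


# A picture with a step between columns 1 and 2 (hypothetical values).
picture = [[0, 0, 10, 10] for _ in range(4)]
dx, dy = edge_vector(picture, 1, 1)
print(dx, dy, amplitude(dx, dy), angle_degrees(dx, dy))
```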
[Figure: Edge Direction Histogram]
3.2.2 Edge Direction Histogram
In order to decide whether the image block contains an edge, and how strong this edge is, an edge direction histogram is calculated from all the pixels in the block by summing up the amplitudes of those pixels with similar directions in the block.
3.2.2.1 4×4 luma block edge direction histogram
In the case of a 4×4 luma block, there are 8 directional prediction modes, as shown in Figure 3.1, plus a DC prediction mode. The border between any two adjacent directional prediction modes is the bisector of the two corresponding directions. For example, the border between mode 1 (0°) and mode 8 (26.6°) is the direction at 13.3°, because for Mode 8 (Horizontal-Up) prediction is done at an angle of approximately 26.6° above the horizontal direction. It is important to note that mode 3 and mode 8 are adjacent due to the circular symmetry of the prediction modes. The mode of each pixel is determined by its edge direction $Ang(\vec{D}_{i,j})$.
Therefore, the edge direction histogram of a 4×4 luma block is decided by the following algorithm. For each pixel in a 4×4 luma block, let histo(k), k = 0, 1, …, 8, be the histogram cell of the prediction mode k, and let $\eta = dy_{i,j}/dx_{i,j}$; then

$$
\begin{aligned}
&\text{if } \{[(dx_{i,j} = 0) \text{ and } (dy_{i,j} \neq 0)] \text{ or } (|\eta| > 5.0273)\}: && histo(0) \mathrel{+}= Amp(\vec{D}_{i,j}) \\
&\text{else if } (|\eta| \le 0.1989): && histo(1) \mathrel{+}= Amp(\vec{D}_{i,j}) \\
&\text{else}: && histo(k) \mathrel{+}= Amp(\vec{D}_{i,j})
\end{aligned}
\tag{3.4}
$$

where, in the last case, k is the directional mode (3, 4, 5, 6, 7 or 8) whose interval contains η. The six intervals are delimited by the thresholds ±0.1989, ±0.6682, ±1.4966 and ±5.0273.
Note that Mode 2 is not included in the above algorithm. This is because Mode 2 will always be chosen as one of the candidate modes. Figure 3.3 shows an example of the edge direction histogram.
3.2.2.2 Edge direction histogram for 16 ×16 luma and 8 × 8 chroma block
In the case of 16×16 luma and 8×8 chroma blocks, there are only two directional prediction modes, plus a plane prediction mode and a DC prediction mode. Therefore, the edge direction histogram for this case is based on three directions, i.e., the horizontal, vertical and diagonal (plane) directions, as shown in Figure 3.4.
Figure 3.4 Intra 8×8 and 16×16 prediction mode directions
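The edge direction histogram for 16×16 luma and 8×8 chroma blocks is constructed as follows,

$$
\begin{aligned}
&\text{if } (|\eta| > 2.4142): && histo(0) \mathrel{+}= Amp(\vec{D}_{i,j}) \\
&\text{else if } (|\eta| \le 0.4142): && histo(1) \mathrel{+}= Amp(\vec{D}_{i,j}) \\
&\text{else}: && histo(3) \mathrel{+}= Amp(\vec{D}_{i,j})
\end{aligned}
\tag{3.5}
$$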
For a similar reason, Mode 2 is missing from the above algorithm. An example of such an edge direction histogram is shown in Figure 3.5. Note that for 8×8 chroma blocks a similar equation is applied, except that the order of the mode numbers is different.
As mentioned above, each cell in the edge direction histogram sums up the amplitudes of those pixels with similar edge directions in the block. Alternatively, the edge direction histogram can count the number of pixels with similar edge directions.
Therefore, based on the principle of majority voting, the cell k with the maximum amplitude indicates that there is a strong edge along this direction in the image block, and this direction is considered the preferable prediction direction. The mode whose direction corresponds to this k is chosen as the primary prediction mode.
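The histogram construction for a 16×16 luma block and the majority vote can be sketched together in Python. This is a hedged illustration, not the thesis code: the thresholds are assumed to be tan 22.5° ≈ 0.4142 and tan 67.5° ≈ 2.4142, a zero `dx` with non-zero `dy` is treated as vertical by analogy with the 4×4 case, pixels with zero amplitude are skipped, and ties in the vote go to the first cell in dictionary order. The function and constant names are hypothetical.

```python
T_LOW, T_HIGH = 0.4142, 2.4142          # assumed: tan(22.5 deg), tan(67.5 deg)
VERTICAL, HORIZONTAL, PLANE = 0, 1, 3   # 16x16 luma mode numbers (DC is Mode 2)


def histogram_16x16(edge_vectors):
    """Accumulate amplitudes of (dx, dy) edge vectors into three direction cells."""
    histo = {VERTICAL: 0, HORIZONTAL: 0, PLANE: 0}
    for dx, dy in edge_vectors:
        amp = abs(dx) + abs(dy)
        if amp == 0:
            continue  # flat pixel: contributes to no direction
        if dx == 0 or abs(dy / dx) > T_HIGH:
            histo[VERTICAL] += amp
        elif abs(dy / dx) <= T_LOW:
            histo[HORIZONTAL] += amp
        else:
            histo[PLANE] += amp
    return histo


def primary_mode(histo):
    """Majority vote: the mode whose cell has the largest accumulated amplitude."""
    return max(histo, key=histo.get)


vectors = [(0, 5), (10, 1), (4, 4), (0, 0)]  # hypothetical edge vectors
histo = histogram_16x16(vectors)
print(histo, primary_mode(histo))
```

For 8×8 chroma blocks the same accumulation would apply with the chroma mode numbering substituted for the three constants.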
3.3 Mode Decision for Intra Prediction
Based on the primary prediction mode determined previously, the fast mode decision algorithms for intra prediction select a small number of prediction modes as the candidate prediction modes for RDO computation. It should be noted that, because the actual RDO computation in H.264 intra coding is based on the reconstructed images while the edge direction histogram is calculated from the original lossless images, the primary prediction mode decided above will not always be the best RDO mode in actual coding. Thus a number of ways of deciding the number of preferred modes have been tried, as detailed in the following.
Method 1: The mode with the maximum amplitude in the edge direction histogram is chosen as the candidate prediction mode; if this amplitude is below a predefined threshold, the prediction mode is chosen as DC.
Method 2: This method simply takes the DC mode as a candidate mode in addition to the primary prediction mode. This eliminates the side effect of Method 1, in which a given threshold leads to different performance on different sequences.
Method 3: During the experiments with Method 2, it was observed that, besides the primary prediction mode, the best RDO mode is always one of the two modes adjacent (in terms of direction) to the primary prediction mode selected by Method 2. Therefore, two additional candidate prediction modes are fixed to be the two neighbors of the primary prediction mode in terms of direction (refer to Figure 3.1).
Method 4: This method adds further information to Method 3. The window of the histogram computation is enlarged by including the pixels in the left column and upper row of the block of interest. This is due to the fact that a block of interest is predicted from the pixels above and/or to the left of the block.
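The difference between the first two methods can be sketched as follows. This is an illustration only: the function names and the example histogram are hypothetical, and the histogram is assumed to be a mapping from mode number to accumulated amplitude.

```python
DC_MODE = 2


def candidates_method1(histo, threshold):
    """Method 1: the primary mode alone, or DC when its amplitude is below the threshold."""
    primary = max(histo, key=histo.get)
    return [primary] if histo[primary] >= threshold else [DC_MODE]


def candidates_method2(histo):
    """Method 2: the primary mode plus DC, with no threshold involved."""
    primary = max(histo, key=histo.get)
    return [primary, DC_MODE]


histo = {0: 5, 1: 11, 3: 8}  # hypothetical histogram
print(candidates_method1(histo, threshold=10))
print(candidates_method1(histo, threshold=20))
print(candidates_method2(histo))
```

Method 2 costs one extra RDO calculation per block but removes the sequence-dependent threshold.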
Experimental results have shown that Method 3 achieves a good balance between computational time and coding efficiency, and the rest of this section describes the detailed implementation of this algorithm. However, the experiment section still presents a comparison among all the methods.
3.3.1 4x4 Luma Block Prediction Modes
In the edge detection based approach, the histogram cell with the maximum amplitude is the best candidate for intra prediction. In the case that all the cells have similar amplitudes, DC mode is a better choice, so an amplitude threshold is needed to decide whether the intra block exhibits a strong edge or is just a flat region. However, it is difficult to pre-define a universal threshold that suits different block contexts and different video sequences. Therefore, DC mode is always chosen as the second candidate in the RDO operation.
Extensive experiments also show that the best RDO mode is, besides the primary prediction mode, one of the two modes adjacent (in terms of direction) to the primary prediction mode selected by the proposed algorithm. The main cause of this phenomenon is that H.264 RDO is based on the reconstructed lossy intra images, while the edge direction histogram is calculated from the original lossless images. Therefore, two additional candidate prediction modes are fixed to be the two neighbors of the primary prediction mode in terms of direction. For example, if the primary prediction mode is Mode 1, then the two additional candidate prediction modes are Mode 8 and Mode 6. Note that Mode 8 and Mode 3 are adjacent modes in terms of direction due to the symmetry of the circle.
In summary, in 4×4 luma block intra coding, the histogram cell with the maximum amplitude and its two adjacent cells, plus DC mode, are chosen to take part in the RDO calculation. Therefore, for each 4×4 luma block, only four RDO calculations are performed instead of nine.
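The candidate selection for a 4×4 luma block can be sketched in Python. The circular ordering of the eight directional modes used here (8, 1, 6, 4, 5, 0, 7, 3) is an assumption taken from the H.264 direction diagram; it agrees with the examples in the text (Mode 1 has neighbors 8 and 6, and Modes 8 and 3 are adjacent). The names `DIRECTION_ORDER` and `candidate_modes_4x4` are hypothetical.

```python
# Directional 4x4 modes arranged around the circle of prediction directions (assumed order).
DIRECTION_ORDER = [8, 1, 6, 4, 5, 0, 7, 3]
DC_MODE = 2


def candidate_modes_4x4(primary):
    """Primary mode, its two directional neighbors, and DC: four RDO candidates."""
    k = DIRECTION_ORDER.index(primary)
    n = len(DIRECTION_ORDER)
    before = DIRECTION_ORDER[(k - 1) % n]  # wraps around, so 8 and 3 are neighbors
    after = DIRECTION_ORDER[(k + 1) % n]
    return [primary, before, after, DC_MODE]


print(candidate_modes_4x4(1))  # [1, 8, 6, 2]
print(candidate_modes_4x4(8))  # [8, 3, 1, 2]
```

The modulo indexing is what implements the circular symmetry, so no special case is needed at either end of the list.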
3.3.2 16x16 Luma Block Prediction Modes
Based on the same observation above, only the primary prediction mode decided
by edge direction histogram is considered as a candidate of best prediction mode, and