A High Performance VLSI Architecture for Integer Motion Estimation in HEVC
Xu Yuan1, Liu Jinsong1, Gong Liwei1, Zhang Zhi1, Robert K. F. Teng1,2
Shenzhen Key Lab of Advanced Communication and Information Processing
Abstract
A high performance VLSI architecture for integer motion estimation (IME) in High Efficiency Video Coding (HEVC) is presented in this paper. It supports the coding tree block (CTB) structure with the asymmetric motion partition (AMP) mode. The architecture contains two parallel sub-architectures to meet the requirements of real-time 1080p@30fps video coding. The CTB size L×L is set to L=32 pixels by default and can be extended to L=64 or L=16 pixels. A serial mode decision module that finds the optimal partition mode has also been implemented.
1 Introduction
High Efficiency Video Coding (HEVC) is the recent video coding standard of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) standardization organizations [1]. Bit-rate reduction at equal perceptual video quality has been demonstrated in HM10.0. Compared to previous video coding standards, HEVC introduces many new concepts, such as the quadtree structure and the asymmetric motion partition (AMP) [2] in integer motion estimation (IME), resulting in higher coding efficiency and more design complexity. IME is the critical part of a video coding design because of its high memory bandwidth, high hardware cost and complex control logic. Therefore, a high performance IME architecture is important for the HEVC encoder. Many IME VLSI architectures have been studied for various standards (e.g. H.264): Cao Wei et al. proposed a reconfigurable architecture for VBSME in H.264 with a memory partition scheme [3]; G.A. Ruiz et al. proposed an efficient VLSI processor including a Lagrangian cost module and a mode decision module [4]; Tuan et al. defined four levels of data-reuse scheme according to memory situations [5]. However, few architectures targeting IME in HEVC have been reported so far.
This paper studies a parallel VLSI architecture for IME in HEVC. The architecture supports the AMP mode and targets high resolution applications. A serial mode decision module that finds the optimal partition mode has also been implemented.
2 HEVC Motion Estimation Theory
The architecture studied in this paper is built on the motion estimation theory of HEVC. The coding object in HEVC is the CTB. Its size can be represented as L×L (L=16, 32, 64), whereas the traditional macroblock size is 16×16. A CTB is either kept as one coding block (CB) or further partitioned into multiple CBs according to a quadtree, as shown in figure 1. The root of the quadtree is the CTB. The size of a CB can be represented as M×M (M=8, 16, 32, 64). Figure 1(b) shows the trellis diagram corresponding to the quadtree decomposition; smaller CBs are typically distributed around object boundaries. For inter prediction, a CB is further partitioned into prediction blocks (PBs) through three types of modes, as shown in figure 2: two square modes, M×M and M/2×M/2; two symmetric modes, M/2×M and M×M/2; and four asymmetric modes, M/4×M (L), M/4×M (R), M×M/4 (U) and M×M/4 (D). M denotes the size of the corresponding parent CB.
978-1-4673-6417-1/13/$31.00 ©2013 IEEE
Figure 1 Example of quadtree structure (a) Quadtree structure (b) Corresponding trellis diagram (figure label: Object boundary)
Figure 2 Modes for splitting a CB into PBs
Meanwhile, three constraints must be complied with:
(1) The asymmetric motion partition mode is turned off
when M=8,
(2) To reduce the memory bandwidth, 4×4 PBs
are not allowed for inter prediction,
(3) 4×8 PBs and 8×4 PBs are only adopted in
uni-predictive coding.
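The partition modes and the three constraints above can be sketched as a small enumeration. This is only an illustrative sketch, not part of the proposed hardware: the function names `pb_modes` and `uni_predictive_only` are hypothetical, and the sizes of the two PBs in each asymmetric mode (M/4 and 3M/4 along the split direction) follow the usual HEVC AMP convention rather than anything stated explicitly in this section.

```python
def pb_modes(m):
    """Return {mode name: [(width, height), ...]} for an M x M coding block.

    Hypothetical helper: mode names mirror the notation used in the text.
    """
    if m not in (8, 16, 32, 64):
        raise ValueError("CB size must be 8, 16, 32 or 64")
    half, quarter = m // 2, m // 4
    modes = {
        "MxM": [(m, m)],                    # square, unsplit
        "M/2xM/2": [(half, half)] * 4,      # square, four PBs
        "M/2xM": [(half, m)] * 2,           # symmetric, vertical split
        "MxM/2": [(m, half)] * 2,           # symmetric, horizontal split
    }
    if m > 8:  # constraint (1): AMP is turned off when M = 8
        modes.update({
            "M/4xM (L)": [(quarter, m), (m - quarter, m)],
            "M/4xM (R)": [(m - quarter, m), (quarter, m)],
            "MxM/4 (U)": [(m, quarter), (m, m - quarter)],
            "MxM/4 (D)": [(m, m - quarter), (m, quarter)],
        })
    # constraint (2): no mode may contain a 4x4 PB
    return {name: pbs for name, pbs in modes.items() if (4, 4) not in pbs}


def uni_predictive_only(pb):
    """Constraint (3): 4x8 and 8x4 PBs are restricted to uni-prediction."""
    return pb in ((4, 8), (8, 4))
```

Under these assumptions a 16×16 CB offers all eight modes, while an 8×8 CB keeps only M×M, M/2×M and M×M/2.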
3 VLSI Architecture
Based on the HEVC motion estimation theory introduced above, a high performance VLSI architecture for integer motion estimation has been studied. The top level of the architecture is shown in figure 3. The on-chip RAM holds the current CTB and the search area. Once the size of the CTB is decided, the complexity of IME is also determined. When the size of the CTB is 64×64, a quarter down-sampling module is used to reduce the number of search points. This strategy strikes a good balance between hardware resources and compression quality. Other schemes, such as the full search algorithm and the Level D data-reuse scheme [5], have also been used in the architecture design. The full search algorithm has the advantages of computational regularity and excellent output video quality. The Level D scheme reuses pixel data across the entire search window strips of consecutive current blocks.
Figure 3 Top-level of design (blocks shown: DDR with reference frame and current frame, AXI Interconnect, on-chip RAM with search area and current CTB, Shift_Regs, two PE arrays, Best SADs, Mode Decision)
A 32×33-pixel 2D three-direction shift register array is proposed to improve data fetching efficiency. Since two processing element (PE) arrays are used in parallel, two consecutive reference candidate blocks are stored in the shift register array. When the next two reference candidates are pushed into the shift register array, 32×2 pixels are updated, whereas the other 32×31 pixels are reused. In order to eliminate bubble clock cycles, a column of 33 pixels is added to the array, so the array size becomes 33×33 pixels. The search window can be scanned in three directions: upward, downward, and left to right. When the direction is left to right, the shift register array uses one cycle to update the requested 33 pixels. For the typical search range [-24, 23], it takes (24+24)/2=24 clock cycles to finish data matching for one column and one clock cycle to shift to another column. In total, 24×48=1152 clock cycles are needed to process one CTB.
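The cycle count and the pixel reuse quoted above can be checked with a few lines of arithmetic. This is a rough sketch under the paper's stated numbers; the function names and parameters are illustrative only, and the column shift is assumed to be hidden by the extra register column, as the 24×48 figure implies.

```python
def ime_cycles_per_ctb(lo=-24, hi=23, pe_arrays=2):
    """Clock cycles to scan the full search range for one CTB."""
    positions = hi - lo + 1                        # 48 displacements per axis
    # Each cycle evaluates one candidate per PE array (ceiling division).
    cycles_per_column = -(-positions // pe_arrays)
    columns = positions
    return cycles_per_column * columns


def shift_register_reuse_ratio(rows=32, cols=33, updated_cols=2):
    """Fraction of the 32x33 register array reused when two new candidates arrive."""
    reused = rows * (cols - updated_cols)          # 32 x 31 pixels
    return reused / (rows * cols)
```

With the default arguments `ime_cycles_per_ctb()` reproduces the 1152 cycles stated in the text, and about 94% of the register contents are reused on each vertical step.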
A processing element (PE) calculates the difference of a pair of pixels, which is used to form the sums of absolute differences (SADs) of the PBs. The size of the PB decides the number of PEs. One PE array includes 1024 PEs. Two PE arrays can concurrently calculate 2×145 SADs in one clock cycle. The contents of the 145 SADs are shown in table 1.
In order to implement the CTB quadtree partition, each PE array is divided into 16×16 execution units (16×16 EUs), as shown in figure 4(a). Each 16×16 EU is divided into sixteen 4×4 EUs, as shown in figure 4(b). When the CTB size is 64×64, quarter down-sampling is performed first to reduce the number of SADs from 593 to 145, trading some precision for performance.
Table 1 Content of 145 SADs
Number
Number
Note: * represents AMP mode
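The totals of 145 and 593 SADs can be reproduced by counting PBs over the quadtree. The counting rule below is an assumption made for illustration, not taken from Table 1: every CB contributes one M×M SAD, two SADs for each symmetric mode and, when M>8, two SADs for each of the four AMP modes, while the M/2×M/2 mode is covered by the M×M SADs of the next quadtree level.

```python
def sads_per_cb(m):
    """Number of distinct PB SADs contributed by one M x M coding block."""
    count = 1 + 2 + 2          # MxM, M/2xM (two PBs), MxM/2 (two PBs)
    if m > 8:
        count += 4 * 2         # four AMP modes, two PBs each
    return count


def sads_per_ctb(l, min_cb=8):
    """Total SADs for an L x L CTB, summed over all quadtree levels."""
    total, m, n_cbs = 0, l, 1
    while m >= min_cb:
        total += n_cbs * sads_per_cb(m)
        m //= 2                # next level: CBs of half the size ...
        n_cbs *= 4             # ... and four times as many of them
    return total
```

Under this rule a 32×32 CTB gives 13 + 4·13 + 16·5 = 145 SADs and a 64×64 CTB gives 593, matching the figures in the text.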
Figure 4 The design architecture (a) PE array architecture (b) 16×16 EU architecture (figure labels: reference data, current data, EUs, SAD outputs)
Hierarchical adder trees are built to help each PE array generate 145 SADs for each shift operation, as shown in figure 4(a) and 4(b). The best 145 SADs selected by the comparator modules are sent to the mode decision module.
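The idea of a hierarchical adder tree can be illustrated in software: SADs of the smallest 4×4 units are computed once and then summed in 2×2 groups to obtain the SADs of larger square blocks. This is a behavioural sketch only, with hypothetical function names; it does not model the timing or the exact tree structure of the hardware.

```python
def sad_4x4_grid(cur, ref):
    """SADs of all 4x4 blocks of two equally sized square pixel blocks."""
    n = len(cur)
    g = n // 4
    grid = [[0] * g for _ in range(g)]
    for y in range(n):
        for x in range(n):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def merge_2x2(grid):
    """One adder-tree level: sum every 2x2 group of child SADs."""
    g = len(grid) // 2
    return [[grid[2 * y][2 * x] + grid[2 * y][2 * x + 1]
             + grid[2 * y + 1][2 * x] + grid[2 * y + 1][2 * x + 1]
             for x in range(g)] for y in range(g)]


def sad_pyramid(cur, ref):
    """Map block size (4, 8, 16, ...) to the grid of SADs at that size."""
    levels = {4: sad_4x4_grid(cur, ref)}
    size = 4
    while len(levels[size]) > 1:
        levels[size * 2] = merge_2x2(levels[size])
        size *= 2
    return levels
```

Rectangular and asymmetric PB SADs can be formed in the same way by adding the appropriate neighbouring entries of a lower level, so no absolute difference is computed twice.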
The mode decision module proposed in this paper is based on the structure presented by G.A. Ruiz [4]. The improved circuit is shown in figure 5. Three extra adders are used to sum up the costs of the four 16×16 blocks in order to eliminate the bubble clock cycles of shifting data, and sixteen registers store the blocks' costs to meet the requirements of HEVC.
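A quadtree mode decision of this kind can be sketched as a comparison between the best cost of a CB coded as a whole and the summed best costs of its four children. The sketch below is an assumption about the general principle, not a description of the circuit in figure 5; the function name `decide_cb` and all cost values are hypothetical.

```python
def decide_cb(mode_costs, children=None):
    """Pick the cheaper of 'best mode of this CB' and 'split into four CBs'.

    mode_costs: {mode name: cost} for the CB coded as a whole.
    children:   optional list of four (cost, decision) results of the sub-CBs.
    Returns (cost, decision).
    """
    best_mode = min(mode_costs, key=mode_costs.get)
    own_cost = mode_costs[best_mode]
    if children:
        # Summing four child costs takes three additions.
        split_cost = sum(cost for cost, _ in children)
        if split_cost < own_cost:
            return split_cost, {"split": [d for _, d in children]}
    return own_cost, {"mode": best_mode}


# Usage with made-up costs: four 16x16 CBs under one 32x32 CB.
leaves = [decide_cb({"MxM": c}) for c in (10, 12, 9, 11)]
cost, decision = decide_cb({"MxM": 45, "M/2xM": 47}, leaves)
```

In this made-up example the four children cost 42 in total, which is below the best whole-block cost of 45, so the CB is split.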
Figure 5 Mode Decision architecture (figure labels: Accumulator, COMPARATOR, reg, COST)
4 Results and Performance Analysis
The proposed architecture has been implemented in Verilog, then simulated and verified with ModelSim. The Verilog code has been synthesized, placed and routed for a Xilinx Virtex-6 XC6VLX-550T using the Xilinx ISE tool. The synthesis results are given in table 2.
Table 2 Resource utilization of the FPGA
Resource        Utilization   Percentage
Slice logic     55346         16%
Slice register  19744         2.9%
BRAM            148 kB        5.2%
The performance of the IME is related to the size of the search window. The default range of displacement is [-24, 23]. It takes 1152 clock cycles to process one CTB. Taking the initialization clock cycles into account, the architecture can process over 70K CTBs per second at a system clock of 110 MHz. It can therefore meet the requirement of 1080p@30fps video, which needs 61K CTBs per second.
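The throughput requirement can be verified with a short calculation. The sketch below assumes 32×32 CTBs and a 1920×1080 frame whose partial bottom row of CTBs is rounded up; the function names are illustrative. The ideal rate ignores initialization cycles, so it is an upper bound on the figure of over 70K CTBs per second reported above.

```python
import math


def required_ctbs_per_second(width, height, fps, ctb):
    """CTBs that must be processed per second for real-time encoding."""
    per_frame = math.ceil(width / ctb) * math.ceil(height / ctb)
    return per_frame * fps


def ideal_ctbs_per_second(clock_hz, cycles_per_ctb):
    """Upper bound on throughput, ignoring initialization cycles."""
    return clock_hz / cycles_per_ctb


required = required_ctbs_per_second(1920, 1080, 30, 32)   # 60 * 34 * 30
ideal = ideal_ctbs_per_second(110e6, 1152)
```

Here `required` evaluates to 61200, consistent with the 61K stated in the text, and `ideal` is roughly 95K CTBs per second.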
5 Conclusion
A high performance parallel VLSI architecture for integer motion estimation has been studied. It can meet real-time HEVC encoding requirements for 1080p@30fps video. The architecture has been implemented on a Virtex-6 XC6VLX-550T with a 110 MHz system clock. In an ASIC implementation, the number of parallel PE arrays could be reduced as the system clock frequency increases.
Reference
[1] B. Bross, et al. “High Efficiency Video Coding (HEVC) Text Specification Draft 9”, document JCTVC-K1003, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC), (Oct. 2012)
[2] Gary J. Sullivan, et al. “Overview of the High Efficiency Video Coding (HEVC) Standard”, IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1648–1667, (Dec. 2012)
[3] Cao Wei, et al. “A High-Performance Reconfigurable VLSI Architecture for VBSME in H.264”, IEEE Trans. Consumer Electron., vol. 54, no. 3, pp. 1338–1345, (Aug. 2008)
[4] G.A. Ruiz and J.A. Michell, “An efficient VLSI processor chip for variable block size integer motion estimation in H.264/AVC”, Signal Processing: Image Communication, vol. 26, pp. 289–303, (July 2011)
[5] Tuan J-C, et al. “On the data reuse and memory bandwidth analysis of full-search block-matching VLSI architecture”, IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 1, pp. 61–72, (2002)