A high performance VLSI architecture for integer motion estimation in HEVC



XuYuan1, Liu Jinsong1, Gong Liwei1, Zhang Zhi1, Robert K F Teng1,2

Shenzhen Key Lab of Advanced Communication and Information Processing

Abstract

A high performance VLSI architecture for integer motion estimation (IME) in High Efficiency Video Coding (HEVC) is presented in this paper. It supports the coding tree block (CTB) structure with the asymmetric motion partition (AMP) mode. The architecture contains two parallel sub-architectures to meet the requirement of 1080p@30fps real-time video coding. The CTB size L×L in the architecture is set to L=32 pixels by default, and it can be extended to L=64 and L=16 pixels. A serial mode decision module that finds the optimal partition mode for the architecture has also been implemented.

1 Introduction

High Efficiency Video Coding (HEVC) is the recent video coding standard of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) standardization organizations [1]. Bit-rate reduction at equal perceptual video quality has been demonstrated in the HM10.0. Compared to previous video coding standards, HEVC has many new concepts, such as the quadtree structure and asymmetric motion partition (AMP) [2] in integer motion estimation (IME), resulting in higher coding efficiency and more design complexity. The IME is the critical part of video coding design because of its high memory bandwidth, high hardware cost and complex control logic. Therefore, a high performance IME architecture is important for the HEVC encoder. Many IME VLSI architectures have been studied for various standards (e.g. H.264): Cao Wei et al. proposed a reconfigurable architecture for variable block size motion estimation (VBSME) in H.264 with a memory partition scheme [3]; G.A. Ruiz et al. proposed an efficient VLSI processor including a Lagrangian cost module and a mode decision module [4]; Tuan et al. defined four levels of data-reuse schemes according to memory situations [5]. However, few architectures targeting IME in HEVC have been reported so far.

This paper studies a parallel VLSI architecture for IME in HEVC. The structure supports the AMP mode and targets high resolution applications. A serial mode decision module that finds the optimal partition mode has also been implemented.

2 HEVC Motion Estimation Theory

The VLSI architecture studied in this paper is based on the motion estimation theory of HEVC. The coding object in HEVC is the CTB. Its size can be represented as L×L (L=16, 32, 64), while the traditional macroblock size is 16×16. A CTB is either kept as one coding block (CB) or further partitioned into several CBs according to a quadtree, as shown in figure 1. The root of the quadtree is the CTB.

The size of CBs can be represented as M×M (M=8, 16, 32, 64). Figure 1(b) shows the trellis diagram corresponding to the quadtree decomposition; smaller CBs are typically distributed around the object boundary. A CB is further partitioned into prediction blocks (PBs) through three types of modes for inter prediction, as shown in figure 2. They are two square modes, M×M and M/2×M/2; two symmetric modes, M/2×M and M×M/2; and four asymmetric modes, M/4×M (L), M/4×M (R), M×M/4 (U) and M×M/4 (D). M denotes the size of the corresponding parent CB.

Figure 1 Example of quadtree structure (a) Quadtree structure (b) Corresponding trellis diagram (figure label: Object boundary)

Figure 2 Modes for splitting a CB into PBs

Meanwhile, three constraints must be complied with:

(1) The asymmetric motion partition mode is turned off when M=8,

(2) To reduce the memory bandwidth, 4×4 PBs are not allowed for inter prediction,

(3) 4×8 PBs and 8×4 PBs are only adopted in uni-predictive coding.
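The partition modes and constraints (1) and (2) above can be sketched as a small enumeration. This is only an illustrative sketch: the function name `pb_partitions` and the mode labels are hypothetical, and the size of the larger PB in each asymmetric mode (3M/4) is an assumption taken from the usual HEVC definition of AMP rather than from the text. Constraint (3) concerns bi-prediction and is not modeled.

```python
def pb_partitions(m):
    """Return {mode label: [(width, height), ...]} for an M x M coding block."""
    if m not in (8, 16, 32, 64):
        raise ValueError("CB size M must be 8, 16, 32 or 64")
    quarter, half = m // 4, m // 2
    modes = {
        "MxM": [(m, m)],
        "M/2xM": [(half, m)] * 2,
        "MxM/2": [(m, half)] * 2,
    }
    # Constraint (2): the square split of an 8x8 CB would give 4x4 PBs,
    # which are not allowed for inter prediction.
    if half > 4:
        modes["M/2xM/2"] = [(half, half)] * 4
    # Constraint (1): AMP is turned off when M = 8.
    if m > 8:
        # Assumption: the second PB of each AMP mode covers the remaining 3M/4.
        modes["M/4xM (L)"] = [(quarter, m), (m - quarter, m)]
        modes["M/4xM (R)"] = [(m - quarter, m), (quarter, m)]
        modes["MxM/4 (U)"] = [(m, quarter), (m, m - quarter)]
        modes["MxM/4 (D)"] = [(m, m - quarter), (m, quarter)]
    return modes
```

For M=16 the sketch lists all eight modes, while for M=8 only M×M, M/2×M and M×M/2 remain.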

3 VLSI Architecture

Based on the HEVC motion estimation theory introduced above, a high performance VLSI architecture for integer motion estimation has been studied. The top level of the architecture is shown in figure 3. The on-chip RAM holds the current CTB and the search area. Once the size of the CTB is decided, the complexity of IME is also determined. When the size of the CTB is 64×64, a quarter down-sampling module is used to reduce the number of search points. This strategy strikes a good balance between hardware resources and compression quality. Other schemes, such as the full search algorithm and the Level D data-reuse scheme [5], have also been used in the architecture design. The full search algorithm has the advantages of computational regularity and excellent output video quality. The Level D scheme reuses pixel data across the search window strips of consecutive current blocks.

Figure 3 Top-level of design (blocks: DDR holding the Reference Frame and Current Frame, AXI Interconnect, On-chip RAM holding the Search Area and Current CTB, Shift_Regs, two PE arrays, Best SADs, Mode Decision)

A 32×33 pixel 2D three-direction shift register array is proposed to improve data fetching efficiency. Since two processing element (PE) arrays are used in parallel, two consecutive reference candidate blocks are stored in the shift register array. When the next two reference candidates are pushed into the shift register array, 32×2 pixels are updated, whereas the other 32×31 pixels are reused. In order to eliminate bubble clock cycles, a column of 33 pixels is added to the array, so the array size becomes 33×33 pixels. The search window can be scanned in three directions: upward, downward, and left to right. When the direction is left to right, the shift register array uses one cycle to update the requested 33 pixels. For the typical search range [-24, 23], it takes (24+24)/2=24 clock cycles to finish data matching for one column and one clock cycle to shift to another column. In total, 24×48=1152 clock cycles are needed to process one CTB.
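The cycle count above can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: the function name `ctb_search_cycles` and its parameters are hypothetical, and it assumes the same search range in both directions and that each PE array evaluates one candidate position per clock cycle. Like the total quoted in the text, it does not add a separate cycle for the shift between columns.

```python
def ctb_search_cycles(search_min=-24, search_max=23, parallel_arrays=2):
    """Clock cycles to scan the whole search window for one CTB."""
    # Candidate positions along one column of the search window: 48 for [-24, 23].
    positions_per_column = search_max - search_min + 1
    # Assumption: the horizontal range equals the vertical range, so 48 columns.
    columns = search_max - search_min + 1
    # Two PE arrays process two consecutive candidates per cycle (ceiling division).
    cycles_per_column = -(-positions_per_column // parallel_arrays)
    return cycles_per_column * columns


print(ctb_search_cycles())  # 24 cycles per column times 48 columns
```

With the default arguments the result matches the 1152 cycles stated in the text; a single PE array would double it.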


A processing element (PE) calculates the difference of a pair of pixels, which is used to form the Sums of Absolute Differences (SADs) of PBs. The size of the PB decides the number of PEs. One PE array includes 1024 PEs. Two PE arrays can concurrently calculate 2×145 SADs in one clock cycle. The contents of the 145 SADs are shown in table 1.

In order to implement the CTB quadtree partition, each PE array is divided into four 16×16 execution units (16×16 EUs) as shown in figure 4(a). Each 16×16 EU is divided into sixteen 4×4 EUs as shown in figure 4(b). When the CTB size is 64×64, quarter down-sampling is performed first to reduce the number of SADs from 593 to 145, as a tradeoff between performance and precision.
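One way to arrive at the figures 145 and 593 is to count the PBs of every CB in the quadtree. The sketch below is an assumption about how the SADs are counted, not a reproduction of table 1: it counts one M×M PB, two M/2×M PBs, two M×M/2 PBs and, for M>8, two PBs for each of the four AMP modes, and it treats the M/2×M/2 PBs as the M×M PBs of the next quadtree level. The function names are hypothetical.

```python
def sads_per_cb(m):
    """Number of PB SADs contributed by one M x M coding block."""
    count = 1 + 2 + 2          # MxM, two M/2xM, two MxM/2
    if m > 8:
        count += 4 * 2         # four AMP modes with two PBs each (off for M = 8)
    return count


def sads_per_ctb(ctb_size, min_cb=8):
    """Total number of PB SADs over all quadtree levels of one CTB."""
    total, cb_size, cb_count = 0, ctb_size, 1
    while cb_size >= min_cb:
        total += cb_count * sads_per_cb(cb_size)
        cb_size //= 2          # next quadtree level
        cb_count *= 4          # four times as many CBs per level
    return total


print(sads_per_ctb(32), sads_per_ctb(64))
```

Under these assumptions a 32×32 CTB gives 13 + 4×13 + 16×5 = 145 SADs and a 64×64 CTB gives 13 + 4×145 = 593, which agrees with the numbers in the text.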

Table 1 Content of 145 SADs

Number

Note: * represents AMP mode

Figure 4 The design architecture (a) PE array architecture (b) 16×16 EU architecture (figure labels: Reference data, Current data, EUs, SAD outputs)


Hierarchical adder trees are built to help the PE array generate 145 SADs for each shift operation, as shown in figures 4(a) and 4(b). The best 145 SADs selected by the comparator modules are sent to the mode decision module.
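The idea of a hierarchical adder tree can be illustrated in software: the SADs of small blocks are computed once and then added up to obtain the SADs of larger blocks. The sketch below is a behavioral illustration only, not the circuit of figure 4. The function names are hypothetical, blocks are plain lists of rows, and only square block sizes are shown; rectangular and AMP PBs would be formed by adding the appropriate sub-block sums in the same way.

```python
def sad_4x4_grid(cur, ref):
    """SADs of all 4x4 sub-blocks of two equally sized square pixel blocks."""
    n = len(cur)
    grid = [[0] * (n // 4) for _ in range(n // 4)]
    for y in range(n):
        for x in range(n):
            # Each pixel pair contributes its absolute difference to one 4x4 SAD.
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def merge_2x2(grid):
    """Add each 2x2 group of SADs to get the SADs of blocks twice the size."""
    g = len(grid) // 2
    return [[grid[2 * y][2 * x] + grid[2 * y][2 * x + 1]
             + grid[2 * y + 1][2 * x] + grid[2 * y + 1][2 * x + 1]
             for x in range(g)] for y in range(g)]


def square_sads(cur, ref):
    """Return {block size: grid of SADs} from 4x4 up to the full block."""
    levels = {4: sad_4x4_grid(cur, ref)}
    size = 4
    while len(levels[size]) > 1:
        levels[size * 2] = merge_2x2(levels[size])
        size *= 2
    return levels
```

For a 32×32 block this yields grids for sizes 4, 8, 16 and 32, where the 4×4 sums serve only as intermediate values, since 4×4 PBs are not allowed for inter prediction.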

The mode decision module proposed in this paper is based on the structure presented by G.A. Ruiz [4]. The improved circuit is shown in figure 5. Three extra adders are used to sum up the costs of the four 16×16 blocks, which eliminates the bubble clock cycles of shifting data, and sixteen registers store the block costs to meet the requirement of HEVC.
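The comparison performed by a mode decision stage can be sketched in software as a bottom-up quadtree decision: the cost of keeping a CB whole is compared with the summed cost of its four sub-blocks. This is a generic illustration and not the circuit of figure 5 or the structure of [4]. The function name `best_cb_cost` and the `costs` dictionary, which maps a hypothetical key (size, x, y) to the best cost already found for that CB, are assumptions made for the example.

```python
def best_cb_cost(costs, size, x=0, y=0, min_size=8):
    """Return (cost, list of chosen CBs) for the CB of given size at (x, y)."""
    own_cost = costs[(size, x, y)]
    if size == min_size:
        return own_cost, [(size, x, y)]
    half = size // 2
    split_cost, split_blocks = 0, []
    # Sum the best costs of the four sub-blocks (three additions in hardware).
    for dy in (0, half):
        for dx in (0, half):
            cost, blocks = best_cb_cost(costs, half, x + dx, y + dy, min_size)
            split_cost += cost
            split_blocks += blocks
    # Keep the CB whole unless splitting is strictly cheaper.
    if split_cost < own_cost:
        return split_cost, split_blocks
    return own_cost, [(size, x, y)]
```

Called on the CTB root, the function returns the lowest total cost together with the list of CBs that make up the chosen partition.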

Figure 5 Mode Decision architecture (figure labels: Accumulator, COMPARATOR, reg, COST)

4 Results and Performance Analysis

The proposed architecture has been implemented in Verilog, and simulated and verified with ModelSim. The Verilog code has been synthesized, placed and routed into a Xilinx Virtex-6 XC6VLX-550T using the Xilinx ISE tool. The synthesis results are given in table 2.

Table 2 Resource utilization of the FPGA

Resource          Utilization    Percentage
Slice logic       55346          16%
Slice register    19744          2.9%
BRAM              148 kB         5.2%

The performance of the IME is related to the size of the search window. The default range of displacement is [-24, 23]. It takes 1152 clock cycles to process one CTB. Taking the initialization clock cycles into account, the architecture can process over 70K CTBs per second at a system clock of 110 MHz. It can meet the requirement of 1080p@30fps video, which needs 61K CTBs per second.
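The throughput figures can be cross-checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, a 1920×1080 frame and 32×32 CTBs are assumed, and partial CTBs at the frame border are rounded up. The peak rate ignores the initialization clock cycles, so it is an upper bound and is higher than the 70K figure reported above.

```python
import math


def required_ctbs_per_second(width=1920, height=1080, fps=30, ctb=32):
    """CTBs that must be processed per second for real-time encoding."""
    ctbs_per_frame = math.ceil(width / ctb) * math.ceil(height / ctb)
    return ctbs_per_frame * fps


def peak_ctbs_per_second(clock_hz=110_000_000, cycles_per_ctb=1152):
    """Upper bound on CTB throughput, ignoring initialization cycles."""
    return clock_hz / cycles_per_ctb


required = required_ctbs_per_second()   # 60 x 34 CTBs per frame at 30 fps
peak = peak_ctbs_per_second()           # 110 MHz divided by 1152 cycles
print(required, round(peak))
```

The requirement comes out at 61200 CTBs per second, consistent with the 61K quoted in the text, and the peak rate of roughly 95K leaves headroom for the initialization overhead.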

5 Conclusion

A high performance parallel VLSI architecture for integer motion estimation has been studied. It can meet real-time HEVC encoding requirements for 1080p@30fps video. The architecture has been implemented on a Virtex-6 XC6VLX-550T with a 110 MHz system clock. When implemented as an ASIC, the number of parallel PE arrays could be reduced as the system clock increases.

References

[1] B. Bross, et al., "High Efficiency Video Coding (HEVC) Text Specification Draft 9", document JCTVC-K1003, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC), (Oct. 2012)

[2] Gary J. Sullivan, et al., "Overview of the High Efficiency Video Coding (HEVC) Standard", IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1648–1667, (Dec. 2012)

[3] Cao Wei, et al., "A High-Performance Reconfigurable VLSI Architecture for VBSME in H.264", IEEE Trans. Consumer Electron., vol. 54, no. 3, pp. 1338–1345, (Aug. 2008)

[4] G.A. Ruiz and J.A. Michell, "An efficient VLSI processor chip for variable block size integer motion estimation in H.264/AVC", Signal Processing: Image Communication, vol. 26, pp. 289–303, (July 2011)

[5] Tuan J.-C., et al., "On the data reuse and memory bandwidth analysis of full-search block-matching VLSI architecture", IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 1, pp. 61–72, (2002)
