Region-of-Interest Tracking Method for Video Plus Depth Coding

Conference Paper · September 2014 · DOI: 10.1109/ELINFOCOM.2014.6914363

Nam Pham Thanh, Thanh Nguyen Xuan, Hung Luu Viet, Hung Bui Quang, Ha Le Thanh

Human-Machine Interaction Laboratory, University of Engineering and Technology, Vietnam National University, Hanoi

{nampt.mi12, thanhnx_54, hunglv_54, hungbq, ltha}@vnu.edu.vn

Abstract

3D video can offer viewers an exhilarating experience; however, it carries more information than 2D video and therefore demands higher bandwidth for transmission. To overcome this hurdle, this paper proposes a method to detect and track the video regions that interest viewers. These regions may then be coded at a higher bitrate than the others to preserve visual quality while the total video bitrate is reduced. Experimental results show the effectiveness of the method, which can be applied to interactive video coders.

Keywords: video plus depth, ROI, tracking, detection

1 Introduction

In video communication, visual attention has been shown to affect how viewers perceive video content. In addition, managing the quality of video content is extremely challenging, especially when the size must be reduced. Region-of-interest (ROI) coding is a useful approach for shrinking the size of transmitted video without decreasing the quality perceived by viewers in the ROI. To give viewers the highest perceived quality, it is necessary to detect and track the ROI. Video plus depth is a video format containing both color and depth information, which allows ROIs to be detected and tracked more efficiently.

To model attention objects, Han et al. [1] used three attributes (attention value, edge set, and homogeneity measure), which makes their method fairly complicated. Because of the complexity of the ROI detection algorithms in [1] and [2], they are difficult to apply on portable devices. In [3], the ROI is detected as a pre-processing step of video coding using both luminance and chrominance information with skin-color matching; the authors consider not only human faces but also hands as ROIs. Two methods for ROI detection based on depth images, with or without skin-color detection, are presented in [4]. In [5], ROI extraction sometimes detects an unimportant region as the ROI, which reduces coding efficiency without increasing perceived quality. Using many cues to extract the ROI, such as depth, illumination, motion, and contour, [6] and [8] present complex algorithms for ROI extraction in 3D multi-view video. In videos, the face is usually the most interesting region, but extracting only the face as the ROI, as in [7], is not enough, because human visual attention can focus on other objects in a video.

For the above reasons, in this paper we propose a method for tracking ROIs efficiently in order to encode video plus depth. A video is analyzed in three main steps: detecting the ROIs, tracking those ROIs, and predicting their movement. More specifically, ROI detection is based on the flood-fill algorithm, while ROI tracking uses either a block-matching algorithm or motion vectors from an H.264 encoder. The rest of the paper is organized as follows: ROI detection, tracking, and some related problems and their solutions are presented in Section 2. Section 3 describes experiments that demonstrate the efficiency of the proposed method. Finally, Section 4 concludes the paper.

2 Proposed method for ROI detection and tracking

To encode video with ROIs, it is necessary to detect the position of the ROI in the first frame and track it in subsequent frames. In this research, we build an interactive video coding model in which viewers choose their ROIs by touching or clicking on the video display screen. The touched location is then recorded and signaled to the video encoder. Given an initial touch location, the video encoder detects and tracks ROIs based on both the color and the depth information obtained from video plus depth sequences. Finally, the detected ROIs are coded at a higher bitrate than the non-ROI regions, which significantly reduces the total encoded information while keeping the video quality in the ROIs acceptable to viewers. However, the details of ROI coding are beyond the scope of this work.

2.1 ROI detection

Each frame of a 2D plus depth video sequence consists of a monoscopic 2D color frame and a depth image frame, which contains the depth information of the objects in the video. Observation shows that any two adjacent points belonging to the same object have similar depth values. Starting from the input point, called the anchor point, the ROI is detected using the flood-fill algorithm. The basic idea of this algorithm is to visit the pixels neighboring the anchor point, and then the neighbors of each pixel already in the ROI, marking a pixel as part of the ROI if its features are similar to those of the anchor point or of other ROI points. Two adjacent pixels are considered to be in the same region if the difference between their depth values is smaller than a defined threshold.
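The detection step described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `flood_fill_roi`, the representation of the depth frame as a list of rows, the 4-connected neighborhood, and the (x, y) = (column, row) convention are all assumptions made for the example.

```python
from collections import deque


def flood_fill_roi(depth, anchor, threshold):
    """Grow an ROI from `anchor` over a 2D depth map given as a list of rows.

    A 4-connected neighbor joins the ROI when its depth differs from the
    adjacent ROI pixel by less than `threshold` (assumed connectivity).
    Returns the set of (x, y) pixels in the ROI.
    """
    rows, cols = len(depth), len(depth[0])
    ax, ay = anchor                      # anchor is (x, y) = (column, row)
    roi = {(ax, ay)}
    queue = deque([(ax, ay)])            # plays the role of a pixel queue
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nx < cols and 0 <= ny < rows
            if inside and (nx, ny) not in roi:
                # Compare the depths of the two adjacent pixels.
                if abs(depth[ny][nx] - depth[y][x]) < threshold:
                    roi.add((nx, ny))
                    queue.append((nx, ny))
    return roi
```

For instance, on a 3 × 3 depth map whose top-left 2 × 2 corner has depths near 10 and whose remaining pixels have depth 50, a call with anchor (0, 0) and threshold 5 would return only the four corner pixels.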

2.2 ROI-off problem and solution

In some cases, two distinct objects share the same depth value. The flood-fill algorithm may then enlarge the ROI to cover two or more objects, which degrades coding efficiency.

In the flood-fill algorithm, we use a queue Q to store points before they are checked for membership in the ROI. Our solution to this problem is based on observing how the number of pixels in queue Q grows. In a correct detection, the size of the queue has just one explosion period, during which it increases sharply. In an ROI-off detection, there are two explosion periods, and the second one indicates that ROI extraction is expanding into unexpected regions.

Fig. 1 shows ROI detection in the “Ballet” video when the flood-fill algorithm extracts the region of the dancer. In this case, the original algorithm would mistakenly treat the floor as part of the same region as the dancer, resulting in a much larger ROI. The sizes of queue Q while the ROI is extracted from the dancer in ten consecutive frames are shown in Fig. 2. In all examined frames, the size of queue Q has two explosion periods: the first corresponds to the expansion over the dancer, and the second to the extraction of the floor.

From this observation, we conclude that when the size of queue Q starts to explode for the second time, the detection process has begun to capture a wrong region rather than the expected ROI. Therefore, we apply a cut-off method inside the flood-fill algorithm that monitors the size of queue Q. The flood-fill algorithm stops extracting the ROI when the second explosion of the queue size is observed, so the unexpected region is not included in the ROI. The size of queue Q in the same 10 consecutive frames is shown in Fig. 3.

Fig. 1. A depth frame of the “Ballet” video. The point at location (x, y) is input by the user as a feature of the ROI. The pixels in the transition region between the dancer’s legs and the floor have the same depth values; hence, the flood-fill algorithm recognizes these two regions as one.

Fig. 2. The sizes of queue Q in 10 consecutive frames of the “Ballet” video.

Fig. 3. The size of queue Q over 10 frames of the “Ballet” video when the cut-off method is applied.
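The text does not state how an explosion of the queue size is recognized, so the following is only one possible sketch of the cut-off decision. It assumes a hysteresis rule with two hypothetical tuning values, `high` and `low`: a burst is counted each time the recorded queue size rises to `high` or more after having dropped to `low` or less. The function name `second_explosion_index` is likewise illustrative.

```python
def second_explosion_index(sizes, high, low):
    """Return the index at which the queue size starts its second burst.

    `sizes` is the queue size recorded at each flood-fill step. A burst is
    counted when the size reaches `high` after having been at or below `low`
    (assumed hysteresis rule). Returns None if fewer than two bursts occur.
    """
    bursts = 0
    armed = True                         # ready to count a new burst
    for i, size in enumerate(sizes):
        if armed and size >= high:
            bursts += 1
            armed = False
            if bursts == 2:
                return i                 # flood fill would stop here
        elif not armed and size <= low:
            armed = True                 # queue has calmed down again
    return None
```

Under these assumptions, a flood fill would record the queue size at every step and stop growing the ROI as soon as this function returns an index; a trace with a single burst returns None and the fill runs to completion.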

2.3 ROI tracking

To improve ROI extraction efficiency, the ROI is tracked after being detected. Once the ROI information in the first frame has been collected, the ROI must be tracked in subsequent frames, since this information will be used for encoding the ROI. Two methods are used to follow the movement of the ROI.

First, an independent module was implemented to estimate block-based motion vectors using the block-matching algorithm. The main idea of the block-matching technique is to divide each frame into so-called macroblocks and to find the best-matching block by comparing the absolute differences of all pixels of two blocks. This module worked well but was time-consuming because of the many calculations involved. The second solution, with low computational complexity and high performance, is to reuse the block-based motion vectors from the H.264 encoder.
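As an illustration of the first option, block matching with a sum-of-absolute-differences cost can be sketched as below. The exhaustive search, the search radius `search`, the grayscale frames stored as lists of rows, and the function names are assumptions for the example; the paper does not specify its search strategy.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(frame_a[ay + j][ax + i] - frame_b[by + j][bx + i])
        for j in range(size)
        for i in range(size)
    )


def block_motion_vector(prev, curr, x, y, size, search):
    """Motion vector (m, n) of the block whose top-left corner is (x, y).

    Every displacement within +/- `search` pixels is tried (exhaustive
    search, an assumed strategy) and the lowest-cost one is returned.
    """
    rows, cols = len(curr), len(curr[0])
    best_cost, best_mv = None, (0, 0)
    for n in range(-search, search + 1):
        for m in range(-search, search + 1):
            bx, by = x + m, y + n
            # Skip candidates that fall outside the current frame.
            if 0 <= bx <= cols - size and 0 <= by <= rows - size:
                cost = sad(prev, curr, x, y, bx, by, size)
                if best_cost is None or cost < best_cost:
                    best_cost, best_mv = cost, (m, n)
    return best_mv
```

The nested loops make the cost of this search grow with the square of the search radius for every macroblock, which is consistent with the observation that the module is slow compared with reusing the encoder's motion vectors.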

The whole ROI is divided into macroblocks. For a macroblock at location $(a, b)$ with motion vector $(m, n)$, its location in the next frame is predicted as:

$$(a', b') = (a, b) + (m, n) \tag{1}$$

Fig. 4. Motion estimation of ROI.

In Fig. 4, the frame is divided into macroblocks, which are represented by green dots. The man is moving; hence, all macroblocks in that region have their own motions. When the ROI moves, object features such as shape and color usually remain the same or change only slightly between two consecutive frames of a video sequence. Therefore, the motion vector of the whole ROI is estimated from the motions of all macroblocks in that ROI:

$$(\bar{\omega}_x, \bar{\omega}_y) = \frac{\sum_{i=1}^{n} (\bar{\omega}_{x_i}, \bar{\omega}_{y_i})}{n} \tag{2}$$

where $n$ is the total number of macroblocks in the ROI and $(\bar{\omega}_{x_i}, \bar{\omega}_{y_i})$ is the motion vector of the $i$-th macroblock. To represent the movement of the ROI after each frame, we estimate the movement of the anchor point using the approximate motion vector of the ROI:

$$(a_{k+1}, b_{k+1}) = (a_k, b_k) + (\bar{\omega}_x, \bar{\omega}_y) \tag{3}$$

in which $(a_k, b_k)$ is the location of the anchor point at the $k$-th frame and $(a_1, b_1)$ is the input point at the first frame.
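Equations (2) and (3) amount to averaging the macroblock motion vectors and shifting the anchor point by the result. A minimal sketch follows; the function names and the choice to return a zero vector for an empty ROI are assumptions for the example.

```python
def roi_motion_vector(block_vectors):
    """Equation (2): mean of the macroblock motion vectors inside the ROI."""
    n = len(block_vectors)
    if n == 0:
        return (0.0, 0.0)                # assumed fallback for an empty ROI
    mean_x = sum(v[0] for v in block_vectors) / n
    mean_y = sum(v[1] for v in block_vectors) / n
    return (mean_x, mean_y)


def move_anchor(anchor, roi_vector):
    """Equation (3): shift the anchor point by the ROI motion vector."""
    return (anchor[0] + roi_vector[0], anchor[1] + roi_vector[1])
```

For example, three macroblocks with motion vectors (2, 0), (4, 2), and (3, 1) give an ROI motion vector of (3, 1), so an anchor at (10, 20) moves to (13, 21) in the next frame.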

2.4 Cumulative error

When the ROI is tracked through a video sequence by motion vectors, a small error is introduced at each frame. If these errors are not corrected after each frame, they accumulate over a large number of frames, and the ROI is eventually tracked incorrectly. At each frame, after computing the vector $\bar{\nu}$, we therefore take an additional step to check whether the current object is the one we are looking for: the depth value at the anchor point is compared with the recorded average depth value of the ROI in the previous frame, using a threshold determined experimentally. Then, object detection with the flood-fill algorithm is reapplied to extract the ROI, and the anchor point's location is set to the average location of all pixels in the ROI:

$$(x, y) = \frac{\sum_{i=1}^{n} (x_i, y_i)}{n} \tag{4}$$

in which $(x, y)$ is the current location of the anchor point, $(x_i, y_i)$ is the location of the $i$-th pixel in the ROI, and $n$ is the total number of pixels in the ROI.
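The per-frame correction can be illustrated with two small helpers: one for the depth check on the anchor point and one for the re-centering of Equation (4). The names `anchor_on_object` and `centroid`, the rounding of the anchor to the nearest pixel, and the strict "less than threshold" comparison are assumptions for the example; the ROI is assumed to be non-empty.

```python
def anchor_on_object(depth, anchor, prev_mean_depth, threshold):
    """Check that the depth at the anchor is close to the ROI's previous mean depth."""
    x, y = int(round(anchor[0])), int(round(anchor[1]))
    return abs(depth[y][x] - prev_mean_depth) < threshold


def centroid(roi_pixels):
    """Equation (4): average location of all pixels in a non-empty ROI."""
    n = len(roi_pixels)
    mean_x = sum(x for x, _ in roi_pixels) / n
    mean_y = sum(y for _, y in roi_pixels) / n
    return (mean_x, mean_y)
```

Re-centering the anchor on the freshly extracted ROI at every frame means that an error in the averaged motion vector affects only the current frame instead of being carried forward.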

2.5 Occlusion problem

During ROI tracking, the tracked ROI may suddenly disappear. As can be seen in Fig. 5, the ROI is the man in the white region, who is moving behind another man. In Fig. 6, the ROI has almost disappeared and, theoretically, the system can no longer track it. To improve the robustness of our approach, we predict the most likely location of the object when it reappears. When an object disappears, our approach enters a special mode called Prediction Mode (PM). The basic idea of PM rests on two assumptions:

a) A video object normally moves with a constant velocity. The object's velocity data is collected in previous frames. Because the velocity of the object is proportional to its movement through each frame, we use the vector $\bar{\nu}$ to represent the object's velocity. Based on the velocity data, we predict the new position where the object may reappear.

b) PM is applied only for a limited period of time. A threshold constant TIME_OUT = 3 (seconds) is defined, because observation shows that the probability that the ROI reappears after a longer absence is very small. After this period, if the ROI has not reappeared, we consider that it has disappeared permanently and stop predicting its location.

TABLE I. ROI DETECTION AND TRACKING RESULT

ROI                       Conditions                                      Detection   Tracking
Moving ballet dancer      Indoor, stable camera, resolution 1024 × 768    80%         good
Moving rectangular box    Indoor, stable camera, resolution 640 × 480     100%        good
[...]                     [...] camera, resolution 640 × 480              100%        good
[...]                     [...] camera, resolution 640 × 480              100%        good
Moving man (disappears)   Indoor, stable camera, resolution 640 × 480     90%         medium

Fig. 5. The ROI is moving into an invisible area, behind the standing man.

Fig. 6. The ROI disappears.
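The two Prediction Mode assumptions, constant velocity and a limited prediction period, can be sketched as a generator of predicted anchor positions. The frame rate parameter `fps`, the function name, and the conversion of TIME_OUT into a frame count are assumptions for the example; only the value of 3 seconds comes from the text.

```python
TIME_OUT = 3.0  # seconds, the value given in the text


def predict_positions(last_anchor, velocity, fps, time_out=TIME_OUT):
    """Yield (frame_offset, predicted_anchor) while Prediction Mode is active.

    `velocity` is the per-frame ROI motion vector recorded before the ROI
    disappeared; `fps` is the (assumed known) frame rate of the video.
    """
    max_frames = int(time_out * fps)     # PM ends after TIME_OUT seconds
    x, y = last_anchor
    for k in range(1, max_frames + 1):
        x += velocity[0]                 # constant-velocity extrapolation
        y += velocity[1]
        yield k, (x, y)
```

A caller would stop consuming predictions as soon as the ROI is found again, for example when the depth at a predicted position matches the recorded ROI depth; if the generator is exhausted first, the ROI is treated as permanently lost.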

3 Experimental results

Several videos with different conditions (camera position, resolution) and different kinds of ROI (moving person, standing person, ball, rectangular box) were tested. Table I shows the detailed detection and tracking results. The detection accuracy is assessed by recording the size of queue Q. As described in Section 2.2, frames in which the queue size has a single explosion, or in which the cut-off method is applied successfully, are counted as true detection frames.

4 Conclusion

A method for detecting and tracking ROIs in 2D video plus depth using depth information has been briefly presented in this paper. In our method, the viewer provides prior information about the interesting video regions by clicking or touching on the display screen, and the flood-fill algorithm is then responsible for detecting the exact ROI. Afterwards, motion vectors of the ROIs are extracted to predict their movement. We succeeded in detecting the correct ROIs and tracking them in subsequent frames, which helps not only to increase the perceived quality of the video in the ROI but also to reduce the bitrate of video coding.

5 Acknowledgment

This work was supported by the 2012 basic research projects in natural science of the National Foundation for Science & Technology Development (Nafosted), Vietnam (102.01-2012.36, Coding and communication of multiview video plus depth for 3D Television Systems).

References

[1] Junwei Han, King N. Ngan, Mingjing Li, and Hong-Jiang Zhang (2006, Jan.). Unsupervised extraction of visual attention objects in color images. IEEE Trans. Circuits and Systems for Video Technology [Online]. 16(1), pp. 141–145.

[2] Yang Wang, Kia-Fock Loe, Tele Tan, and Jian-Kang Wu (2005, Jul.). Spatiotemporal video segmentation based on graphical models. IEEE Trans. Image Processing [Online]. 14(7), pp. 937–947.

[3] Minghui Wang, Tianruo Zhang, Chen Liu, and Satoshi Goto (2009). Region-of-interest based H.264 encoding parameter allocation for low power video communication. Presented at 5th International Colloquium on Signal Processing & Its Applications (CSPA).

[4] L. S. Karlsson and M. Sjöström (2008, May). Region-of-interest 3D video coding based on depth images. Presented at 3DTV Conference.

[5] D. V. S. X. De Silva, W. A. C. Fernando, and S. L. P. Yasakethu (2009, Aug.). Object based coding of the depth maps for 3D video coding. IEEE Trans. Consumer Electronics [Online]. 55(3), pp. 1699–1706.

[6] Yun Zhang, Mei Yu, and Gang-Yi Jiang (2009, Jul.). Depth based region of interest extraction for multi-view video coding. Presented at 2009 International Conference on Machine Learning and Cybernetics.

[7] Tianruo Zhang, Chen Liu, Minghui Wang, and Satoshi Goto (2009, Oct.). Region-of-interest based H.264 encoder for videophone with a hardware macroblock level face detector. Presented at IEEE International Workshop on Multimedia Signal Processing (MMSP 2009), pp. 1–6.

[8] Yun Zhang, Gangyi Jiang, Mei Yu, You Yang, Zongju Peng, and Ken Chen (2010, Jul.). Object based coding of the depth maps for 3D video coding. Journal of Visual Communication and Image Representation [Online]. 21(5-6), pp. 498–512.
