Region-of-interest tracking method for video plus depth coding. Conference Paper, September 2014. DOI: 10.1109/ELINFOCOM.2014.6914363
Region-of-Interest Tracking Method for Video Plus Depth Coding
Nam Pham Thanh, Thanh Nguyen Xuan, Hung Luu Viet, Hung Bui Quang, Ha Le Thanh
Human-Machine Interaction Laboratory, University of Engineering and Technology, Vietnam National University, Hanoi
{nampt.mi12, thanhnx_54, hunglv_54, hungbq, ltha}@vnu.edu.vn
Abstract
3D video offers viewers a more immersive experience than 2D video, but it carries more information and therefore demands higher bandwidth for transmission. To overcome this hurdle, we propose a method that detects and tracks the video regions viewers are interested in. These regions can then be coded at a higher bitrate than the rest of the frame, preserving visual quality there while the total video bitrate is reduced. Experimental results show the effectiveness of our method, which can be applied to interactive video coders.
Keywords: video plus depth, ROI, tracking, detection
1 Introduction
In video communication, visual attention has been shown to affect how viewers perceive video content. In addition, managing the quality of video content is challenging, especially when the size of the video must be reduced. Region-of-interest (ROI) coding is a useful approach to shrink the size of the transmitted video without decreasing the quality perceived by viewers in the ROI. To bring viewers the highest perceived quality, it is necessary to detect and track the ROI. Video plus depth is a kind of video containing both color information and depth information, which makes it possible to detect and track the ROI more efficiently.
To model the attention objects in [1], Han et al. used three attributes: attention value, edge set, and homogeneity measure, which makes this method fairly complicated. Because of the complexity of their ROI detection algorithms, the methods in [1] and [2] are difficult to apply on portable devices. In [3], the ROI is detected as a pre-processing step of video coding by using both luminance and chrominance information with skin-color matching; the authors consider not only human faces but also hands as ROI. Two methods for ROI detection based on depth images, with or without skin-color detection, are presented in [4]. In [5], ROI extraction showed an unexpected result: in some cases an unimportant region is detected as ROI, which reduces coding efficiency without increasing perceived quality. Using a large amount of information such as depth, illumination, motion and contour, [6] and [8] give complex algorithms to extract the ROI in 3D multi-view video. In videos, the face is usually the most interesting region, but extracting only the face as ROI, as in [7], is not enough, because human visual attention can focus on other objects in a video.
For the above reasons, in this paper we propose a method for tracking the ROI efficiently in order to encode video plus depth. There are three main steps to analyze a video: detect the ROIs, track those ROIs, and predict the ROIs' movement. More specifically, ROI detection is based on the flood-fill algorithm, while ROI tracking uses either a block-matching algorithm or motion vectors from an H.264 encoder. The rest of the paper is organized as follows: ROI detection, tracking, and some related problems and solutions are presented in Section 2. In Section 3, experiments are described to demonstrate the efficiency of the proposed method. Finally, Section 4 concludes the paper.
2 Proposed method for ROI detection and tracking
In order to encode video with ROI, it is necessary to detect the position of the ROI in the first frame and track it in subsequent frames. In this research, we build an interactive video coding model in which viewers choose their ROIs by touching or clicking on the video display screen. The touched location is then recorded and signaled to the video encoder. Given an initial touch location, the video encoder detects and tracks ROIs based on both color and depth information obtained from video plus depth sequences. Finally, the detected ROI regions are coded at a higher bitrate than the non-ROI regions, in order to significantly reduce the total encoded information while keeping the video quality in the ROIs acceptable to viewers. However, the details of ROI coding are beyond the scope of this work.
2.1 ROI detection
Each frame of a 2D plus depth video sequence consists of a monoscopic 2D color frame and a depth image frame, which contains the depth of the objects in the video. Observations show that any two adjacent points belonging to the same object have similar depth values. Starting from the input point, called the anchor point, the ROI is detected using the flood-fill algorithm. The basic idea of this algorithm is to visit the pixels neighboring the anchor point and mark them as part of the ROI if they have features similar to the anchor point or to other points already in the ROI. Two adjacent pixels are considered to be in the same region if the difference between their depth values is smaller than a defined threshold.
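The depth-based flood fill described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `flood_fill_roi`, the 4-neighbour connectivity, and the representation of the depth frame as a list of rows are assumptions made for the example.

```python
from collections import deque


def flood_fill_roi(depth, anchor, threshold):
    """Grow an ROI from `anchor` over a depth frame.

    depth     : list of rows of depth values (assumed layout)
    anchor    : (row, col) of the point chosen by the viewer
    threshold : maximum depth difference between adjacent ROI pixels
    Returns the set of (row, col) pixels in the ROI.
    """
    rows, cols = len(depth), len(depth[0])
    roi = {anchor}
    queue = deque([anchor])  # plays the role of queue Q in the text
    while queue:
        r, c = queue.popleft()
        # 4-neighbour connectivity is an assumption of this sketch.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in roi:
                # Adjacent pixels join the region when their depth
                # values differ by less than the threshold.
                if abs(depth[nr][nc] - depth[r][c]) < threshold:
                    roi.add((nr, nc))
                    queue.append((nr, nc))
    return roi


depth_frame = [
    [10, 10, 50],
    [10, 11, 50],
    [50, 50, 50],
]
roi = flood_fill_roi(depth_frame, (0, 0), 5)
```

In this toy frame the four pixels with depth 10 or 11 form the ROI, while the pixels with depth 50 are left out.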
2.2 ROI-off problem and solution
In some cases, two distinct objects share the same depth value. The flood-fill algorithm may then enlarge the ROI to cover two or more objects, which degrades coding efficiency.
In the flood-fill algorithm, we use a queue Q to store points before they are checked for membership in the ROI. Our solution to this problem is based on observing the growth of the number of pixels in queue Q. In the correct detection case, the size of this queue has just one explosion period, in which its size increases sharply. In the ROI-off case, there are two explosion periods, and the second one indicates that ROI extraction is expanding into unexpected regions.
Fig. 1 shows ROI detection in the “Ballet” video when the flood-fill algorithm extracts the region of the dancer. In this case, the original algorithm mistakenly treats the floor as part of the same region as the dancer, resulting in a much larger ROI. The sizes of queue Q over ten consecutive frames while the ROI is extracted from the dancer are shown in Fig. 2. In all examined frames, the size of queue Q has two explosion periods: the first corresponds to the expansion over the dancer, while the second corresponds to the extraction of the floor.
From this observation, we conclude that when the size of queue Q starts to explode for the second time, the detection process has begun to cover the wrong region rather than the expected ROI. Therefore, we apply a cut-off method inside the flood-fill algorithm that monitors the size of queue Q.
Fig. 1. A depth frame of the “Ballet” video. The point at location (x, y) is input by the user as a feature of the ROI. The pixels in the transition region between the dancer’s legs and the floor have the same depth values, hence the flood-fill algorithm recognizes these two regions as one.
Fig. 2. The sizes of queue Q in 10 consecutive frames of the “Ballet” video.
The flood-fill algorithm stops extracting the ROI when the second explosion of the size of queue Q is observed, so the unexpected region is not included in the ROI. The size of queue Q in the same 10 consecutive frames is shown in Fig. 3.
Fig. 3. The development of the size of queue Q in 10 frames of the “Ballet” video when the cut-off method is applied.
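The paper does not state how an explosion of the queue size is detected, so the following is only one possible sketch of the cut-off decision. It assumes that the size of queue Q is sampled at regular intervals during the flood fill, and that a sample-to-sample increase of at least `jump` counts as explosive; both the function name `second_burst_start` and the parameter `jump` are hypothetical.

```python
def second_burst_start(sizes, jump):
    """Return the index of the sample where the second explosion begins.

    sizes : queue-size samples taken during one flood fill (assumed input)
    jump  : minimum increase between samples that counts as explosive
            (hypothetical tuning parameter)
    Returns None when fewer than two explosion periods are found.
    """
    bursts = 0
    in_burst = False
    for i in range(1, len(sizes)):
        explosive = sizes[i] - sizes[i - 1] >= jump
        if explosive and not in_burst:
            bursts += 1  # a new explosion period starts here
            if bursts == 2:
                return i
        in_burst = explosive
    return None


# Trace with two explosion periods (ROI-off case).
roi_off_trace = [1, 2, 40, 90, 95, 96, 94, 150, 260]
# Trace with a single explosion period (correct detection).
correct_trace = [1, 2, 40, 90, 95, 60, 20, 0]

cut_at = second_burst_start(roi_off_trace, 30)
no_cut = second_burst_start(correct_trace, 30)
```

A flood fill using this check would stop adding pixels once `second_burst_start` reports an index. A single non-explosive sample splits a burst in two here, so a practical version would probably smooth the trace first.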
2.3 ROI tracking
To improve ROI extraction efficiency, the ROI is tracked after being detected. Once the ROI information in the first frame has been collected, the ROI must be followed in subsequent frames, since this information will be used for encoding the ROI. Two methods are used to follow the movement of the ROI.
First, an independent module was implemented to estimate block-based motion vectors using a block-matching algorithm. The main idea of the block-matching technique is to divide each frame into so-called macroblocks and, for each block, find the best-matching block by comparing the absolute differences of all pixels of the two blocks. This module worked well but was time-consuming because of the many calculations involved. The second solution, with low computational complexity and high performance, is to reuse the block-based motion vectors from the H.264 encoder.
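As an illustration of the block-matching idea, the sketch below performs an exhaustive search that minimizes the sum of absolute differences (SAD) between a block in the previous frame and candidate blocks in the current frame. The function names, the full-search strategy and the `search` range are assumptions of this example. The returned vector is the displacement from the previous frame to the current one, matching the convention of Eq. (1).

```python
def sad(prev, cur, pr, pc, cr, cc, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(prev[pr + i][pc + j] - cur[cr + i][cc + j])
        for i in range(size)
        for j in range(size)
    )


def best_motion_vector(prev, cur, r, c, size, search):
    """Find where the block at (r, c) in `prev` moved to in `cur`.

    Exhaustive search within +/- `search` pixels (assumed strategy).
    Returns the displacement (d_row, d_col) with the lowest SAD.
    """
    rows, cols = len(cur), len(cur[0])
    best = (0, 0)
    best_cost = sad(prev, cur, r, c, r, c, size)  # zero-motion candidate
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            nr, nc = r + dr, c + dc
            # Skip candidates that fall outside the frame.
            if 0 <= nr <= rows - size and 0 <= nc <= cols - size:
                cost = sad(prev, cur, r, c, nr, nc, size)
                if cost < best_cost:
                    best, best_cost = (dr, dc), cost
    return best


# A 2x2 bright block moves from (1, 1) to (2, 3) between two 6x6 frames.
prev_frame = [[0] * 6 for _ in range(6)]
cur_frame = [[0] * 6 for _ in range(6)]
for i in range(2):
    for j in range(2):
        prev_frame[1 + i][1 + j] = 9
        cur_frame[2 + i][3 + j] = 9

vector = best_motion_vector(prev_frame, cur_frame, 1, 1, 2, 2)
```

The nested search evaluates every candidate in the window, which is consistent with the observation above that the independent module was time-consuming.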
The whole ROI is divided into macroblocks. For a macroblock at location $(a, b)$ with motion vector $(m, n)$, its location in the next frame is predicted as:
$$(a', b') = (a, b) + (m, n) \tag{1}$$
Fig. 4. Motion estimation of ROI.
In Fig. 4, the frame is divided into macroblocks, which are represented by green dots. The man is moving, hence all macroblocks in that region have their own motions. When the ROI moves, object features such as shape and color usually remain the same or change only slightly between two consecutive frames of a video sequence. Therefore, the motion vector of the whole ROI is estimated from the motions of all macroblocks in that ROI as below:
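$$(\bar{\omega}_x, \bar{\omega}_y) = \frac{1}{n} \sum_{i=1}^{n} \left(\bar{\omega}_x^{\,i}, \bar{\omega}_y^{\,i}\right) \tag{2}$$
where $n$ is the total number of macroblocks in the ROI and $(\bar{\omega}_x^{\,i}, \bar{\omega}_y^{\,i})$ is the motion vector of the $i$-th macroblock. To represent the movement of the ROI after each frame, we estimate the movement of the anchor point by the approximate motion vector of the ROI, using the following formula:
$$(a_{k+1}, b_{k+1}) = (a_k, b_k) + (\bar{\omega}_x, \bar{\omega}_y) \tag{3}$$
in which $(a_k, b_k)$ is the location of the anchor point at the $k$-th frame and $(a_1, b_1)$ is the input point at the first frame.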
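Equations (2) and (3) amount to averaging the macroblock motion vectors and shifting the anchor point by the result. A minimal sketch, with the hypothetical helper names `roi_motion_vector` and `move_anchor` and with motion vectors represented as `(x, y)` tuples:

```python
def roi_motion_vector(block_vectors):
    """Eq. (2): average the motion vectors of all macroblocks in the ROI."""
    n = len(block_vectors)
    mean_x = sum(v[0] for v in block_vectors) / n
    mean_y = sum(v[1] for v in block_vectors) / n
    return (mean_x, mean_y)


def move_anchor(anchor, roi_vector):
    """Eq. (3): shift the anchor point by the ROI motion vector."""
    return (anchor[0] + roi_vector[0], anchor[1] + roi_vector[1])


# Three macroblocks with slightly different motions (made-up values).
vectors = [(2, 0), (4, 2), (3, 1)]
roi_vector = roi_motion_vector(vectors)
next_anchor = move_anchor((10, 20), roi_vector)
```

With these made-up values the ROI moves by (3.0, 1.0), so the anchor point goes from (10, 20) to (13.0, 21.0).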
2.4 Cumulative error
When the ROI is tracked through a video sequence by motion vectors, a small error is introduced at each frame. If these errors are not corrected after each frame, they accumulate over a large number of frames, and the ROI is eventually tracked incorrectly. At each frame, after computing the ROI motion vector $\bar{\nu} = (\bar{\omega}_x, \bar{\omega}_y)$, we take an additional step to check whether the current object is the one we are looking for, by comparing the depth value of the anchor point with the recorded average depth value of the ROI in the previous frame, using a threshold value determined experimentally. Then the flood-fill object detection is reapplied to extract the ROI, and the anchor point’s location is set to the average location of all pixels in the ROI:
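$$(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i, y_i) \tag{4}$$
in which $(x, y)$ is the current location of the anchor point, $(x_i, y_i)$ is the location of the $i$-th pixel in the ROI, and $n$ is the total number of pixels in the ROI.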
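The correction step can be illustrated with two small helpers: a depth consistency check and the centroid of Eq. (4). The names `same_object` and `recenter_anchor` are hypothetical, and the threshold used in the example is made up, since the paper only says that its value comes from experiments.

```python
def same_object(anchor_depth, prev_mean_depth, threshold):
    """Check that the tracked anchor still lies on the same object.

    Compares the depth at the anchor point with the average ROI depth
    recorded in the previous frame.
    """
    return abs(anchor_depth - prev_mean_depth) < threshold


def recenter_anchor(roi_pixels):
    """Eq. (4): the anchor becomes the mean location of all ROI pixels."""
    n = len(roi_pixels)
    mean_x = sum(x for x, _ in roi_pixels) / n
    mean_y = sum(y for _, y in roi_pixels) / n
    return (mean_x, mean_y)


# Made-up ROI of four pixels at the corners of a square.
pixels = [(0, 0), (2, 0), (0, 2), (2, 2)]
anchor = recenter_anchor(pixels)

still_tracked = same_object(100, 104, threshold=10)
lost = not same_object(100, 150, threshold=10)
```

Re-centering at every frame keeps the anchor inside the object, so the small per-frame errors of the motion vectors are not carried over to the next frame.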
2.5 Occlusion problem
During ROI tracking, the tracked ROI may suddenly disappear. As can be seen in Fig. 5, the ROI is the man in the white region, who is moving behind another man. In Fig. 6, the ROI has almost disappeared and, in principle, the system can no longer track it. To improve the robustness of our approach, we predict the most likely location of the object when it reappears. When an object disappears, our approach enters a special mode called Prediction Mode (PM). The basic idea of PM rests on two assumptions:
a) A video object normally moves with a constant velocity. In previous frames, the object’s velocity data is collected. The velocity of the object is proportional to its movement from frame to frame, so we use the vector $\bar{\nu}$ to represent the object velocity. Based on the velocity data, we predict the new position where the object can reappear.
b) We only apply PM for a defined period of time. From observation, the probability that the ROI reappears after a long absence is very small. Hence, a threshold constant TIME_OUT = 3 (seconds) is defined. After this period, if the ROI has not reappeared, we consider that it has disappeared permanently and stop predicting its location.

TABLE I. ROI DETECTION AND TRACKING RESULT
ROI                      | Conditions                                    | Detection | Tracking
Moving ballet dancer     | Indoor, stable camera, resolution 1024 × 768  | 80%       | good
Moving rectangular box   | Indoor, stable camera, resolution 640 × 480   | 100%      | good
[...]                    | [...] camera, resolution 640 × 480            | 100%      | good
[...]                    | [...] camera, resolution 640 × 480            | 100%      | good
Moving man (disappears)  | Indoor, stable camera, resolution 640 × 480   | 90%       | medium
Fig. 5. The ROI is moving into an invisible area, behind the standing man.
Fig. 6. The ROI disappears.
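The two assumptions behind Prediction Mode in Section 2.5 can be sketched as a constant-velocity extrapolation with a time limit. `TIME_OUT = 3` seconds is taken from the text; the function names, the per-frame velocity representation, and the frame rate `fps` are assumptions of this example.

```python
TIME_OUT = 3.0  # seconds, the threshold given in the text


def predict_position(last_position, velocity, frames_missing):
    """Assumption a): extrapolate with constant per-frame velocity.

    last_position  : (x, y) of the anchor when the ROI was last seen
    velocity       : ROI motion vector per frame before it disappeared
    frames_missing : number of frames since the ROI disappeared
    """
    return (
        last_position[0] + velocity[0] * frames_missing,
        last_position[1] + velocity[1] * frames_missing,
    )


def prediction_active(frames_missing, fps):
    """Assumption b): keep predicting only until TIME_OUT has elapsed.

    fps is the frame rate of the video (not specified in the paper).
    """
    return frames_missing / fps <= TIME_OUT


# ROI last seen at (100, 50), moving 2 px right and 1 px up per frame.
guess = predict_position((100, 50), (2, -1), 10)
keep_going = prediction_active(60, fps=25)   # 2.4 s since disappearance
give_up = not prediction_active(90, fps=25)  # 3.6 s since disappearance
```

Once `prediction_active` returns `False`, the tracker would treat the ROI as permanently lost and stop predicting its location.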
3 Experimental results
Several videos with different conditions (camera position, resolution) and different kinds of ROI (moving person, standing person, ball, rectangular box) were tested. The detection accuracy is assessed from the size of queue Q. As described above for ROI detection using the depth frame, frames in which the queue size shows a single explosion, or in which the cut-off method is applied successfully, are counted as correctly detected frames. The detection and tracking results are shown in detail in Table I.
4 Conclusion
A method for detecting and tracking ROI in 2D video plus depth using depth information has been briefly presented in this paper. In our method, the viewer provides prior information about the interesting video regions by clicking or touching the display screen, and the flood-fill algorithm then detects the ROI. Afterwards, the motion vectors of the ROIs are extracted to predict the ROIs’ movement. We succeeded in detecting the correct ROIs and tracking them in subsequent frames, which helps not only to increase the perceived quality of the video in the ROI but also to reduce the bit rate of video coding.
5 Acknowledgment
This work was supported by the basic research projects in natural science in 2012 of the National Foundation for Science & Technology Development (Nafosted), Vietnam (102.01-2012.36, Coding and communication of multiview video plus depth for 3D Television Systems).
References
[1] Junwei Han, King N. Ngan, Mingjing Li, and Hong-Jiang Zhang. (2006, Jan.). Unsupervised extraction of visual attention objects in color images. IEEE Trans. Circuits and Systems for Video Technology [Online]. 16(1), pp. 141–145.
[2] Yang Wang, Kia-Fock Loe, Tele Tan, and Jian-Kang Wu. (2005, Jul.). Spatiotemporal video segmentation based on graphical models. IEEE Trans. Image Processing [Online]. 14(7), pp. 937–947.
[3] Minghui Wang, Tianruo Zhang, Chen Liu, and Satoshi Goto. (2009). Region-of-interest based H.264 encoding parameter allocation for low power video communication. Presented at 5th International Colloquium on Signal Processing & Its Applications (CSPA).
[4] L. S. Karlsson and M. Sjöström. (2008, May). Region-of-interest 3D video coding based on depth images. Presented at 3DTV Conference.
[5] D. V. S. X. De Silva, W. A. C. Fernando, and S. L. P. Yasakethu. (2009, Aug.). Object based coding of the depth maps for 3D video coding. IEEE Trans. Consumer Electronics [Online]. 55(3), pp. 1699–1706.
[6] Yun Zhang, Mei Yu, and Gang-Yi Jiang. (2009, Jul.). Depth based region of interest extraction for multi-view video coding. Presented at 2009 International Conference on Machine Learning and Cybernetics.
[7] Tianruo Zhang, Chen Liu, Minghui Wang, and Satoshi Goto. (2009, Oct.). Region-of-interest based H.264 encoder for videophone with a hardware macroblock level face detector. Presented at IEEE International Workshop on Multimedia Signal Processing (MMSP 2009), pp. 1–6.
[8] Yun Zhang, Gangyi Jiang, Mei Yu, You Yang, Zongju Peng, and Ken Chen. (2010, Jul.). Object based coding of the depth maps for 3D video coding. Journal of Visual Communication and Image Representation [Online]. 21(5-6), pp. 498–512.