MASTER THESIS OF INFORMATION TECHNOLOGY
Hanoi - 2015
Major: Computer Science
Originality Statement
I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgment is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.
Date:
Signed:
Abstract
Due to the characteristics of the human visual system, people usually focus more on a specific region of a video frame, named the Region-of-Interest (ROI), rather than watching the whole frame. In addition, ROI-based video coding can also help to effectively reduce the encoding bitrate required for video transmission over networks, especially for 3D TV transmissions. Therefore, in this work, we propose a novel ROI-based bit allocation method which can adaptively extract and increase the visual quality of the ROI while saving a large amount of encoding bitrate for video data. In the proposed method, we first detect and extract the ROI based on the depth information obtained from 3D TV video coding sequences. Then, based on the extracted ROI, a novel bit allocation scheme is performed to solve the rate-distortion optimization problem, in which higher priority bitrates are adaptively assigned to the ROI while the total encoding bitrate of the video frames is kept satisfying all constraints required by the RD optimization. Experimental results show that the proposed method can not only provide higher peak signal-to-noise ratio performance but also save up to 11.0% of the encoding bitrate compared to other conventional methods.
Acknowledgements
First and foremost, I would like to express my deepest gratitude to my supervisor, Dr. Le Thanh Ha, for his patient guidance and continuous support throughout the years. He always appears when I need help, and responds to queries so helpfully and promptly. I also appreciate Dr. Dinh Trieu Duong for giving me some useful advice. I would like to give my honest appreciation to my best friends at University of Engineering and Technology for whatsoever they did for me. I sincerely acknowledge the Vietnam National University, Hanoi and especially the 102.01-2012.36 project "Coding and communication of multiview video plus depth for 3D Television Systems" for financially supporting my master study. Finally, this thesis would not have been possible without the support and love of my family. Thank you!
2.1.1 Unsupervised detection method
2.1.2 Supervised detection method
2.2 ROI tracking
2.3 Rate control and ROI video coding
3 ROI detection and tracking
The general video transmission system
Conventional video compression
Example of bitrate fluctuation vs target bitrate
Framework of proposed ROI video coding method
A color frame of break-dancers video sequence
A depth frame of break-dancers video sequence
Illustration of flood-fill algorithm in ROI detection
The ROI is expanded surrounding its border
Color, depth and extracted ROI frames in ballet sequence
Color, depth and extracted ROI frames in break-dancers sequence
The floor is misunderstood as ROI
The correctly extracted ROI
Spiral searching
PSNR comparison between color and depth video of Ballet sequence
PSNR comparison of Ballet sequence (view 0) with low target bitrate
PSNR comparison of Ballet sequence (view 0)
PSNR of ROI comparison of Ballet sequence (view 0)
PSNR of ROI comparison of Break-dancers sequence (view 1)
Results of ROI detection and tracking
QP values for all regions in specific cases
Bitrate reduction comparison
List of abbreviations
3D 3-dimensional
dB Decibel
GOP Group Of Pictures
HEVC High Efficiency Video Coding
HVS Human Visual System
JM Joint Model
JVT Joint Video Team
MB Macroblock
MSE Mean Square Error
PSNR Peak Signal to Noise Ratio
SAD Sum of Absolute Differences
SRC Scalable Rate Control
we listen, the movies we watch should be in higher quality. For years, we have all been familiar with 2D movies and color televisions, while 3-dimensional (3D) videos have appeared only recently. From the small screens of portable devices like smartphones or tablets to the bigger screens of our smart home televisions or the cinema, we can watch 3D videos with single view or multiview, which bring more experiences of the real world than ever [1,6,10,12,16,23]. This change seems to be a revolution in science and technology. However, this evolution is a great technical challenge for those who are responsible for broadcasting 3D videos, especially through the internet.
To help the internet bandwidth adapt well to the high resolution of 3D videos and the exponential growth in their sizes, it is necessary to compress raw videos into smaller videos for convenient transmission. In contrast, the loss of video contents usually cannot satisfy the audiences. Region-of-Interest (ROI) video coding is an efficient approach to address this problem; it helps to increase the coding efficiency based on the research in visual perception [6,8,9,14,17,19-22]. Normally, when watching a video, in particular a video scene, it is impossible for us to focus on the whole screen with lots of information including people, objects and background. The attention of the Human Visual System (HVS) is mainly on human faces, bodies or some impressive
objects that are located at the center of the frame, which is called ROI [8,17,19]. Other
Figure 1.1: The general video transmission system
The general video transmission system is depicted in Figure 1.1, where the raw video is encoded into bit information and transmitted, followed by a decoding process before being displayed to the users. Particularly, in the conventional method of video coding, as can be seen from Figure 1.2, the whole video frame is compressed with the same coefficient, which means that the visual quality of all parts of the frame, including the tiger, the stick and the snow, is the same. On the contrary, ROI coding compresses video with higher priority for ROIs than for the background. As a result, in Figure 1.3, the quality degradation is seen only in non-ROI regions (the stick and snow), while high visual quality is observed in the ROI, the tiger in this case.
Unlike conventional coding methods, video ROI coding needs to detect and track the ROI through the video sequence. The accuracy of the ROI detection and tracking processes guarantees the efficiency of this coding method, in which the video size is small enough to be transmitted smoothly whilst the perceived quality of the ROI satisfies viewers. In the encoding process, Rate Control (RC) plays an important role.
In fact, if parameters of the video encoder such as Quantization Parameter (QP), Group Of Pictures (GOP), motion estimation and search range are not changed, then
the contents and details of the video frame are the key components which affect the output bitrate of the encoded video. The problem is that the network or the storage can hardly adapt to the video bitrate if it varies in a wide range. Hence, the output bitstream of the encoded video needs to be controlled to make sure that it can be transmitted conveniently through the network without losing packets. In other words, the total bitrate produced should be close to and satisfy the condition of the network bandwidth, as illustrated in Figure 1.4.
The method used to keep the bitrate fluctuating slightly around a constant target by adjusting the coding parameters is called RC. The most popular approach of RC is to modify the QP.
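As a rough illustration of RC by QP adjustment, the following sketch raises the QP when a frame overshoots its bit target and lowers it when the frame undershoots. It is a toy feedback rule, not the RC model proposed in this thesis; the function name `update_qp`, the fixed `step`, and the per-frame target are assumptions made for the example, and the 0 to 51 QP range is that of H.264/AVC.

```python
def update_qp(qp, bits_produced, target_bits, step=1, qp_min=0, qp_max=51):
    """Toy rate-control rule (illustrative assumption, not the thesis model).

    A larger QP means coarser quantization and therefore fewer output bits,
    so the QP is nudged up after an overshoot and down after an undershoot.
    """
    if bits_produced > target_bits:
        qp += step   # too many bits: quantize more coarsely
    elif bits_produced < target_bits:
        qp -= step   # bits to spare: quantize more finely
    # Keep the QP inside the range allowed by the codec.
    return max(qp_min, min(qp_max, qp))


# Hypothetical frame sizes (in bits) against a constant per-frame target.
qp = 30
for produced in (1200, 1100, 900, 1000):
    qp = update_qp(qp, produced, target_bits=1000)
```

Practical RC schemes replace the fixed step with a rate model, but the feedback direction is the same.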
In this thesis, an interactive framework for ROI coding is proposed. 3D video is represented by the combination of 2D color frames and depth frames, which consist of depth information of objects in the video. Afterwards, ROIs are detected and tracked based on this information. To be more specific, users choose a region that they want to focus on and this information is sent to the server; after that, the server detects and tracks the ROI and applies ROI coding. Finally, a ROI-compressed video is displayed on users' screens.
Thanks to the information provided by users, the ROI is detected and tracked accurately. Moreover, the problem in case of ROI disappears and then come back
another ROI
Besides, for coding the ROI, a RC model is proposed, which is responsible for allocating bits to each Macroblock (MB) and then calculating the QP to apply compression in all regions (including ROI regions, transition region and non-ROI region) of the video, to balance between the internet bandwidth and the quality of the compressed video.
future works are discussed.
In recent years, the content-based video processing method has been a popular alternative to the block-based or pixel-based approaches developed in some coding standards. The interesting or meaningful objects for viewers are called semantic visual information, or by a more specific name, semantic video objects, in J]. There are many methods for extracting semantic visual information, and they can be divided into two separate groups: unsupervised and supervised methods. Unsupervised methods usually find it difficult to extract semantic visual information because of its vague definition, while the video scene may consist of many objects at once. Suppose that two different algorithms are utilized to determine semantic video objects in the same video scene; the results would be far different from each other and would not meet the expected results of users. On the contrary, in supervised extraction methods, a small effort from users makes it easier for the computer to detect the target objects.
2.1 ROI detection
2.1.1 Unsupervised detection method
The unsupervised detection, or fully automatic extraction, performs the segmentation without any support from users, based completely on the characteristics of objects.
The depth map is generated either by camera capturing or by a disparity-matching-based depth creation algorithm in [22]. Foreground and background are segmented from the depth map, then combined with motion extraction and texture contour extraction to extract the ROI, after the processes of analyzing the histogram and eliminating noise. The whole extraction process is unsupervised. However, this algorithm also defines both dynamic and static objects in the foreground as ROI, while static objects are rarely of interest to users. This leads to redundancy in bit allocation.
Both luminance and chrominance information of YUV format frames are used in [17] to extract the ROI. The saliency map is first created from the Y channel of the input and then skin color detection is performed in the U and V channels. In the end, it applies a greedy algorithm on the saliency map, lessening the area of the chosen region to extract the ROI. The objective function of this algorithm both maximizes the saliency points and minimizes the area of the ROI.
To model the attention objects, [4] used three attributes: attention value, edge set, and homogeneity measure, which makes this method fairly complicated. Due to the complexity of the ROI detection algorithms in [4] and [18], they are difficult to apply in portable devices.
There are two methods for ROI detection in [6] based on depth images, with or without skin-color detection. The first method calculates the histogram of depth values to determine a depth threshold for the ROI, and all the points closer than the points at the depth threshold belong to the ROI. The latter method combines the depth information of the former one with skin-color detection for ROI detection. These two methods create two different results, and they are not realistic due to the fact that close regions are not the most interesting objects for the viewers in many cases.
Unlike other research, [2] distinguishes the two terms "object" and "region" explicitly, which then helps in the segmentation of the video. In the segmentation process, the computer extracts a number of regions based on multiple features like color components, displacement values, position values and texture information. Several of these regions are combined to form an object in the semantic step by users' selection via a simple graphical user interface. The segmentation results seem to be
accurate; however, the object formation is very complicated and time-consuming. As a consequence, this method is not suitable for real-time applications.
In [14], a frame is divided into 3 types of regions: the first boundary, the second boundary and the central region. This division, combined with the MAD estimation for each MB, helps to determine 5 types of MB in descending order of importance to the HVS: central MB, expanded MB, ordinary MB, lower MB and lowest MB. In this method, the central region is considered as ROI with higher probability than others. This method seems to be inefficient when the ROI is located far from the center, since the region that users focus on will then have lower quality than others.
2.1.2 Supervised detection method
The supervised detection method can be divided into two smaller categories: manual and semi-automatic segmentation. The manual segmentation approach is a time-consuming process because it requires users to determine the borders of wanted regions in many frames. Hence, this method is not suitable for online transmission. Meanwhile, the semi-automatic segmentation method also needs the support of users, who identify the regions' borders in one frame. The tracking process of this approach is completely automatic.
The contour-based method used in [J requires users to draw the boundary of their objects. This approach takes a tremendous effort from users, especially when the object is polygonal or has a non-specific shape. Moreover, the classification of such objects also requires an algorithm with high complexity, as it takes many steps to find the anticipated regions, not to mention the accuracy of object extraction.
Although the method of segmentation and tracking in [3] is semi-automatic and its segmentation process needs the assistance of users, the tracking process in this research is unsupervised and completely automatic, containing 4 iterative steps: motion prediction, motion estimation, boundary warping and boundary adjustment.
[20, 21] track the ROI by both temporal and spatial methods, which means their mechanism inherits information not only from preceding frames in the same GOP but also from the neighboring views. Only the first frame of the central view is utilized to detect the ROI in this method.
2.3 Rate control and ROI video coding
In all video encoders, RC is a mandatory part due to its role in achieving a source bit rate which meets the requirement of the bandwidth while minimizing the rate distortion. There has been much research on RC so far, including RC for texture video and RC for 3D video. R-λ-based RC is used in [9] for the cutting-edge video encoder High Efficiency Video Coding (HEVC) at three main levels: GOP, frame and coding unit, with an ROI-based model scheme. They propose a constant that is the ratio between the bitrate used for the ROI and the background. The wide range of variation of this constant creates a significant gap in visual quality between ROI and background.
The ROI-based RC scheme in [14,19] adopts the classic Rate Quantization (RQ) model with a slight customization. In [9], they allocate more bits to encode the frames with more complexity. The QPs for the 5 types of MB have a 1-unit increment in this order: central MB, expanded MB, ordinary MB, lower MB and lowest MB. The complicated split of MB types and the reallocation of bits for frames increase the complexity of bit allocation in the RC model.
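The 1-unit QP increment across the five MB types can be pictured with the short sketch below. It only illustrates the ordering described above; the names `MB_TYPES` and `qp_for_mb_type`, the `base_qp` parameter and the clamping to 51 (the H.264/AVC maximum) are assumptions for the example and are not taken from the cited work.

```python
# MB types in descending order of importance to the HVS, as listed above.
MB_TYPES = ["central", "expanded", "ordinary", "lower", "lowest"]


def qp_for_mb_type(mb_type, base_qp, qp_max=51):
    """Return base_qp plus one unit per step down the importance order.

    base_qp is a hypothetical QP for the most important (central) MBs.
    """
    offset = MB_TYPES.index(mb_type)  # 0 for central, 4 for lowest
    return min(base_qp + offset, qp_max)


# With a hypothetical base QP of 28, the five types receive QP 28 to 32.
qps = [qp_for_mb_type(t, 28) for t in MB_TYPES]
```

Since a larger QP means coarser quantization, the less important MB types are encoded with progressively lower quality.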
There is a bit budget allocated for the ROI, with the remaining bits spent on the background, since the ROI encoding process is performed prior to other regions in [19]. The ROI is flexibly ranked in several interest levels according to users' requirements, with the minimum level of ROI considered as the background. The QP of the ROI lies in a specific interval and is chosen depending on the actual need. In [13], frames with strong local motion attention are assigned more bits than other frames in the same GOP, while in each frame, the more visually significant the MBs are, the more bits they are allocated.
In 3D video coding, many RC models have been proposed, each of which needs to solve several problems: bit allocation between color and depth views, and bit allocation at view level and frame level. The QP is selected in the final stage at frame or MB level, depending on the needs of coding. A synthesized view distortion model is presented in [12,16] to establish the joint bit allocation. A convex optimization problem is solved in [12] in order to determine the bitrate ratio between texture and depth. This ratio is adaptively changed, with texture video usually using more bits than depth video, except for the cases in which the background regions are simple and stationary and the depth maps are not smooth. [5,12] estimate the distortion between the synthesized view and the coded views of both texture video and depth maps by using a Rate Distortion (RD) model. In [LY], after determining the optimal target bit rate between
to 24.41% of average bitrate for different video sequences while the Peak Signal to Noise Ratio (PSNR) of the ROI is almost unchanged. Correspondingly, ΔQP = 4 in [17]. This method focuses on reducing the encoding complexity while the video quality is still acceptable.
[1] builds a protocol which varies the QP for both depth and color views from 20 to 44. In various test sequences, the PSNR of reconstructed videos reaches its peak when the bitrate spent on the depth sequence fluctuates from 40% to 50% of the total bitrate for the whole 2D plus depth video sequence. When the depth information is inadequate, the coding process fails in object segmentation and causes errors in the synthesized view. Conversely, if the depth occupies almost the total coding bitrate, it degrades the texture information significantly.
[3] proposes both ROI-based rate control for H.264/AVC and ROI-based computational power allocation, which helps save up to 72% of encoding time and makes this encoder suitable for real-time conversational communication or portable devices. The visual quality improves despite the PSNR being a little lower than that of the original reference software (Joint Model (JM) 9.8).
Chapter 3
ROI detection and tracking
In this chapter, a framework of ROI video coding is presented. In order to encode video with ROI, it is necessary to detect the position of the ROI in the first frame and track the ROI in subsequent frames. In this framework, viewers are able to choose their ROIs by touching or clicking on the video display screen. Afterwards, the touched location is recorded and signalled to the server. Given an initial touching location, the video encoder at the server detects and tracks ROIs based on both color and depth information obtained from video plus depth sequences. According to the analysis of different methods in the previous chapter, including both their advantages and disadvantages, our proposed method is a semi-automatic approach. However, unlike other methods, we only need the human operator to touch the ROI as presented in [10], instead of identifying its boundary.
Specifically, as can be seen from Figure 3.1, at the very first step of the proposed ROI coding method, users can interactively choose the regions they want to focus on the most. In spite of the fact that this detection method is apparently more complicated for users than other methods, the information of the ROI is exactly what the users want. The central pixel of the small region that users touch on the screen is called the anchor point; its coordinate is then sent to the server. The encoder at the server is responsible for detecting the ROI by utilizing the aforementioned location of the anchor point (Section 3.1). After the ROI is initialized, the server keeps tracking this region in subsequent frames throughout its movement until it no longer appears on the screen (Section 3.2). Based on its location, the MBs which cover the ROI will be encoded with higher quality than those of other regions (Chapter 4). The process of how many bits each MB is allocated in all regions, and then, the
Figure 3.2: A color frame of break-dancers video sequence
QP selection are presented in Section 4.1 and Section 4.2, respectively. Following these steps, the whole content of the compressed video is sent back to the users' device for decoding. Users would watch the video to which ROI video coding is applied, with a quality that is suitable for their bandwidth capacity. In due course, viewers can choose other regions as their ROI and those above steps of ROI video coding will
darkest and the brightest color, respectively. Figure 3.2 shows a color frame of break-dancers video sequence and Figure 3.3 shows its corresponding depth map.
3.1.1 ROI region extraction
Normally, two points that belong to the same object have the same or approximately the same depth values. From the anchor point which is input by users, the ROI is determined by using the flood-fill algorithm. The basic idea of this algorithm is to visit pixels in the frame, beginning from the anchor point. We use a queue Q to store visiting pixels. From a pixel of the ROI, the algorithm expands in its 4 directions to get the information of 4 pixels. The locations of these neighboring points are enqueued to Q. Elements in queue Q are sequentially dequeued to check whether they belong to the ROI or not. The condition for this step is very simple: a pixel belongs to the ROI if and only if the difference in depth value between this point and its adjacent point (which was previously determined as a point of the ROI) is less than or equal to a defined threshold. We empirically set this threshold value to 3. After being determined as a part of the ROI, a pixel is in turn expanded in 4 directions to discover 4 other points, except the ones which have already been visited or are still in queue Q. In order
Algorithm 1 ROI detection using flood-fill algorithm
procedure DETECTROI(frame, anchor)
procedure CHECKROI(Point p)
if |p.depth() − refDepth| ≤ threshold then
refDepth ← P.dequeue()
checkROI(current)
return
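Because only fragments of Algorithm 1 survive in this copy, the following Python sketch restates the flood-fill procedure from the description in Section 3.1.1. It is an illustrative reconstruction under stated assumptions, not the exact listing: the depth map is assumed to be a list of rows of integer depth values, the anchor is a (row, column) pair, and the names `detect_roi`, `depth` and `roi` are hypothetical. As a small simplification, each neighbour is tested before it is enqueued rather than after it is dequeued; the membership condition (depth difference to the adjacent ROI pixel less than or equal to the threshold, set to 3) is the one given in the text.

```python
from collections import deque


def detect_roi(depth, anchor, threshold=3):
    """Flood-fill a depth map from `anchor` and return the ROI pixels.

    depth: list of rows, each a list of integer depth values (assumed layout).
    anchor: (row, col) of the pixel touched by the user.
    A neighbour joins the ROI when its depth differs from the adjacent
    ROI pixel by at most `threshold`.
    """
    rows, cols = len(depth), len(depth[0])
    roi = {anchor}            # pixels already accepted (also marks visited)
    queue = deque([anchor])   # the queue Q of pixels still to be expanded
    while queue:
        r, c = queue.popleft()
        ref_depth = depth[r][c]
        # Expand in the 4 directions: up, down, left, right.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue      # outside the frame
            if (nr, nc) in roi:
                continue      # already accepted or waiting in the queue
            if abs(depth[nr][nc] - ref_depth) <= threshold:
                roi.add((nr, nc))
                queue.append((nr, nc))
    return roi


# Tiny hypothetical depth map: a near object (10 to 13) on a far background (50).
depth_map = [
    [10, 11, 50],
    [12, 13, 50],
    [50, 50, 50],
]
object_pixels = detect_roi(depth_map, (0, 0))
```

Because each comparison is made against the adjacent ROI pixel rather than against the anchor itself, the region can follow a surface whose depth changes gradually, which is also why a smoothly receding floor may be absorbed into the ROI.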