Region-of-Interest 3D Video Coding Based on Depth Images (Master's thesis, Information Technology)



MASTER THESIS OF INFORMATION TECHNOLOGY

Hanoi - 2015


Major: Computer Science


Originality Statement

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgment is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.

Date:

Signed:


Abstract

Due to the characteristics of the human visual system, people usually focus more on a specific region named Region-of-Interest (ROI) of a video frame, rather than watch the whole frame. In addition, ROI-based video coding can also help to effectively reduce the number of encoding bitrates required for video transmission over networks, especially for 3D TV transmissions. Therefore, in this work, we propose a novel ROI-based bit allocation method which can adaptively extract and increase the visual quality of ROI while saving a huge number of encoding bitrates for video data. In the proposed method, we first detect and extract ROI based on the depth information obtained from 3D TV video coding sequences. Then, based on the extracted ROI, a novel bit allocation scheme is performed to solve the rate-distortion optimization problem, in which the higher priority bitrates are adaptively assigned to ROI while the total encoding bitrates of video frames are kept satisfying all constraints required by the RD optimization. Experimental results show that the proposed method can provide not only higher peak signal-to-noise ratio performance but also save up to 11.0% encoding bitrates compared to other conventional methods.


Acknowledgements

First and foremost, I would like to express my deepest gratitude to my supervisor, Dr. Le Thanh Ha, for his patient guidance and continuous support throughout the years. He always appears when I need help, and responds to queries so helpfully and promptly. I also appreciate Dr. Dinh Trieu Duong for giving me some useful advice. I would like to give my honest appreciation to my best friends at University of Engineering and Technology for whatsoever they did for me. I sincerely acknowledge the Vietnam National University, Hanoi and especially the 102.01-2012.36 project "Coding and communication of multiview video plus depth for 3D Television Systems" for supporting finance to my master study. Finally, this thesis would not have been possible without the support and love of my family. Thank you!


2.1.1 Unsupervised detection method

2.1.2 Supervised detection method

2.2 ROI tracking

2.3 Rate control and ROI video coding

3 ROI detection and tracking


The general video transmission system

Conventional video compression

Example of bitrate fluctuation vs. target bitrate

Framework of proposed ROI video coding method

A color frame of break-dancers video sequence

A depth frame of break-dancers video sequence

Illustration of flood-fill algorithm in ROI detection

The ROI is expanded surrounding its border

Color, depth and extracted ROI frames in ballet sequence

Color, depth and extracted ROI frames in break-dancers sequence

The floor is misunderstood as ROI

The correctly extracted ROI

Spiral searching

PSNR comparison between color and depth video of Ballet sequence

PSNR comparison of Ballet sequence (view 0) with low target bitrate

PSNR comparison of Ballet sequence (view 0)

PSNR of ROI comparison of Ballet sequence (view 0)

PSNR of ROI comparison of Break-dancers sequence (view 1)


Results of ROI detection and tracking

QP values for all regions in specific cases

Bitrate reduction comparison


List of abbreviations

3D 3-dimensional

dB Decibel

GOP Group Of Pictures

HEVC High Efficiency Video Coding

HVS Human Visual System

JM Joint Model

JVT Joint Video Team

MB Macroblock

MSE Mean Square Error

PSNR Peak Signal to Noise Ratio

SAD Sum of Absolute Differences

SRC Scalable Rate Control


we listen, the movies we watch should be in higher quality. For years, we have all been familiar with 2D movies and color televisions, while 3-dimensional (3D) videos have appeared only recently. From the small screens of portable devices like smartphones or tablets to the bigger screens of our smart home televisions or the cinema, we can watch 3D videos with single view or multiview, which bring more experiences of the real world than ever [1,6,10,12,16,23]. This change seems to be a revolution in science and technology. However, this evolution is a great technical challenge for those who are responsible for broadcasting 3D videos, especially through the internet.

To help the internet bandwidth adapt well to the high resolution of 3D videos and the exponential growth in their sizes, it is necessary to compress raw videos into smaller videos for convenient transmission. On the other hand, the loss of video content usually cannot satisfy the audience. Region-of-Interest (ROI) video coding is an efficient approach to address this problem; it helps to increase the coding efficiency based on the research in visual perception [6,8,9,14,17,19 22]. Normally, when watching a video, in particular a video scene, it is impossible for us to focus on the whole screen with lots of information including people, objects and background. The attention of the Human Visual System (HVS) is mainly on human faces, bodies or some impressive objects that are located at the center of the frame, which is called ROI [8,17,19].

Figure 1.1: The general video transmission system

The general video transmission system is depicted in Figure 1.1, where the raw video is encoded into bit information and transmitted, followed by a decoding process before being displayed to the users. In the conventional method of video coding, as can be seen from Figure 1.2, the whole video frame is compressed with the same coefficient, which means that the visual quality of all parts of the frame, including the tiger, the stick and the snow, is the same. In contrast, ROI coding compresses video with higher priority for ROIs than for the background. As a result, as shown in Figure 1.3, quality degradation appears only in the non-ROI regions (the stick and snow), while high visual quality is observed in the ROI, the tiger in this case.

Unlike conventional coding methods, ROI video coding needs to detect and track the ROI through the video sequence. The accuracy of the ROI detection and tracking processes guarantees the efficiency of this coding method, in which the video size is small enough to be transmitted smoothly whilst the perceived quality of the ROI satisfies viewers. In the encoding process, Rate Control (RC) plays an important role.

In fact, if parameters of the video encoder such as Quantization Parameter (QP), Group Of Pictures (GOP), motion estimation and search range are not changed, then the contents and details of the video frame are the key components which affect the output bitrate of the encoded video. The problem is that the network or the storage can hardly adapt to the video bitrate if it varies in a wide range. Hence, the output bitstream of the encoded video needs to be controlled to make sure that it can be transmitted conveniently through the network without losing packets. In other words, the total bitrate produced should be close to and satisfy the condition of the network bandwidth, as illustrated in Figure 1.4.

The method used to keep the bitrate fluctuating slightly around a constant target by adjusting the coding parameters is called RC. The most popular approach of RC is to modify the QP.
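As a rough illustration of RC by QP adjustment, the sketch below nudges the QP after each frame so that the produced bits stay near a target. The function name `adjust_qp`, the 10% tolerance and the one-unit step are hypothetical choices made for this example; they are not the RC model proposed in this thesis. The QP range 0 to 51 is the one used by H.264/AVC.

```python
def adjust_qp(qp, produced_bits, target_bits, tolerance=0.10, qp_min=0, qp_max=51):
    """Return the QP for the next frame given the bits spent on the last one.

    Hypothetical feedback rule: a larger QP means coarser quantization,
    hence fewer bits, so the QP moves against the bitrate error.
    """
    if produced_bits > target_bits * (1 + tolerance):
        qp += 1  # too many bits were produced: quantize more coarsely
    elif produced_bits < target_bits * (1 - tolerance):
        qp -= 1  # bits to spare: quantize more finely
    # Keep the QP inside the range the encoder accepts.
    return max(qp_min, min(qp_max, qp))
```

Real rate controllers use a rate model rather than a fixed step, but the direction of the adjustment is the same.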

1.2 Proposed approach summary

In this thesis, an interactive framework for ROI coding is proposed. 3D video is represented by the combination of 2D color frames and depth frames, which contain the depth information of objects in the video. ROIs are then detected and tracked based on this information. To be more specific, users choose a region that they want to focus on and this information is sent to the server; the server then detects and tracks the ROI and applies ROI coding. Finally, an ROI-compressed video is displayed to the users.

Thanks to the information provided by users, the ROI is detected and tracked accurately. Moreover, the problem in the case where the ROI disappears and then comes back

another ROI

Besides, for coding the ROI, an RC model is proposed, which is responsible for allocating bits for each Macroblock (MB) and then calculating the QP to apply compression in all regions (including ROI regions, the transition region and the non-ROI region) of the video, in order to balance between the internet bandwidth and the quality of the compressed video.
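The region-dependent QP idea can be sketched as follows, under several assumptions that are not taken from the thesis: macroblocks are 16x16 pixels, an MB counts as ROI if any of its pixels is in the ROI mask, the transition region is the ring of MBs touching an ROI MB, and the QP offsets in `QP_OFFSET` are made-up values. The names `classify_macroblocks` and `qp_map` are hypothetical.

```python
MB_SIZE = 16  # macroblock size in pixels (assumed, as in H.264/AVC)

# Hypothetical QP offsets per region; the thesis derives QP from bit allocation.
QP_OFFSET = {"roi": -4, "transition": 0, "non_roi": 4}


def classify_macroblocks(mask):
    """Label each macroblock 'roi', 'transition' or 'non_roi' from a pixel mask."""
    height, width = len(mask), len(mask[0])
    rows = (height + MB_SIZE - 1) // MB_SIZE
    cols = (width + MB_SIZE - 1) // MB_SIZE
    # An MB is treated as ROI if any of its pixels is marked as ROI.
    roi = [[any(mask[y][x]
                for y in range(r * MB_SIZE, min((r + 1) * MB_SIZE, height))
                for x in range(c * MB_SIZE, min((c + 1) * MB_SIZE, width)))
            for c in range(cols)] for r in range(rows)]
    labels = []
    for r in range(rows):
        row = []
        for c in range(cols):
            if roi[r][c]:
                row.append("roi")
            elif any(roi[rr][cc]
                     for rr in range(max(0, r - 1), min(rows, r + 2))
                     for cc in range(max(0, c - 1), min(cols, c + 2))):
                row.append("transition")  # touches at least one ROI macroblock
            else:
                row.append("non_roi")
        labels.append(row)
    return labels


def qp_map(mask, base_qp):
    """Per-macroblock QP: lower (finer) in the ROI, higher in the background."""
    return [[max(0, min(51, base_qp + QP_OFFSET[label])) for label in row]
            for row in classify_macroblocks(mask)]
```

For a 16x64 mask with a single ROI pixel in the top-left corner, the four MBs are labelled ROI, transition, non-ROI and non-ROI, so the QP rises from the ROI outwards.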


future works are discussed


In recent years, the content-based video processing method has become a popular alternative to the block-based or pixel-based approaches developed in some coding standards. The interesting or meaningful objects for viewers are called semantic visual information or, by a more specific name, semantic video objects in J]. There are many methods for extracting semantic visual information, and they can be divided into two separate groups: unsupervised and supervised methods. Unsupervised methods usually find it difficult to extract semantic visual information because of its vague definition, while the video scene may consist of many objects at once. Suppose that two different algorithms are utilized to determine semantic video objects in the same video scene; the results would be far different from each other and would not meet the expected results of users. On the contrary, a little effort from users makes it easier for the computer to detect the target objects in supervised extraction methods.


2.1 ROI detection

2.1.1 Unsupervised detection method

The unsupervised detection, or fully automatic extraction, performs the segmentation without any support from users, completely based on the characteristics of objects.

A depth map is generated either by camera capturing or by the disparity-matching-based depth creation algorithm in [22]. Foreground and background are segmented from the depth map, then combined with motion extraction and texture contour extraction to extract the ROI, after the processes of analyzing the histogram and eliminating noise. The whole extraction process is unsupervised. However, this algorithm also defines both dynamic and static objects in the foreground as ROI, while static objects are rarely of interest to users. This leads to redundancy in bit allocation.

Both luminance and chrominance information of YUV format frames are used in [17] to extract ROI. The saliency map is created first from the Y channel of the input, and then skin color detection is performed in the U and V channels. Finally, it applies a greedy algorithm to the saliency map, shrinking the area of the chosen region to extract the ROI. The objective function of this algorithm both maximizes the saliency points and minimizes the area of the ROI.

To model the attention objects, [4] used three attributes: attention value, edge set, and homogeneity measure, which makes this method fairly complicated. Due to the complexity of the ROI detection algorithms in [4] and [18], they are difficult to apply in portable devices.

There are two methods for ROI detection in [6] based on depth images, with or without skin-color detection. The first method calculates the histogram of depth values to determine a depth threshold for ROI, and all the points closer than the points at the depth threshold belong to the ROI. The latter method combines the depth information of the former with skin-color detection for ROI detection. These two methods create two different results, and they are not realistic due to the fact that close regions are not the most interesting objects for the viewers in many cases.

Unlike other research, [2] distinguishes the two terms "object" and "region" explicitly, which then helps in the segmentation of the video. In the segmentation process, the computer extracts a number of regions based on multiple features like color components, displacement values, position values and texture information. Several of these regions are combined to form an object in the semantic step by users' selection via a simple graphical user interface. The segmentation results seem to be accurate; however, the object formation is very complicated and time-consuming. As a consequence, this method is not suitable for real-time applications.

In [14], a frame is divided into 3 types of regions: the first boundary, the second boundary and the central region. This division, combined with the MAD estimation for each MB, helps to determine 5 types of MB in descending order of importance to the HVS: central MB, expanded MB, ordinary MB, lower MB and lowest MB. In this method, the central region is considered as ROI with higher probability than others. This method seems to be inefficient when the ROI is located far from the center, because the region that users focus on will then have lower quality than others.
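The depth-histogram thresholding attributed to [6] earlier in this subsection can be sketched as below. The text does not say how the threshold is picked from the histogram, so Otsu's method is used here purely as a stand-in. The convention that larger depth-map values mean closer to the camera, the 8-bit value range and the names `otsu_threshold` and `roi_mask` are likewise assumptions made for this example.

```python
def otsu_threshold(depth, levels=256):
    """Pick the depth level that best separates two classes (Otsu's method)."""
    hist = [0] * levels
    for row in depth:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    count_far = 0  # number of pixels with value <= t
    sum_far = 0    # sum of the values of those pixels
    for t in range(levels):
        count_far += hist[t]
        sum_far += t * hist[t]
        count_near = total - count_far
        if count_far == 0 or count_near == 0:
            continue  # one class is empty, no split at this level
        mean_far = sum_far / count_far
        mean_near = (sum_all - sum_far) / count_near
        # Between-class variance, up to a constant factor.
        var_between = count_far * count_near * (mean_far - mean_near) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def roi_mask(depth):
    """Mark as ROI every pixel closer than the threshold (assumed: larger = closer)."""
    t = otsu_threshold(depth)
    return [[value > t for value in row] for row in depth]
```

As the paragraph on [6] notes, such a mask simply follows closeness to the camera, which is not always what viewers find interesting.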

2.1.2 Supervised detection method

The supervised detection method falls into two smaller categories: manual and semi-automatic segmentation. The manual segmentation approach is a time-consuming process because it requires users to determine the borders of wanted regions in many frames. Hence, this method is not suitable for online transmission. Meanwhile, the semi-automatic segmentation method also needs support from users, who identify the regions' borders in one frame. The tracking process of this approach is completely automatic.

The contour-based method used in [J requires users to draw the boundary of their objects. This approach takes a tremendous effort from users, especially when the object is polygonal or has a non-specific shape. Moreover, the classification for such objects also requires an algorithm with high complexity, as it takes many steps to find the anticipated regions, not to mention the accuracy of object extraction.

The method of segmentation and tracking in [3] is semi-automatic: its segmentation process needs the assistance of users, whereas the tracking process is unsupervised and completely automatic, containing 4 iterative steps: motion prediction, motion estimation, boundary warping and boundary adjustment.

[20, 21] track ROI by both temporal and spatial methods, which means their mechanism inherits information not only from preceding frames in the same GOP but also from the neighboring views. Only the first frame of the central view is utilized to detect ROI in this method.


2.3 Rate control and ROI video coding

In all video encoders, RC is a mandatory part because it is used to achieve a source bit rate which meets the bandwidth requirement while minimizing the rate distortion. There has been much research on RC so far, including RC for texture video and RC for 3D video. R-λ-based RC is used in [9] for the cutting-edge video encoder High Efficiency Video Coding (HEVC) at three main levels: GOP, frame and coding unit, with an ROI-based model scheme. They propose a constant which is the ratio between the bitrates used for ROI and background. A wide variation of this constant creates a significant gap in visual quality between ROI and background.

The ROI-based RC scheme in [14,19] adopts the classic Rate-Quantization (RQ) model with a slight customization. In [9], they allocate more bits to encode the frame with more complexity. The QPs for the 5 types of MB increase by 1 unit in this order: central MB, expanded MB, ordinary MB, lower MB and lowest MB. The complicated split of MB types and the reallocation of bits for frames increase the complexity of bit allocation in the RC model.

There is a bit budget allocated for ROI and the remaining bits are spent for the background, since the ROI encoding process is performed prior to other regions in [19]. ROI is flexibly ranked in several interest levels according to users' requirement, with the minimum level of ROI considered as the background. The QP of ROI lies in a specific interval and will be chosen depending on the actual need. In [13], frames with strong local motion attention are assigned more bits than other frames in the same GOP, while in each frame, the more visually significant MBs are, the more bits they are allocated.

In 3D video coding, many RC models are proposed, each of which needs to solve several problems: bit allocation between color and depth views, and bit allocation at view level and frame level. QP is selected in the final stage at frame or MB level, depending on the need of coding. A synthesized view distortion model is presented in [12,16] to establish the joint bit allocation. A convex optimization problem is solved in [12] in order to determine the bitrate ratio between texture and depth. This ratio is adaptively changed, with texture video usually using more bits than depth video, except for the cases in which the background regions are simple and stationary and the depth maps are not smooth. [5,12] estimate the distortion between the synthesized view and the coded views of both texture video and depth maps by using a Rate-Distortion (RD) model. In [LY], after determining the optimal target bit rate between


to 24.41% of average bitrate for different video sequences while the Peak Signal to Noise Ratio (PSNR) of ROI is almost unchanged. Correspondingly, ΔQP = 4 in [17]. This method focuses on reducing the encoding complexity while the video quality is still acceptable.

I builds a protocol which diversifies the QP for both depth and color views from 20 to 44. In various test sequences, the PSNR of reconstructed videos reaches its peak when the bitrate spent for the depth sequence fluctuates from 40% to 50% of the total bitrate for the whole 2D plus depth video sequence. When the depth information is inadequate, the coding process fails in object segmentation and causes errors in the synthesized view. Conversely, if the depth occupies almost the total coding bitrate, it degrades the texture information significantly.

[3] proposes both ROI-based rate control for H.264/AVC and ROI-based computational power allocation, which helps save encoding time by up to 72% and makes this encoder suitable for real-time conversational communication or portable devices. The visual quality improves although the PSNR is a little lower than that of the original reference software (Joint Model (JM) 9.8).


Chapter 3

ROI detection and tracking

In this chapter, a framework of ROI video coding is presented. In order to encode video with ROI, it is necessary to detect the position of the ROI in the first frame and track the ROI in subsequent frames. In this framework, viewers are able to choose their ROIs by touching or clicking on the video display screen. Afterwards, the touched location is recorded and signalled to the server. Given an initial touching location, the video encoder at the server detects and tracks ROIs based on both color and depth information obtained from video plus depth sequences. According to the analysis of different methods in the previous chapter, including both their advantages and disadvantages, our proposed method is a semi-automatic approach. However, unlike other methods, we only need the human operator to touch the ROI, as presented in [10], instead of identifying its boundary.

Specifically, as can be seen from Figure 3.1, at the very first step of the proposed ROI coding method, users can interactively choose the regions they want to focus on the most. In spite of the fact that this detection method is apparently more complicated for users than other methods, the ROI information is exactly what the users want. The central pixel of the small region that users touch on the screen is called the anchor point; its coordinate is then sent to the server. The encoder at the server is responsible for detecting the ROI by utilizing the aforementioned location of the anchor point (Section 3.1). After the ROI is initialized, the server keeps tracking this region in subsequent frames throughout its movement until it no longer appears on the screen (Section 3.2). Based on its location, the MBs which cover the ROI will be encoded with higher quality than those of other regions (Chapter 4). The process of how many bits each MB is allocated in all regions and then the QP selection are presented in Section 4.1 and Section 4.2, respectively. Following these steps, the whole content of the compressed video is sent back to the users' device for decoding. Users then watch the video to which ROI video coding has been applied, with a quality suitable for their bandwidth capacity. In due course, viewers can choose other regions as their ROI and the above steps of ROI video coding will

Figure 3.2: A color frame of break-dancers video sequence


3.1 ROI detection

darkest and the brightest color, respectively. Figure 3.2 shows a color frame of the break-dancers video sequence and Figure 3.3 shows its corresponding depth map.

3.1.1 ROI region extraction

Normally, two points belonging to the same object have the same or approximately the same depth values. From the anchor point which is input by users, the ROI is determined by using the flood-fill algorithm. The basic idea of this algorithm is to visit pixels in the frame, beginning from the anchor point. We use a queue Q to store visiting pixels. From a pixel of the ROI, the algorithm expands in 4 directions to get the information of 4 neighboring pixels. The locations of these neighboring points are enqueued to Q. Elements in queue Q are sequentially dequeued to check whether they belong to the ROI or not. The condition for this step is very simple: a pixel belongs to the ROI if and only if the difference in depth value between this point and its adjacent point (which was previously determined as a point of the ROI) is less than or equal to a defined threshold. We empirically set this threshold value to 3. After being determined as a part of the ROI, a pixel is in turn expanded in 4 directions to discover other points, except the ones which have already been visited or are still in queue Q.


Algorithm 1 ROI detection using flood-fill algorithm

procedure DETECTROI(frame, anchor)

procedure checkROI(Point p)

if |p.depth() − refDepth| ≤ threshold then

refDepth ← P.dequeue()

checkROI(current)

return
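Since only fragments of Algorithm 1 survive here, the following is a minimal Python sketch of the flood-fill detection as described in Section 3.1.1, not a transcription of the original listing. It assumes the depth map is a list of rows of integers and the anchor is a (row, column) pair; the names `detect_roi`, `roi` and `seen` are hypothetical. The depth test is applied when a neighbor is first reached rather than when it is dequeued, which yields the same ROI because each pixel is examined once against the ROI pixel that discovered it.

```python
from collections import deque


def detect_roi(depth, anchor, threshold=3):
    """Grow the ROI from the anchor pixel over a depth map (list of rows)."""
    height, width = len(depth), len(depth[0])
    ay, ax = anchor
    roi = [[False] * width for _ in range(height)]
    seen = [[False] * width for _ in range(height)]  # visited or already queued
    roi[ay][ax] = seen[ay][ax] = True
    queue = deque([(ay, ax)])  # the queue Q of ROI pixels still to expand
    while queue:
        y, x = queue.popleft()
        # Expand in the 4 directions: up, down, left, right.
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width and not seen[ny][nx]:
                seen[ny][nx] = True
                # A pixel joins the ROI if its depth is within the threshold
                # of the adjacent pixel already accepted as ROI.
                if abs(depth[ny][nx] - depth[y][x]) <= threshold:
                    roi[ny][nx] = True
                    queue.append((ny, nx))
    return roi
```

Because the comparison is made against the neighboring ROI pixel rather than the anchor, the region can follow a surface whose depth changes gradually, while a sharp depth step stops the growth.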

Title	Region-Of-Interest 3D Video Coding Based On Depth Images
Author	Pham Thanh Nam
Supervisor	Dr. Le Thanh Ha
Institution	Vietnam National University, Hanoi University of Engineering and Technology
Major	Technology
Type	Thesis
Year	2015
City	Hanoi
