
Context-based Visual Object Segmentation

Wei Xia

(B.Eng., Huazhong University of Science and Technology)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Department of Electrical and Computer Engineering

National University of Singapore

2014


Throughout the four years of my PhD study, many people have helped and supported me, and I owe them my thanks. First and foremost, I would like to express my deep gratitude to my two supervisors, Prof. Loong Fah Cheong and Prof. Shuicheng Yan. In my first semester, Prof. Cheong helped me lay a solid theoretical foundation and find my research interest among the wide range of topics in computer vision and machine learning. Then, under the patient guidance of Prof. Yan, I completed the work on object semantic segmentation that forms the main body of this thesis. I enjoyed working with them; their passion and professionalism in research, attention to detail, complete commitment and great personality have inspired me and will keep benefiting me in my future life.

I would also like to thank my seniors Ju Sun and Jiashi Feng for their patient guidance when I was struggling at the beginning of my PhD study. Special thanks go to Dr. Csaba Domokos for his great professionalism and perfectionism, which helped me win the PASCAL VOC Challenge and publish top-tier papers; I learned a lot from collaborating with him. I also want to thank Jian Dong and Junshi Huang, who are both my room-mates and lab-mates and who helped me a great deal in both academic work and daily life. I will always remember the days when we discussed till late at night. Besides, I met a lot of great friends here in the Learning and Vision Lab: Qiang Chen, Zheng Song, Luoqi Liu, Min Lin, Si Liu, Mengdi Xu, and others. Furthermore, I would like to express my sincere gratitude to Mr. Zhongyang Huang for providing me with the opportunity of an internship at Panasonic Singapore Laboratory. Under his guidance, I learned a lot about industrial research from the interesting projects we did together.

Last but not least, I want to thank my parents for their everlasting support and care. Finally, I want to express my appreciation to my wife, Chong Chen. Without her love, companionship and encouragement during the difficult times, I would not have been able to achieve this goal along this long journey of PhD study. This thesis is dedicated to her.


1.1 Historical Background
1.1.1 Image Classification
1.1.2 Object Detection
1.1.3 Image Segmentation
1.1.4 Semantic Segmentation
Bottom-up Approaches
Top-down Approaches
Integrative Approaches
1.1.5 Obtaining Contextual Information
1.2 Thesis Focus and Contributions
1.3 Organization of the thesis
1.3.1 Relevant Publications

2 Segmentation over Detection via Optimal Sparse Reconstructions
2.1 Introduction
2.1.1 Motivation and Contributions
2.1.2 Related Work
2.2 Proposed Solution
2.2.1 Figure-ground Segmentation
2.2.2 Coupled Global and Local Reconstruction
2.3 Optimization Procedure
2.3.1 Optimization with respect to x and x̃_1, ..., x̃_r
Sub-problem 1: x
Sub-problem 2: x̃_1, ..., x̃_r
2.3.2 Optimization with respect to m
2.4 Numerical Implementation
2.5 Experiments
2.5.1 Convergence Analysis
2.5.2 Effects of the Size of Local Patches and Super-pixels
2.5.3 Effects of the Detection Results
2.5.4 Parameter Estimation
2.5.5 Proof-of-Concept Experiments
2.5.6 Comparison on the PASCAL VOC Datasets
Performance gain from mask refinement
Comparison on VOC’10, VOC’11 and VOC’12
2.5.7 Comparison on the Weizmann-Horses Dataset
2.6 Chapter Summary

3 Semantic Segmentation without Annotating Segments
3.1 Introduction
3.2 Related Work
3.3.1 Bounding Box Score Normalization
3.3.2 Object Shape Guidance Estimation
3.3.3 Graph-cut Based Segmentation
3.3.4 Merging and Post-processing
3.4 Experimental Results
3.4.1 Proof of the Concept
3.4.2 Comparison with the State-of-the-arts

4 Background Context Augmented Hypothesis Graph for Object
4.1 Introduction
4.2 Related Work
4.3.1 CRF-based Formulation
4.3.2 Background Context Modeling
4.3.3 Merging and Post-processing
4.4 Implementation Details
4.5 Experimental Results
4.5.1 Proof-of-concept Experiments
Effects of the sub-category number t
Effects of the post-processing parameters
Effects of the CRF model
Effects of different contextual cues
4.5.2 Comparison with the State-of-the-arts
VOC 2012
MSRC-21 dataset

5 Conclusion and Future Work
5.1 Thesis Conclusion
5.2 Discussion of the Future Directions


Visual object recognition is one of the most fundamental problems in artificial intelligence. It is mainly divided into three tasks: object classification, object detection and object segmentation. Classification tells what objects the image contains; detection predicts the bounding box location of each object; segmentation assigns a category label from a predefined label set to every pixel in the image. In this thesis, we aim to solve the problem of object segmentation. It has been shown that the three tasks are strongly correlated, so that both classification and detection can provide useful contextual information to guide the segmentation process. We first propose a detection-based method that formulates the segmentation task as pursuing the optimal latent mask in a nonparametric manner inside the predicted bounding box, via sparse reconstruction of the ground-truth masks over the training set. By taking into account both global and local constraints, a coupled convex optimization framework is proposed. By alternately optimizing the sparse reconstruction coefficients and the latent optimal mask using Lasso and Accelerated Proximal Gradient methods, a globally optimal solution can be achieved.

Furthermore, ground-truth segment annotation is generally very difficult to obtain, while object bounding boxes can be obtained much more easily. We therefore propose a segmentation approach based on detected bounding boxes that requires no additional segment annotation from either the training set or user interaction. Based on a set of segment hypotheses, a simple voting scheme is introduced to estimate the shape guidance for each bounding box. The derived shape guidance is used in the subsequent graph-cut-based figure-ground segmentation, and the final segmentation result is obtained by merging the segmentation results in the bounding boxes.

Finally, inspired by the significant role of context information, and going beyond global classification and detection, we explore contextual cues from the unlabeled background regions, which are usually ignored. A fully connected CRF model is built over a set of overlapping hypotheses from CPMC, and the background contextual cues are learned from the unlabeled background regions and applied in the unary terms of the corresponding foreground regions. The final segmentation result is obtained via maximum-a-posteriori (MAP) inference, after which the segments are merged by sequential aggregation. The proposed model generalizes well: other contextual cues, such as global classification and detection, can easily be integrated into it to further boost performance.

To evaluate the effectiveness of the proposed algorithms, extensive experiments are conducted on various benchmark datasets, ranging from the challenging PASCAL VOC to the Weizmann Horse, GrabCut-50 and MSRC-21 datasets, among others. The proposed approaches achieve new state-of-the-art performance. Based on the above methods, we won the winner prize in the segmentation competition of the PASCAL VOC Challenge 2012.


List of Tables

2.1 List of notations

Dataset [1] by changing the size of the local patches and super-pixels

Dataset [1] based on different object detectors. BA is the baseline algorithm based on coarse masks, while Proposed is our sparse reconstruction framework

2.4 Study of the effects of different algorithm parts of the proposed method on the VOC’10 trainVal dataset [1] in IoU accuracy defined in (2.14)

defined in (2.14) provided by the previous methods on the VOC’07, VOC’10, VOC’11 and VOC’12 test datasets [1]. Note that the methods marked with * use extra annotation to train the model, while all other methods are trained by making use of the VOC annotations only

2.6 Comparison with the state-of-the-art methods on the Weizmann-Horses Dataset [2]. Accuracy measured by the percentage of the correctly labeled pixels

2.7 Statistics of the segmentation accuracy (δ) obtained from the pro-

on VOC 2011 test dataset [1]


3.2 Comparison of segmentation accuracy provided by previous methods on VOC 2012 test dataset [1]

GrabCut-50 dataset

4.1 The effects of different pooling methods with various numbers of categories (t) for BC modeling in terms of the average IoU accuracy,

4.2 The effects of different CRF models in terms of the average IoU accuracy, defined in (4.9), on the PASCAL VOC 2012 TrainVal dataset [1]

4.3 State-of-the-art comparison in terms of IoU accuracy, defined in (4.9), obtained on the PASCAL VOC 2012 Test dataset [1]. The results in parentheses are evaluated on an extended training dataset containing extra annotation provided in [3]

per-class accuracy, defined in (4.10). The Prop-WeaklySup method uses no annotation data from the 6 background categories and is evaluated only on the 15 foreground objects, while the Prop-FullySup method uses the annotation data across all the 21 background categories. FgAvg is the average per-class performance of 15 foreground objects, while FullAvg is the average performance of all the 21 categories


given test image, the object bounding boxes with their class labels are predicted. They are then cropped and figure-ground segmentations are calculated through a coupled sparse reconstruction framework by making use of a set of training images. The final result is obtained by merging the results of the figure-ground segmentation followed by post-processing

reconstruction. For a given bounding box predicted by the object detector in the test image, all training images for the same category are cropped and normalized, and then the segmentation mask is obtained through a sparse image and mask reconstruction framework. For the local part reconstruction, the bounding box and the training masks are first partitioned into small regular patches and those patches are reconstructed locally to handle local distortions

2.3 An example for the reconstruction error by applying the ℓ1 (b) and ℓ2 (c) norm in the mask reconstruction. The ground truth and the reconstructed image are put in the red and green channel, respectively. The overlapping area (i.e. the region of correct reconstruction) is shown in yellow. δ stands for the absolute difference of the ground truth and the reconstructed image

initializations of the picture sheep presented in Fig. 2.7. The small variance of the accuracy demonstrates that the globally optimal solution is obtained from the proposed optimization framework regardless of the initialization

performance (IoU, defined in (2.14)) for a rigid category, bus (top), and an articulated category, human (bottom). The x-axis is in logarithmic scale

dataset [1] obtained by our baseline (BA), the coupled global and local reconstruction framework (BA+GL) and the Full model (BA+GL+LO)

(Better viewed in color) The first four rows show results for images containing single and multiple objects from the same category, while the last two rows show results for more complicated images containing multiple interacting objects. The results are overlaid on the images with white boundaries and different colors corresponding to different categories as specified in VOC [1]

2.9 Some failure examples on the VOC’12 dataset [1] due to wrong reconstruction, mis-detection (e.g. the puppet is mis-detected as human) as well as wrong labeling (e.g. the dog is wrongly labeled as cat)


2.10 Some exemplar results, overlaid on the images with yellow color and white boundaries, on the Weizmann-Horses Dataset [2] obtained from

pixels

3.2 Overview of the proposed approach. First, the object bounding boxes with detection scores are extracted from the test image. Then, a voting based scheme is applied to estimate object shape guidance. By making use of the shape guidance, a graph-cut-based figure-ground segmentation provides a mask for each bounding box. Finally, these

3.4 Effect of the distortion of bounding box

pose, cluttered background and occlusion

our proposed method (DET3-Proposed). The results are overlaid on the images with white boundaries, and different colors correspond to different categories. (Best viewed in color.)

The results are overlaid on the images with white boundaries, and different colors correspond to different categories. (Best viewed in color.) The first image is due to mis-detection of the small horse. The second one is due to wrong bounding box prediction, since the cloth is labelled as person and the parrot (bird) is mis-detected. The third one is due to inaccurate bounding box prediction (i.e. wrong label for the bottle), which resulted in inaccurate estimation in the graph-cut formulation


3.8 Some segmentation results, overlaid on the images with blue color and white boundary, on the GrabCut-50 dataset [5] obtained by the proposed method

4.1 Illustration of the role of background context information (e.g. sky or indoor). In many cases it can help recognize the objects (e.g. the bird instead of boat or the potted plant instead of tree).

4.2 Overview of the proposed solution. First, a pool of object hypotheses is generated. A fully connected hypothesis graph is then built to model the relationship between the possibly overlapping segments. A novel background contextual cue is predicted for the segments via sub-category classifiers. The scores are fed into the CRF model together with other cues like image classification and object detection. Finally, the coarse segmentations obtained via MAP inference are merged and post-processed to achieve the final segmentation result

4.3 Exemplar sub-category clusters for the horse category from the PASCAL VOC 2012 TrainVal dataset [1]. Each row shows some images with a certain sub-category. It is observed that each cluster shares significant consistency among both the foreground horse objects and the background regions

4.4 Illustration of the effects of post-processing parameters: τ1, τ2 (Top)

(4.9), on the PASCAL VOC 2012 TrainVal dataset [1]

TrainVal dataset [1] after separately applying supervised cues (i.e. CLS and DET cues), the weakly-supervised BC cues, and their combination referred to as the Full model. The results obtained by using the unary term exclusively are also shown. The baseline accuracy is shown in parentheses

4.6 Qualitative illustration for the effects of different contextual cues, like image classification (CLS), object detection (DET) and the background context (BC). When all the three cues are applied, we refer to it as the Full model. The results corresponding to the last column are obtained by using the unary terms exclusively in the CRF model. The results are overlaid with white boundaries and different categories are highlighted with different colors

PASCAL VOC 2012 Test dataset [1]

The results are overlaid with white boundaries and different categories are highlighted with different colors. Some failure cases due to wrong labelling and/or missed prediction are shown in the last column. For instance, the dog body is wrongly labeled as cat in the first row; the cloth is labelled as human due to the low scores. The bird is missed, since its large portion is occluded. The bus and the bottle in the last two rows are heavily occluded and the contrast is also too low

MSRC-21 dataset [4]. The results are overlaid with white boundaries and different categories are highlighted with different colors. The last column shows the failure cases. In the first one the face is mis-labeled, while in the second one the face is wrongly labeled as book


Chapter 1

Introduction

Artificial Intelligence (AI), defined as the science and engineering of making machines that are intelligent like humans, able to perceive the surrounding environment and take appropriate actions, has drawn tremendous interest among researchers since the invention of the computer. The AI field is interdisciplinary: a number of sciences converge in it, including computer science, psychology, linguistics, philosophy and neuroscience. The core problems of AI research include perception, reasoning, knowledge, planning, learning, natural language processing (communication) and the ability to move and manipulate objects.

To enable an intelligent machine system, perception is among the few important problems that need to be solved first. Machine perception, meaning the ability to use input from sensors (e.g. cameras, microphones, sonar and other more exotic ones) to gain knowledge of the world, is mainly divided into sub-fields such as computer vision and speech recognition. Computer vision, the ability to analyze and understand visual input, is a very important branch of machine perception, since visual information accounts for more than 70% of all the information humans perceive. One key sub-field of computer vision is visual recognition, which helps machines recognize and localize important concepts in the world. This thesis focuses on different tasks of visual recognition.

It is observed that humans, even three-year-old children, perform extremely well in most vision tasks such as face recognition and object recognition; how to emulate such capabilities by computer has therefore been the focus of many visual recognition researchers. Visual recognition is usually divided into three sub-tasks: classification, detection and segmentation (see Fig. 1.1). As shown in Fig. 1.1, given a test image, object classification predicts the presence or absence of an example of a certain category within the image; object detection predicts not only the category labels of the objects but also their locations, delimited by bounding boxes; object segmentation goes further by distinguishing different objects at the pixel level, generating pixel-wise segmentations that give the class of the object visible at each pixel.

Figure 1.1: Different sub-fields of visual recognition. Classification predicts the object category labels at the image level, detection localizes the objects by bounding boxes, while segmentation assigns every pixel the object category it belongs to.

In this thesis, we aim to solve the semantic segmentation problem. Strictly speaking, object segmentation is not a well-posed problem, since the definition of an object is not precise. For example, a forest can be seen as a single object as a whole, but individual trees can easily be localized and recognized when viewed from a closer distance. Likewise, a shirt can be considered either an individual object or part of the human body. Therefore, it is necessary to pre-define a set of possible category labels to be assigned to every pixel. In general, semantic object segmentation is a very challenging problem, mainly due to variations in pose, scale, position and illumination, partial occlusion and large intra-class variety.


Previous methods often followed a pipeline of over-segmentation [6–8], region representation [9, 10] and region labeling inference [11, 12]. A test image is first over-segmented into coherent regions such as super-pixels [6, 8]; then local or mid-level features, such as SIFT [13], HOG [14] and LBP [15], are extracted to represent these regions; finally, all of the features are fed into classification models such as SVM or CRF for the final labeling inference [11, 16]. Although this pipeline has achieved significant progress in the past decades, the representational power of local features is often limited when higher-level context information is not considered. It has been shown that classification, detection and segmentation are essentially the same problem at different scales, from image-level classification through mid-level detection to pixel-level segmentation, and they are so closely correlated that each can provide useful contextual information for the others. For example, classification results provide image-level labels that narrow down the labeling space of the local regions, while detection provides localization information that enforces spatial constraints on neighboring local regions. Conversely, segmentation results that are aligned with the exact object boundaries can be converted directly into tight bounding boxes to refine the detection results, or can provide better spatial support for feature pooling in order to boost the performance of image classification.
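As a small illustration of how a segmentation result can be converted into tight bounding box information, the following Python sketch computes the box enclosing the foreground pixels of a binary mask. It is a minimal sketch rather than the implementation used in this thesis; the function name `mask_to_bbox` and the list-of-rows mask format are assumptions made for this example.

```python
def mask_to_bbox(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around the
    non-zero pixels of a binary mask, or None if the mask is empty.

    `mask` is assumed to be a list of rows, each a list of 0/1 values.
    """
    # Rows that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column indices of every foreground pixel.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


example_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
example_box = mask_to_bbox(example_mask)  # (1, 1, 3, 2)
```

The resulting box could then replace or refine a detector's looser prediction for the same object.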

In this thesis, we mainly focus on the effects of various kinds of contextual information on visual object segmentation problems. We explore various ways to obtain and utilize this contextual information, including global image classification, object segmentation, inter-object occurrence and object-background relationships. Before we delve into the details, we first give a comprehensive review of the historical background of the related work.

1.1 Historical Background

In this section, we first briefly review image classification and detection, then present the historical background of image segmentation and semantic segmentation, and finally introduce contextualization, including various methods to obtain and utilize context information.

1.1.1 Image Classification

Image classification methods mainly fall into two categories: bag-of-words (BoW) based [10, 17–19] and deep learning based [20–23]. Traditional BoW models usually follow a standard pipeline: feature extraction, feature coding, feature pooling and classification. For feature extraction, various low-level features, including SIFT [13], HOG [14] and LBP [15], are extracted either on a dense grid or from sparse interest points. Then different coding schemes, such as Vector Quantization (VQ) [24], Sparse Coding [25], Locality-constrained Linear Coding (LLC) [26] and Fisher Kernel [27], are applied to code the features into a consistent representation, followed by pooling techniques such as Spatial Pyramid Matching (SPM) [10], ScSPM [28] and Generalized Hierarchical Matching (GHM) [18] to generate the final global image representation. Finally, for classification, conventional models such as random forests [29] and SVM [30] are used to classify the global features into different categories. Recently, beyond modeling the region features alone, context information such as the spatial location of the object and the background scene has been integrated to improve the classification accuracy [17, 18]. Although traditional BoW-based methods have achieved great progress on benchmarks such as [1], and many papers try to improve this pipeline, the hand-crafted features involved require great skill to design and may not be optimal for different tasks, which significantly limits the generalization of such models across datasets.
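The coding and pooling steps of the BoW pipeline described above can be sketched as follows. This is a minimal illustration assuming hard-assignment vector quantization and a single global sum pooling with L1 normalization; the toy codebook, the descriptors and the function names are hypothetical, and real systems learn the codebook (e.g. by clustering) from SIFT-like descriptors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vq_encode(descriptors, codebook):
    """Feature coding: map each descriptor to the index of its nearest codeword."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(d, codebook[k]))
        for d in descriptors
    ]


def bow_histogram(descriptors, codebook):
    """Feature pooling: count codeword occurrences and L1-normalize."""
    counts = [0] * len(codebook)
    for k in vq_encode(descriptors, codebook):
        counts[k] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy 2-D example with two codewords (assumed values for illustration only).
toy_codebook = [(0.0, 0.0), (1.0, 1.0)]
toy_descriptors = [(0.1, 0.0), (0.9, 1.0), (1.0, 0.8), (0.2, 0.1)]
toy_histogram = bow_histogram(toy_descriptors, toy_codebook)  # [0.5, 0.5]
```

The histogram is the global image representation that would be passed to a classifier such as an SVM; spatial pyramid variants compute one such histogram per image sub-region and concatenate them.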

In contrast to hand-crafted features, features learnt by deep neural networks have shown great potential in various visual recognition tasks, especially classification. Deep learning methods try to extract high-level abstractions of visual data through multiple non-linear transformations, such as convolution and max-pooling in the state-of-the-art Convolutional Neural Network (CNN). CNNs have demonstrated extraordinary power on many databases such as ImageNet [31]. Furthermore, it has been shown that CNN models pre-trained on a large dataset with sufficient data diversity and category distribution can be transferred to extract features for other image datasets that may lack sufficient training data [21–23, 32].

1.1.2 Object Detection

Similar to image classification, mainstream object detection methods also fall into two categories: sliding window based [14, 33–36] and deep learning based [37–39]. Since the object of interest can be found at any location and with different poses in the image, the sliding window has been the dominant paradigm for quite a long time. By exhaustively searching sliding windows over different locations, scales and aspect ratios, the detection problem becomes an image classification problem: determining whether each window contains the object of interest or not. Dalal et al. [14] first proposed an efficient pedestrian detection algorithm based on the HOG feature. Felzenszwalb et al. [33, 34] extended this to more complicated categories by introducing the Deformable Part-based Model (DPM), which models the relationship between different object parts. DPM was the state of the art for a long time on many challenging benchmarks, e.g. [1], and many other models [35, 36] followed a similar pipeline. However, the performance of such methods has plateaued in the past few years. One possible reason is that only the information inside the bounding box is modeled, without taking the surrounding context information into consideration; another is the limited representational power of hand-crafted features such as HOG. For instance, HOG models rigid objects with large edge contrast well, but is less accurate for articulated objects with textures.
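The exhaustive search over locations, scales and aspect ratios that underlies the sliding window paradigm can be sketched as below. This is only an illustrative enumeration of candidate windows; the parameter names and default values (`base`, `scales`, `aspect_ratios`, `stride`) are assumptions for this example, and in a real detector each window would be scored by a classifier.

```python
def sliding_windows(img_w, img_h, base=64, scales=(1.0, 2.0),
                    aspect_ratios=(1.0,), stride=32):
    """Yield candidate windows (x, y, w, h) that lie fully inside the image.

    For each scale s and aspect ratio ar (width / height), the window has
    area (base * s) ** 2 and is shifted over the image by `stride` pixels.
    """
    for s in scales:
        for ar in aspect_ratios:
            w = int(round(base * s * ar ** 0.5))
            h = int(round(base * s / ar ** 0.5))
            if w > img_w or h > img_h:
                continue  # window does not fit in the image at this scale
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    yield (x, y, w, h)


# A 128x128 image gives 9 windows of size 64x64 and 1 window of size 128x128.
windows = list(sliding_windows(128, 128))
```

Even this coarse setting produces ten windows for a tiny image; with dense strides and many scales the number grows quickly, which is one motivation for the proposal-based methods discussed next.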

Motivated by the impressive achievements of deep learning features in image classification tasks, many researchers have tried to extend deep neural networks to object detection [37–39]. Szegedy et al. [37] made the first step by formulating object detection as a CNN-based regression problem to object bounding boxes. Later, Erhan et al. [38] proposed a saliency-inspired neural network model for detection, in which a set of class-agnostic bounding boxes is predicted along with a confidence score for each. The model can handle a variety of instances for each class and allows for cross-class generalization at the highest levels of the network. However, the performance of the above models is only comparable to that of the state-of-the-art DPM-based models. Girshick et al. [39] applied a proposal-based CNN framework, pre-trained on the ImageNet classification dataset [31], to extract CNN features from a large set of object proposals [40], and used linear SVMs to classify them, followed by bounding box regression to refine the final detection results. The method performed significantly better than all previous state-of-the-art methods, demonstrating the potential of such CNN-based methods.

1.1.3 Image Segmentation

Image segmentation, which aims to partition an image into several coherent regions, has a very long history. One of the first methods was published more than 40 years ago [41]. Starting from a seed patch, it iteratively merges small neighboring patches with similar gray-level statistics until none of the neighboring patches is sufficiently similar to the region being grown. This method is quite intuitive, taking advantage of one of the fundamental grouping heuristics, namely that neighboring pixels with different colors tend to belong to different objects; however, it cannot handle complicated patterns such as texture. Afterwards, Mean Shift [42], Normalized Cuts [43] and the Graph-Based Segmentation method [8] were proposed to generate better segmentation results. Such methods aim to generate a single optimal segmentation that covers the whole image in a non-overlapping manner, which is quite difficult due to the ambiguity of the low- and mid-level cues.

Some other methods generate multiple segmentations from an image, either independently or organized in a hierarchy. Russell et al. [44] proposed to discover objects by computing Normalized Cuts for different numbers of segments and image sizes. Malisiewicz [45] tried to merge pairs and triplets of segments obtained from Mean Shift [42] or Normalized Cuts [43]. Rabinovich et al. [46] selected reoccurring segments, as they are potentially more stable. For hierarchical segmentation, Sharon et al. first generated and combined multiscale measurements of intensity contrast, texture difference and boundary integrity [47], and later [48] proposed a multi-grid normalized cut technique to generate segmentation results at multiple levels of granularity. Arbeláez et al. [49] produced a segment hierarchy, called the ultrametric contour map (UCM), by iteratively merging superpixels based on the learned globalPb boundary detector [50]. A hierarchy is a very robust representation, since image contents are intrinsically compositional; however, such a single hierarchical structure is error-prone, because errors at one level tend to propagate to all the coarser levels.

In order to obtain more spatially coherent results, more statistical information about real-world images should be leveraged. Some salient object detection algorithms [51, 52] generate segmentations based on human attention. Ren et al. [53] used a classifier to identify good segmentations generated by combining different superpixels. Endres et al. [54] applied a learned affinity measure to group superpixels into more meaningful regions based on a structured learning approach. Levinshtein et al. [55] developed a figure-ground segmentation based on parametric max-flow principles. Similarly, Carreira et al. [56] proposed a constrained parametric min-cut algorithm to generate a large set of object hypotheses.

1.1.4 Semantic Segmentation

While image segmentation tries to generate meaningful regions, semantic segmentation further assigns a category label to each region. Generally, semantic segmentation methods are either bottom-up or top-down.

Bottom-up Approaches

Bottom-up approaches extract various low- and mid-level image features (e.g. SIFT [13], HOG [14], texton [57]) and try to find homogeneous segments based on these image cues. A common approach is to use a graphical representation [47, 58], where the nodes represent pixels or super-pixels and the graph is partitioned into several subgraphs corresponding to different regions. In [11], a large pool of figure-ground hypotheses is generated by solving constrained parametric min-cut (CPMC) [56] problems with various choices of the parameter. The hypotheses are ranked and classified by support vector regressors (SVR) based on their “objectness”. Analogous to average and max-pooling, second order pooling (O2P) is proposed in [9] to encode the second order statistics of local descriptors; employing this pooling technique yields a significant improvement. In [59], a composite statistical inference (CSI) method is applied to the original CPMC method. Generally, CPMC-based methods [9, 11, 56] alleviate the problem by exploiting object-level segments that have high overlap with ground truth objects. However, the inter-segment relationships and the segment background information are generally not well modelled, especially for visually confusing categories. Hence, these methods still cannot guarantee perfect classification and ranking of the segments. Küttel et al. [60] proposed a figure-ground segmentation framework in which the training masks are transferred to object windows on the test image based on visual similarity. These masks are then used to derive appearance and location information for graph-cut-based minimization. A similar idea is proposed in [61], where a class-independent shape prior is introduced to transfer object shapes from an exemplar database to the test image. This prior information is enforced in a graph-cut formulation to obtain the figure-ground segmentation. Generally, bottom-up methods provide visually coherent segments; nevertheless, their main difficulty is that inference from local information alone is usually ill-posed. Therefore, local methods that do not model objects globally tend to generate visually consistent segmentations rather than semantically meaningful ones.
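To illustrate what pooling second order statistics of local descriptors means, the sketch below averages the outer products of the descriptors inside a region, in analogy with how average pooling averages the descriptors themselves. It is a simplified illustration only: the function name is hypothetical, and any further normalization or mapping steps used by the method in [9] are deliberately left out.

```python
def second_order_avg_pool(descriptors):
    """Average the outer products x x^T of a list of d-dimensional descriptors.

    Returns a d x d matrix (list of rows) that captures pairwise
    correlations between descriptor dimensions within the region.
    """
    n = len(descriptors)
    d = len(descriptors[0])
    pooled = [[0.0] * d for _ in range(d)]
    for x in descriptors:
        for i in range(d):
            for j in range(d):
                pooled[i][j] += x[i] * x[j] / n
    return pooled


# Two toy 2-D descriptors (assumed values for illustration only).
toy_region = [(1.0, 0.0), (0.0, 2.0)]
toy_pooled = second_order_avg_pool(toy_region)  # [[0.5, 0.0], [0.0, 2.0]]
```

Unlike first-order average pooling, which would map the toy region to the single vector (0.5, 1.0), the pooled matrix keeps information about how the descriptor dimensions co-vary.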

Top-down Approaches

Top-down approaches generally rely on acquired class-specific information, e.g. class label, object bounding box [60, 62] and shape model [63]. Brox et al. [64] applied so-called poselets to predict masks for numerous parts of an object. The poselets are aligned to the object contours, and then they are aggregated into an object. Arbeláez et al. [65] proposed region-based object detectors that integrate top-down poselet detectors and global appearance cues. This method produces class-specific scores for the regions and aggregates multiple overlapping candidates through pixel classification in order to obtain the final segmentation results. A unified probabilistic framework based on supervised learning is presented in [66]. For the local appearance model a Fisher kernel is used, and the segmentation process is guided by low-level segmentations which enforce local and global image-level consistency. Another semantic segmentation method was presented in [67], where sparse coding is introduced as a high-level descriptor of the regions, which leads to less quantization error than the traditional bag-of-words (BoW) method. A shape prior can also provide useful top-down guidance. The work in [68] models a category of shapes as a finite-dimensional manifold, the shape prior manifold, which is approximated from the shape samples using the Laplacian eigenmap technique. Rathi [69] proposed to use a projection method from the shape space to the manifold, while Walder [70] constrained the mapping between the shape space and the lower-dimensional manifold space to be a diffeomorphism, which could ensure that pairwise distances are preserved. The works in [63, 71] proposed frameworks for object segmentation using a shape prior with graph cut. For top-down methods the main challenge is to obtain the object templates, especially for objects with relatively large intra-class appearance and pose variations.

Ladický et al. [72] proposed a multilevel hierarchical Conditional Random Field (CRF) model to incorporate information from different scales, which is combined with top-down detectors and global occurrence information [12] to guide the graphical inference. Boix et al. [16] introduced the so-called “harmony potential”, which integrates global category label information as well as object detectors in order to better fuse global and local information. Although CRF-based models have a strong generalization ability to integrate different cues, the modelling and training of these kinds of methods are relatively difficult due to the expensive estimation of the partition function.

1.1.5 Obtaining Contextual Information

The importance of context for good classification and detection has been witnessed in many papers [17, 73]. In practice, context can be any information not directly produced by the appearance of an object. In many cases, local appearance is not enough to correctly classify a local image region, while context is able to provide useful guidance to disambiguate. The works in [17, 73] introduced contextualization techniques between detection and classification. Numerous contextual cues, like the global scene layout [74], object detection [64, 65, 75, 76], as well as the interaction between objects and regions [77–80], are integrated in a variety of visual recognition models. The method proposed by Cinbis and Sclaroff [80] makes use of relative locations and scores between pairs of detections. Tu and Bai proposed a learning algorithm for high-level vision problems (e.g. medical image segmentation), called auto-context [81]. It first learns a classifier on local image patches, and then the discriminative probability maps are fed into a new classifier as contextual information to boost the performance. Buló et al. [82] introduced structured local predictors to exploit contextual relations from complex interactions between labels and intermediate representations of the image data in a convex structured learning problem. The above contextualization methods are mostly based on detection and classification. In this work we comprehensively explore how to utilize different context cues to help the object segmentation task.
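The auto-context idea described above (a classifier on local appearance, followed by classifiers that also see the previous probability map) can be sketched on a toy one-dimensional "image". This is only an illustration under strong assumptions: a hand-written logistic regression stands in for the classifier of [81], the context is sampled at four fixed offsets, and training and prediction use the same toy data. All function names and parameters are hypothetical.

```python
import numpy as np


def fit_logistic(X, y, lr=0.5, steps=500):
    """Gradient-descent logistic regression; returns weights (bias last)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        grad = Xb.T @ (p - y) / len(y)
        w -= lr * grad
    return w


def predict_proba(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-(Xb @ w)))


def context_features(prob, offsets=(-2, -1, 1, 2)):
    """Previous-round probabilities sampled at fixed offsets (edge-padded)."""
    pad = max(abs(o) for o in offsets)
    padded = np.pad(prob, pad, mode="edge")
    n = len(prob)
    return np.stack([padded[pad + o: pad + o + n] for o in offsets], axis=1)


def auto_context(appearance, labels, rounds=3):
    """Cascade: round 0 uses appearance only, later rounds add context."""
    models = [fit_logistic(appearance, labels)]
    prob = predict_proba(appearance, models[0])
    for _ in range(rounds - 1):
        X = np.hstack([appearance, context_features(prob)])
        w = fit_logistic(X, labels)
        models.append(w)
        prob = predict_proba(X, w)
    return models, prob


# Toy data: piecewise-constant labels observed through a noisy feature.
rng = np.random.default_rng(0)
labels = np.repeat([0.0, 1.0, 0.0, 1.0], 50)
appearance = (labels + rng.normal(0.0, 0.8, labels.size))[:, None]
models, prob = auto_context(appearance, labels, rounds=3)
print(len(models), prob.shape)
```

In a realistic setting the cascade would be trained on annotated training images and the stored models applied in the same order to a test image.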

1.2 Thesis Focus and Contributions

The topic of this thesis is semantic object segmentation. Although it has been shown that image classification, object detection and semantic segmentation are highly correlated, most existing contextualization is between detection and classification. The current research gaps in context-based segmentation are as follows:

• Most of the current methods are still based on local features, without considering the contextual information around the regions or at a higher scale level. Usually the relations between the object of interest and other regions (e.g. background) or the higher-level image are not well modeled.

• Detection provides very important top-down guidance for segmentation but is not fully utilized.

• Since segmentation is pixel-wise labeling, the annotation burden is much heavier than for classification and detection; thus how to train a good enough model based on limited training data is a great challenge. Furthermore, most of the benchmarks, e.g. [1], contain large portions of unlabeled background regions without explicit labeling; how to exploit such clutter to boost the segmentation of the object of interest is also very important.

To narrow the above research gaps, this thesis proposes a context-based object segmentation system, aiming to investigate the effects of different contexts, especially detection and unlabeled background regions. The main contributions of the thesis are as follows:

• Detection-based sparse reconstruction: we first present one of the main contributions of the thesis, a unified framework for detection-based segmentation via coupled global and local sparse representations. Chapter 2 begins with the formulation of this coupled sparse reconstruction framework based on the predicted bounding boxes from object detectors. We then provide an efficient alternating optimization scheme to jointly pursue the optimal latent mask and the sparse reconstruction coefficients, together with some post-processing techniques to generate the final segmentation result. The chapter concludes with extensive results on various benchmark datasets, such as the PASCAL VOC segmentation dataset [1] and the Weizmann Horses Dataset [2], with detailed comparison with the state-of-the-art methods.

• Segmentation without annotating segments: segment annotations are usually much more difficult to obtain than object bounding boxes. Motivated by this, after the supervised segmentation framework based on sparse reconstruction, we propose in Chapter 3 another detection-based segmentation approach that requires no additional segmentation annotation from either the training set or user interaction. We first introduce a shape-guided Markov Random Field (MRF) model, and then describe how to obtain such shape guidance from a set of object segment hypotheses by a simple voting scheme. The derived shape guidance is then applied in the subsequent graph-cut-based figure-ground segmentation. The final segmentation result is obtained by merging the segmentation results in the bounding boxes.

• Background context augmented hypothesis graph: besides global classification and detection, we further explore how to obtain contextual cues from the unlabeled background regions that are usually ignored. We first introduce a fully connected CRF model that allows the integration of different contextual cues. The CRF model is considered over a set of overlapping CPMC object hypotheses [56]. The background contextual cues are then learned in a weakly-supervised manner and applied in the unary term of the corresponding foreground regions. The final segmentation result is obtained via maximum-a-posteriori (MAP) inference, where the segments are merged in a sequential aggregation manner.
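As an illustration of the sparse reconstruction idea behind the first contribution above, the sketch below solves a generic l1-regularised least-squares (Lasso) problem with an accelerated proximal gradient iteration (FISTA). It is not the coupled global and local formulation of Chapter 2: the dictionary of flattened training masks, the function names and the value of `lam` are assumptions made purely for the example.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_apg(D, y, lam, iters=200):
    """Minimise 0.5 * ||y - D a||^2 + lam * ||a||_1 with FISTA."""
    L = np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    z, t = a.copy(), 1.0
    for _ in range(iters):
        grad = D.T @ (D @ z - y)       # gradient of the smooth term at z
        a_next = soft_threshold(z - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = a_next + ((t - 1.0) / t_next) * (a_next - a)   # momentum step
        a, t = a_next, t_next
    return a


# Toy dictionary: each column is a flattened 2x2 training mask (made-up data).
D = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)
y = D[:, 0]                            # query mask equal to the first atom
coeffs = lasso_apg(D, y, lam=0.01)
print(np.round(coeffs, 3))
```

The l1 penalty drives most coefficients to zero, so the query is explained by a few training samples only; this is the property that makes sparse coding attractive for selecting related masks.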

1.3 Organization of the thesis

The remaining parts of the thesis are organized as follows: Chapter 2 introduces the detection-based segmentation methods via optimal sparse reconstruction, Chapter 3 presents the learning-free pipeline of segmentation without annotating segments, while Chapter 4 introduces an integrative hypothesis graph augmented by background context and other contextual information. Finally, Chapter 5 discusses some promising future directions and concludes the thesis.

1.3.1 Relevant Publications

The main material in this thesis comes from journal articles and conference proceedings. The relevant publications of each chapter are listed as follows:

Chapter 2:

• Wei Xia, Zheng Song, Jiashi Feng, Shuicheng Yan, Loong Fah Cheong. Segmentation over Detection by Coupled Global and Local Sparse Representations. In ECCV, Firenze, Italy, Oct 7-13, 2012 [83].

• Wei Xia, Csaba Domokos, Loong Fah Cheong, Shuicheng Yan. Segmentation over Detection by Optimal Sparse Representations. In IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2014 [84].

Chapter 3:

• Wei Xia, Csaba Domokos, Jian Dong, Loong Fah Cheong, Shuicheng Yan. Semantic Segmentation without Annotating Segments. In ICCV, 2013 [85].

Chapter 4:

• Wei Xia, Csaba Domokos, Loong Fah Cheong, Shuicheng Yan. Background Context Augmented Hypothesis Graph for Object Segmentation. In IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2014 [86].

Chapter 2

Segmentation over Detection via Optimal Sparse Reconstructions

In this chapter, we address the problem of semantic segmentation, where the possible class labels are from a pre-defined set. We exploit top-down guidance, i.e. the coarse localization of the objects and their class labels, provided by object detectors. For each detected bounding box, figure-ground segmentation is performed, and the final result is achieved by merging the figure-ground segmentations. The main idea of the proposed approach is to reformulate the figure-ground segmentation problem as sparse reconstruction, pursuing the object mask in a non-parametric manner. The latent segmentation mask should be coherent subject to sparse error caused by intra-category diversity; thus the object masks are inferred by making use of sparse representations over the training set. In order to handle local spatial deformations, local patch-level masks are also considered and inferred by sparse representations over the spatially nearby patches. The sparse reconstruction coefficients and the latent mask are alternately optimized by applying the Lasso algorithm and the Accelerated Proximal Gradient method. The proposed formulation results in a convex optimization problem, thus the globally optimal solution is achieved. In this chapter we provide a theoretical analysis of the convergence and optimality. We also give an extended numerical analysis of the proposed algorithm and a comprehensive comparison with related semantic segmentation methods on the challenging PASCAL VOC object segmentation datasets and the Weizmann Horses Dataset. The experimental results demonstrate that the proposed algorithm achieves competitive performance compared to the state of the art.

2.1 Introduction

Localizing and recognizing objects efficiently in a complex visual scene is one of the amazing capabilities of the human cognitive system. Investigating how computers can emulate such capabilities has therefore attracted a lot of interest in the fields of computer vision and machine learning [11, 66, 81, 87]. The core sub-tasks of this area are object classification, detection and segmentation [10, 11, 87]. Classification tells whether an image contains a certain object or not, detection localizes the object by providing its bounding box, while segmentation aims to assign a class label to each pixel. A special case of segmentation is called semantic segmentation, where the possible class labels are from a pre-defined set. In this chapter we focus on semantic segmentation.

Recent segmentation methods mainly fall into two groups. The first one involves bottom-up methods, which first extract various low- and mid-level image features (e.g. texton, SIFT, HOG) and try to find homogeneous segments based on those image cues [11]. A common approach is to construct a graphical representation of the problem [7], where the nodes represent pixels or super-pixels and the edges connect neighbouring nodes, and then the graph is partitioned into several sub-graphs corresponding to different object regions. In general, bottom-up approaches provide visually coherent segments, but the main difficulty is usually due to ill-posed local information. The other group of segmentation methods includes top-down approaches, which generally rely on acquired class-specific information, e.g. class label, object bounding box [60, 62] and shape model [63]. For top-down methods the main challenge is to obtain the object templates, especially for objects with relatively large intra-class appearance and pose variations.

Both bottom-up and top-down approaches suffer from some intrinsic drawbacks, thus some methods [12, 16, 72] utilize a mixture of these approaches. Ladický et al. [72] proposed a multilevel hierarchical Conditional Random Field (CRF) model to incorporate information from different scales, which is combined with top-down detectors and global occurrence information [12] to guide the graphical inference. Boix et al. [16] introduced the so-called “harmony potential”, which integrates global category label information as well as object detectors in order to better fuse global and local information. Although CRF-based models have a strong generalization ability to integrate different cues, the modelling and training of these kinds of methods are relatively difficult due to the expensive estimation of the partition function.

2.1.1 Motivation and Contributions

Due to the intrinsic drawbacks of purely bottom-up methods, we aim to exploit top-down guidance as well. Motivated by the rising performance of object detectors, detection is performed in many applications, which gives the coarse localization of the object by a bounding box. Thus object detectors [35, 36] provide useful top-down guidance. However, they lack the accuracy to precisely identify the object at pixel level. Intuitively, figure-ground segmentation can be performed for each detected bounding box.
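The step from per-box figure-ground masks to a single labelled image can be sketched as follows. This is a minimal illustration, not the merging rule used later in the chapter: it assumes that each detection carries a score, a class id, a box in pixel coordinates and a boolean mask of the box size, and it resolves overlaps by letting higher-scoring detections overwrite lower-scoring ones. The function name and the tuple layout are hypothetical.

```python
import numpy as np


def merge_box_masks(image_shape, detections):
    """Paste per-box figure-ground masks into one label map.

    detections: list of (score, class_id, (x0, y0, x1, y1), mask), where
    mask is a boolean array of shape (y1 - y0, x1 - x0).
    Label 0 denotes background. Detections are pasted in order of
    increasing score, so the most confident one wins on overlaps.
    """
    labels = np.zeros(image_shape, dtype=np.int32)
    for score, class_id, (x0, y0, x1, y1), mask in sorted(
        detections, key=lambda det: det[0]
    ):
        region = labels[y0:y1, x0:x1]   # view into the label map
        region[mask] = class_id         # writes through to `labels`
    return labels


# Two overlapping 2x2 detections on a 4x4 image (made-up values).
full = np.ones((2, 2), dtype=bool)
dets = [
    (0.9, 1, (0, 0, 2, 2), full),
    (0.5, 2, (1, 1, 3, 3), full),
]
label_map = merge_box_masks((4, 4), dets)
print(label_map)
```

Other overlap rules (for example per-pixel comparison of foreground confidence) would fit the same interface; score ordering is chosen here only because it is the simplest to state.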

Sparse representation has had a significant impact on several computer vision problems, for example face recognition [88], image classification and segmentation [28, 67], as well as motion and data segmentation [89]. Furthermore, it has previously been verified [90] that sparse coding provides better results in finding related samples for a given image than other reconstruction methods. In practice, due to the large variance of poses, colors and texture, the training dataset should be very large in order to handle such variations. However, in our case, the object bounding box is available and it is normalized, hence the variance in scales and positions is alleviated to some extent. Motivated by the efficiency of sparse representation, the figure-ground segmentation problem is reformulated as sparse reconstruction pursuing the latent object mask in a non-parametric manner. In this chapter, we
