Context-based Visual Object Segmentation
Wei Xia
(B.Eng., Huazhong University of Science and Technology)
A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF
PHILOSOPHY
Department of Electrical and Computer Engineering
National University of Singapore
2014
Throughout the four years of my PhD study, there have been many people to thank for the help and support they have provided. First and foremost, I would like to express my great gratitude to my two supervisors, Prof. Loong Fah Cheong and Prof. Shuicheng Yan. Specifically, in the first semester, Prof. Cheong helped me lay a solid theoretical foundation and find my research interest among a wide range of topics in the field of computer vision and machine learning. Then, under the patient guidance of Prof. Yan, I managed to finish some work in object semantic segmentation, which forms the main body of the thesis. I enjoyed working with them; their passion and professionalism in research, dedication to detail, complete commitment and great personality have significantly inspired me and will keep benefiting me in my future life.

Then I would like to express my thanks to my seniors Ju Sun and Jiashi Feng for their patient guidance when I was struggling at the beginning of my PhD study. Special thanks also go to Dr. Csaba Domokos for his great professionalism and perfectionism, helping me win the PASCAL VOC Challenge and publish top-tier papers. I learned a lot from the great experience of collaborating with him. I also want to thank Jian Dong and Junshi Huang, who are both my room-mates and lab-mates and provided me a lot of help in both academic work and life. I will always remember the days when we discussed till late at night. Besides, I met a lot of great friends here in the Learning and Vision Lab: Qiang Chen, Zheng Song, Luoqi Liu, Min Lin, Si Liu, Mengdi Xu, etc. Furthermore, I would like to express my sincere gratitude to Mr. Zhongyang Huang, for providing me the opportunity of an internship in Panasonic Singapore Laboratory. Under his guidance, I learned a lot about industrial research from the interesting projects we have done together.

Last but not least, I want to thank my parents for their everlasting support and care. Finally, I want to express my appreciation to my wife, Chong Chen. Without her love, companionship and encouragement during the difficult times, I would not have been able to achieve this goal along this long journey of PhD study. This thesis is dedicated to her.
1.1 Historical Background 18
1.1.1 Image Classification 19
1.1.2 Object Detection 20
1.1.3 Image Segmentation 21
1.1.4 Semantic Segmentation 22
Bottom-up Approaches 22
Top-down Approaches 23
Integrative Approaches 24
1.1.5 Obtaining Contextual Information 25
1.2 Thesis Focus and Contributions 25
1.3 Organization of the thesis 27
1.3.1 Relevant Publications 28
2 Segmentation over Detection via Optimal Sparse Reconstructions 29
2.1 Introduction 30
2.1.1 Motivation and Contributions 31
2.1.2 Related Work 32
2.2 Proposed Solution 34
2.2.1 Figure-ground Segmentation 35
2.2.2 Coupled Global and Local Reconstruction 38
2.3 Optimization Procedure 40
2.3.1 Optimization with respect to $x$ and $\tilde{x}_1, \ldots, \tilde{x}_r$ 40
Sub-problem 1: $x$ 40
Sub-problem 2: $\tilde{x}_1, \ldots, \tilde{x}_r$ 41
2.3.2 Optimization with respect to $m$ 41
2.4 Numerical Implementation 45
2.5 Experiments 48
2.5.1 Convergence Analysis 49
2.5.2 Effects of the Size of Local Patches and Super-pixels 50
2.5.3 Effects of the Detection Results 51
2.5.4 Parameter Estimation 52
2.5.5 Proof-of-Concept Experiments 53
2.5.6 Comparison on the PASCAL VOC Datasets 54
Performance gain from mask refinement 55
Comparison on VOC’10, VOC’11 and VOC’12 56
2.5.7 Comparison on the Weizmann-Horses Dataset 59
2.6 Chapter Summary 62
3 Semantic Segmentation without Annotating Segments 64
3.1 Introduction 64
3.2 Related Work 66
3.3 Proposed Solution 68
3.3.1 Bounding Box Score Normalization 68
3.3.2 Object Shape Guidance Estimation 69
3.3.3 Graph-cut Based Segmentation 71
3.3.4 Merging and Post-processing 74
3.4 Experimental Results 74
3.4.1 Proof of the Concept 75
3.4.2 Comparison with the State-of-the-arts 77
3.5 Chapter Summary 82
4 Background Context Augmented Hypothesis Graph for Object
4.1 Introduction 85
4.2 Related Work 87
4.3 Proposed Solution 91
4.3.1 CRF-based Formulation 91
4.3.2 Background Context Modeling 93
4.3.3 Merging and Post-processing 95
4.4 Implementation Details 97
4.5 Experimental Results 99
4.5.1 Proof-of-concept Experiments 99
Effects of the sub-category number t 100
Effects of the post-processing parameters 100
Effects of the CRF model 102
Effects of different contextual cues 102
4.5.2 Comparison with the State-of-the-arts 105
VOC 2012 105
MSRC-21 dataset 108
4.6 Chapter Summary 110
5 Conclusion and Future Work 112
5.1 Thesis Conclusion 112
5.2 Discussion of the Future Directions 114
Visual object recognition is one of the most fundamental problems in artificial intelligence, and is mainly divided into three tasks: object classification, object detection and object segmentation. Classification tells what object the image contains; detection predicts the bounding box location of the object; segmentation assigns a category label from a predefined label set to every pixel in the image. In this thesis, we aim to solve the problem of object segmentation. The three tasks have been shown to be significantly correlated, so that both classification and detection can provide useful contextual information to guide the segmentation process. We first proposed a detection-based method that formulates the segmentation task as pursuing the optimal latent mask in a nonparametric manner inside the predicted bounding box via sparse reconstruction of the ground-truth masks over the training set. By taking into account both the global and local constraints, a coupled convex optimization framework is proposed. By alternately optimizing the sparse reconstruction coefficients and the latent optimal mask using Lasso and Accelerated Proximal Gradient methods, a globally optimal solution can be achieved.

Furthermore, ground-truth segment annotation is generally very difficult to obtain, while object bounding boxes can be obtained much more easily. We therefore proposed a segmentation approach based on detected bounding boxes without any additional segment annotation from either the training set or user interaction. Based on a set of segment hypotheses, a simple voting scheme is introduced to estimate the shape guidance for each bounding box. The derived shape guidance is used in the subsequent graph-cut-based figure-ground segmentation, and the final segmentation result is obtained by merging the segmentation results in the bounding boxes.

Finally, inspired by the significant role of context information, besides global classification and detection, we explore the contextual cues from the unlabeled background regions that are usually ignored. A fully connected CRF model is considered over a set of overlapping hypotheses from CPMC, and the background contextual cues are learned from the unlabeled background regions and applied in the unary terms of the corresponding foreground regions. The final segmentation result is obtained via maximum-a-posteriori (MAP) inference, where the segments are merged in a sequential aggregation manner. Note that the proposed model has strong generalization ability: other contextual cues such as global classification and detection can be easily integrated into the model to further boost the performance. In order to evaluate the effectiveness of the proposed algorithms, extensive experiments are conducted on various benchmark datasets, ranging from the challenging PASCAL VOC to the Weizmann Horse dataset, GrabCut-50, MSRC-21, etc. The proposed approaches achieve new state-of-the-art performance. Based on the above methods, we won the winner prize in the segmentation competition of the PASCAL VOC Challenge 2012.
List of Tables
2.1 List of notations 35
Dataset [1] by changing the size of the local patches and super-pixels 51
Dataset [1] based on different object detectors. BA is the baseline algorithm based on coarse masks, while Proposed is our sparse reconstruction framework 51
2.4 Study of the effects of different algorithm parts of the proposed method on the VOC’10 trainVal dataset [1] in IoU accuracy defined in (2.14) 52
defined in (2.14) provided by the previous methods on the VOC’07, VOC’10, VOC’11 and VOC’12 test datasets [1]. Note that the methods marked with * use extra annotation to train the model, while all other methods are trained by making use of the VOC annotations only 55
2.6 Comparison with the state-of-the-art methods on the Weizmann-Horses Dataset [2]. Accuracy is measured by the percentage of correctly labeled pixels 60
2.7 Statistics of the segmentation accuracy (δ) obtained from the pro-
on VOC 2011 test dataset [1] 75
3.2 Comparison of segmentation accuracy provided by previous methods on VOC 2012 test dataset [1] 75
GrabCut-50 dataset 82
4.1 The effects of different pooling methods with various numbers of categories (t) for BC modeling in terms of the average IoU accuracy,
4.2 The effects of different CRF models in terms of the average IoU accuracy, defined in (4.9), on the PASCAL VOC 2012 TrainVal dataset [1] 102
4.3 State-of-the-art comparison in terms of IoU accuracy, defined in (4.9), obtained on the PASCAL VOC 2012 Test dataset [1]. The results in parentheses are evaluated on an extended training dataset containing extra annotation provided in [3] 104
per-class accuracy, defined in (4.10). The Prop-WeaklySup method uses no annotation data from the 6 background categories and is evaluated only on the 15 foreground objects, while the Prop-FullySup method uses the annotation data across all the 21 categories. FgAvg is the average per-class performance of the 15 foreground objects, while FullAvg is the average performance of all the 21 categories 108
given test image, the object bounding boxes with their class labels are predicted. They are then cropped and figure-ground segmentations are calculated through a coupled sparse reconstruction framework by making use of a set of training images. The final result is obtained by merging the results of the figure-ground segmentation followed by post-processing 34
reconstruction. For a given bounding box predicted by the object detector in the test image, all training images for the same category are cropped and normalized, and then the segmentation mask is obtained through a sparse image and mask reconstruction framework. For the local part reconstruction, the bounding box and the training masks are first partitioned into small regular patches and those patches are reconstructed locally to handle local distortions 36
2.3 An example for the reconstruction error by applying the $\ell_1$ (b) and $\ell_2$ (c) norm in the mask reconstruction. The ground truth and the reconstructed image are put in the red and green channel, respectively. The overlapping area (i.e. the region of correct reconstruction) is shown in yellow. δ stands for the absolute difference of the ground truth and the reconstructed image 38
initializations of the picture sheep presented on Fig. 2.7. The small variance of the accuracy demonstrates that the global optimal solution is obtained from the proposed optimization framework regardless of the initialization 49
performance (IoU, defined in (2.14)) for a rigid category, bus (top) and an articulated category, human (bottom). The x-axis is in logarithmic scale 52
dataset [1] obtained by our baseline (BA), the coupled global and local reconstruction framework (BA+GL) and the Full model (BA+GL+LO) 54
(Better viewed in color) The first four rows show results of images containing single and multiple objects from the same category, while the last two rows show results of more complicated images containing multiple interacting objects. The results are overlaid on the images with white boundaries and different colors corresponding to different categories as specified in VOC [1] 57
2.9 Some failure examples on the VOC’12 dataset [1] due to wrong reconstruction, mis-detection (e.g. the puppet is mis-detected as human) as well as wrong labeling (e.g. the dog is wrongly labeled as cat) 59
2.10 Some exemplar results, overlaid on the images with yellow color and white boundaries, on the Weizmann-Horses Dataset [2] obtained from
pixels 61
3.2 Overview of the proposed approach. First, the object bounding boxes with detection scores are extracted from the test image. Then, a voting based scheme is applied to estimate object shape guidance. By making use of the shape guidance, a graph-cut-based figure-ground segmentation provides a mask for each bounding box. Finally, these
3.4 Effect of the distortion of bounding box 76
pose, cluttered background and occlusion 77
our proposed method (DET3-Proposed). The results are overlaid on the images with white boundaries, and different colors correspond to different categories. (Best viewed in color.) 78
The results are overlaid on the images with white boundaries, and different colors correspond to different categories. (Best viewed in color.) The first image is due to mis-detection of the small horse. The second one is due to wrong bounding box prediction, since the cloth is labelled as person and the parrot (bird) is mis-detected. The third one is due to inaccurate bounding box prediction (i.e. wrong label for the bottle), which resulted in inaccurate estimation in the graph-cut formulation 80
3.8 Some segmentation results, overlaid on the images with blue color and white boundary, on the GrabCut-50 dataset [5] obtained by the proposed method 82
4.1 Illustration of the role of background context information (e.g. sky or indoor). In many cases it can help recognize the objects (e.g. the bird instead of boat or the potted plant instead of tree) 87
4.2 Overview of the proposed solution. First, a pool of object hypotheses is generated. A fully connected hypothesis graph is then built to model the relationship between the possibly overlapping segments. A novel background contextual cue is predicted for the segments via sub-category classifiers. The scores are fed into the CRF model together with other cues like image classification and object detection. Finally, the coarse segmentations obtained via MAP inference are merged and post-processed to achieve the final segmentation result 88
4.3 Exemplar sub-category clusters for the horse category from the PASCAL VOC 2012 TrainVal dataset [1]. Each row shows some images with a certain sub-category. It is observed that each cluster shares significant consistency among both the foreground horse objects and the background regions 93
4.4 Illustration of the effects of post-processing parameters: τ1, τ2 (Top)
(4.9), on the PASCAL VOC 2012 TrainVal dataset [1] 101
TrainVal dataset [1] after separately applying supervised cues (i.e. CLS and DET cues), the weakly-supervised BC cues, and their combination referred to as the Full model. The results obtained by using the unary term exclusively are also shown. The baseline accuracy is shown in parentheses 103
4.6 Qualitative illustration for the effects of different contextual cues, like image classification (CLS), object detection (DET) and the background context (BC). When all the three cues are applied, we refer to it as the Full model. The results corresponding to the last column are obtained by using the unary terms exclusively in the CRF model. The results are overlaid with white boundaries and different categories are highlighted with different colors 103
PASCAL VOC 2012 Test dataset [1] 105
The results are overlaid with white boundaries and different categories are highlighted with different colors. Some failure cases due to wrong labelling and/or missed prediction are shown in the last column. For instance, the dog body is wrongly labeled as cat in the first row; the cloth is labelled as human due to the low scores. The bird is missed, since a large portion of it is occluded. The bus and the bottle in the last two rows are heavily occluded and the contrast is also too low 107
MSRC-21 dataset [4]. The results are overlaid with white boundaries and different categories are highlighted with different colors. The last column shows the failure cases. In the first one the face is mis-labeled, while in the second one the face is wrongly labeled as book 110
Chapter 1
Introduction
Artificial Intelligence (AI), defined as the science and engineering of making intelligent machines that, like humans, can perceive the surrounding environment and take appropriate actions, has drawn tremendous interest among researchers since the invention of the computer. The AI field is interdisciplinary: a number of sciences converge in it, including computer science, psychology, linguistics, philosophy and neuroscience. The core problems of AI research include perception, reasoning, knowledge, planning, learning, natural language processing (communication) and the ability to move and manipulate objects.

To enable an intelligent machine system, perception is among the few important problems that need to be solved first. Machine perception, meaning the ability to use input from sensors (e.g. cameras, microphones, sonar and other more exotic ones) to gain knowledge of the world, is mainly divided into sub-fields such as computer vision and speech recognition. Computer vision, the ability to analyze and understand visual input, is a very important branch of machine perception, since visual information accounts for more than 70% of all the information humans perceive. One key sub-field of computer vision is visual recognition, which helps machines recognize and localize important concepts in the world. This thesis mainly focuses on different tasks of visual recognition.
It is observed that humans, even three-year-old children, perform extremely well in most vision tasks such as face recognition and object recognition; how to emulate such capabilities by computer has therefore been the focus of many visual recognition researchers. Visual recognition is usually divided into three sub-tasks: classification, detection and segmentation (see Fig. 1.1). As shown in Fig. 1.1, given a test image, object classification predicts the presence or absence of an example from a certain category within the image; object detection predicts not only the category labels of the objects but also their localization in the form of bounding boxes; object segmentation goes further by distinguishing different objects at the pixel level, generating pixel-wise segmentations that give the class of the object visible at each pixel.

Figure 1.1: Different sub-fields of visual recognition. Classification predicts the object category labels at image level, detection localizes the objects by bounding boxes, while segmentation assigns to every pixel the object category it belongs to.
In this thesis, we aim to solve the semantic segmentation problem. Strictly speaking, object segmentation is not a well-posed problem, since the definition of an object is not precise. For example, a forest could be seen as a single object, but individual trees can be localized and recognized with ease when viewed from a closer distance. Likewise, a shirt can be considered either an individual object or part of the human body. Therefore, it is necessary to pre-define a set of possible category labels to be assigned to every pixel. In general, semantic object segmentation is a very challenging problem, mainly due to variations in pose, scale, position and illumination, partial occlusion and large intra-class variety.
Previous methods often followed the pipeline of over-segmentation [6–8], region representation [9, 10], and region labeling inference [11, 12]. A test image is first over-segmented into coherent regions such as super-pixels [6, 8]; then local or mid-level features such as SIFT [13], HOG [14] and LBP [15] are extracted to represent the regions; finally, all of the features are fed into classification models such as SVM or CRF for the final labeling inference [11, 16]. Though such a pipeline has achieved significant progress in the past decades, the representational power of the local features is often limited without considering higher-level context information. It has been shown that classification, detection and segmentation are essentially the same problem at different scales, from image-level classification, to mid-level detection, to pixel-level segmentation; they are therefore strongly correlated, and each of them can provide useful contextual information for the others. For example, the classification results provide image-level labels that narrow down the labeling space of the local regions, while detection provides localization information that enforces spatial constraints on neighboring local regions. Furthermore, segmentation results that are aligned with the exact object boundaries can be directly converted to tight bounding boxes to refine the detection results, or provide better spatial support for feature pooling in order to boost the performance of image classification.
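As a small illustration of the last point, a binary segmentation mask can be converted to its tight bounding box in a few lines. The following is only a minimal sketch: the function name `mask_to_bbox` and the (x_min, y_min, x_max, y_max) convention are assumptions made for this example, not notation used elsewhere in the thesis.

```python
import numpy as np

def mask_to_bbox(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around the
    foreground pixels of a 2-D boolean mask, or None if the mask is empty.
    The name and the box convention are illustrative assumptions."""
    rows = np.any(mask, axis=1)   # rows containing at least one foreground pixel
    cols = np.any(mask, axis=0)   # columns containing at least one foreground pixel
    if not rows.any():
        return None
    y_idx = np.flatnonzero(rows)
    x_idx = np.flatnonzero(cols)
    return int(x_idx[0]), int(y_idx[0]), int(x_idx[-1]), int(y_idx[-1])

# Hypothetical usage: a 6x8 mask whose object occupies rows 2..4, columns 3..6.
example_mask = np.zeros((6, 8), dtype=bool)
example_mask[2:5, 3:7] = True
example_box = mask_to_bbox(example_mask)
```

The coordinates are inclusive pixel indices; a detector that uses exclusive upper bounds would need the last two values incremented by one.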
In this thesis, we mainly focus on the effects of various kinds of contextual information on visual object segmentation problems. We explore various ways to obtain and utilize this contextual information, including global image classification, object segmentation, inter-object occurrence and object-background relationships. Before we delve into the details, we first give a comprehensive review of the historical background of the related work.
1.1 Historical Background
In this section, we first briefly review image classification and detection, then present the historical background of image segmentation and semantic segmentation, and finally introduce contextualization, including various methods to obtain and utilize context information.
1.1.1 Image Classification
Image classification methods mainly fall into two categories: bag-of-words (BoW) based [10, 17–19] and deep learning based [20–23]. Traditional BoW models usually follow a standard pipeline, i.e., feature extraction, feature coding, feature pooling, and classification. For feature extraction, various low-level features including SIFT [13], HOG [14] and LBP [15] are extracted either on a dense grid or from sparse interest points. Then different coding schemes such as Vector Quantization (VQ) [24], Sparse Coding [25], Locality-constrained Linear Coding (LLC) [26] and Fisher Kernel [27] are applied to code the features into a consistent representation, followed by pooling techniques such as Spatial Pyramid Matching (SPM) [10], ScSPM [28] and Generalized Hierarchical Matching (GHM) [18] to generate the final global image representation. Finally, for classification, conventional models including random forests [29] and SVM [30] are utilized to classify the global features into different categories. Recently, beyond modeling the region features alone, context information such as the spatial location of the object and the background scene has been integrated to improve the classification accuracy, as in [17, 18]. Although traditional BoW based methods have achieved great progress on different benchmarks like [1] and many papers try to improve this pipeline, the hand-crafted features involved require great skill to design and may not be optimal for different tasks, which significantly limits the generalization of such models across different datasets.

In contrast to hand-crafted features, features learnt by deep neural networks have shown great potential in various visual recognition tasks, especially classification. Deep learning methods try to extract high-level abstractions of visual data through multiple non-linear transformations, such as convolution and max-pooling in the state-of-the-art Convolutional Neural Network (CNN). It has demonstrated extraordinary power on many datasets such as ImageNet [31]. Furthermore, it has been shown that CNN models pre-trained on a large dataset with enough data diversity and category distribution can be transferred to extract features for other image datasets that may lack sufficient training data [21–23, 32].
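The coding and pooling steps of the BoW pipeline reviewed in this section can be sketched as follows. This is a minimal illustration under simplifying assumptions: hard vector quantization against a given codebook followed by sum pooling into a normalized histogram. The function name `bow_histogram` and the toy data are hypothetical, and the codebook would in practice be learnt beforehand (e.g. by clustering training descriptors).

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Hard-assign each local descriptor (rows of an (N, D) array) to its
    nearest codeword (rows of a (K, D) array) and pool the assignments
    into an L1-normalized K-bin histogram. Illustrative sketch only."""
    # Squared Euclidean distance between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)                       # vector quantization
    hist = np.bincount(assignments, minlength=codebook.shape[0]).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist            # pooling + normalization

# Hypothetical toy data: two codewords, four 2-D descriptors.
toy_codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_descriptors = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [9.0, 9.0]])
toy_hist = bow_histogram(toy_descriptors, toy_codebook)
```

The resulting histogram is the global image representation that would be passed to a classifier such as an SVM; spatial pyramid variants apply the same pooling inside each spatial cell and concatenate the results.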
1.1.2 Object Detection
Similar to image classification, mainstream object detection methods also fall into two categories: sliding window based [14, 33–36] and deep learning based [37–39]. Since the object of interest can be found at any location with different poses in the image, sliding window search has been the dominant paradigm for quite a long time. By exhaustively searching sliding windows at different locations, scales and aspect ratios, the detection problem becomes an image classification problem that tries to determine whether the window contains the object of interest or not. Dalal et al. [14] first proposed an efficient pedestrian detection algorithm based on the HOG feature. Felzenszwalb et al. [33, 34] extended it to more complicated categories by introducing the Deformable Part-based Model (DPM), which models the relationship between different object parts. The DPM model has been the state of the art for a long time on many challenging benchmarks, e.g. [1]. Many of the models [35, 36] followed a similar pipeline. However, the performance of such methods has plateaued in the past few years. One possible reason is that only the information inside the bounding box is modeled, without taking the surrounding context information into consideration; another is the limited representation power of hand-crafted features like HOG. For instance, HOG can model rigid objects with large edge contrast well, but is not very accurate for articulated objects with textures.
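The exhaustive search over locations, scales and aspect ratios described above can be sketched with a simple window generator. This is a hedged illustration: the function name `sliding_windows`, its parameters and their default values are assumptions chosen for the example and do not correspond to the settings of any cited detector.

```python
def sliding_windows(img_w, img_h, base=64, scales=(1.0, 2.0),
                    aspect_ratios=(0.5, 1.0, 2.0), stride_frac=0.5):
    """Yield candidate windows (x, y, w, h) that lie fully inside an
    img_w x img_h image. All parameter names are illustrative assumptions."""
    for s in scales:
        for ar in aspect_ratios:
            # Keep the window area close to (base * s)^2 while varying w / h.
            w = int(round(base * s * ar ** 0.5))
            h = int(round(base * s / ar ** 0.5))
            if w > img_w or h > img_h:
                continue
            step_x = max(1, int(w * stride_frac))
            step_y = max(1, int(h * stride_frac))
            for y in range(0, img_h - h + 1, step_y):
                for x in range(0, img_w - w + 1, step_x):
                    yield (x, y, w, h)

# Hypothetical usage: square 64x64 windows with 50% overlap on a 128x128 image.
example_windows = list(sliding_windows(128, 128, base=64, scales=(1.0,),
                                       aspect_ratios=(1.0,), stride_frac=0.5))
```

Each generated window would then be scored by a classifier (for instance, a linear model on HOG features), which is what turns detection into repeated image classification.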
Motivated by the impressive achievements of deep learning features in image classification tasks, many researchers have tried to extend deep neural networks to object detection [37–39]. Szegedy et al. [37] made the first step by formulating object detection as a CNN-based regression problem to object bounding boxes. Later, Erhan et al. [38] proposed a saliency-inspired neural network model for detection, in which a set of class-agnostic bounding boxes is predicted along with confidence scores. The model can handle a variety of instances for each class and allows for cross-class generalization at the highest levels of the network. However, the performance of the above models is only comparable to that of the state-of-the-art DPM based models. Girshick et al. [39] applied a proposal based CNN framework pre-trained on the ImageNet classification dataset [31] to extract CNN features from a large set of object proposals [40], used a linear SVM to classify them, and then applied bounding box regression to refine the final detection results. The method achieved significantly better performance than all previous state-of-the-art methods, demonstrating the potential of such CNN-based methods.
1.1.3 Image Segmentation
Image segmentation, which aims to partition an image into several coherent regions, has a very long history. One of the first methods was published more than 40 years ago [41]. Starting from a seed patch, it iteratively merges small patches with similar gray-level statistics until none of the neighboring patches is sufficiently similar to the region currently being merged. This method is quite intuitive in that it takes advantage of one of the fundamental grouping heuristics, namely that neighboring pixels with different colors tend to belong to different objects; however, it cannot handle complicated patterns such as texture. Afterwards, Mean Shift [42], Normalized Cuts [43] and the graph-based segmentation method [8] were proposed to generate better segmentation results. Such methods tend to generate a single optimal segmentation that covers the whole image in a non-overlapping manner, which is quite difficult due to the ambiguity of the low- and mid-level cues.
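The seed-based merging idea described at the start of this section can be illustrated with a simplified region-growing sketch. Note that this is not the algorithm of [41]: it works on single pixels rather than patches and uses only the running mean gray level as the similarity statistic. The function name `grow_region` and the tolerance parameter `tol` are assumptions made for this example.

```python
from collections import deque
import numpy as np

def grow_region(image, seed, tol=10.0):
    """Grow a region from `seed` (row, col) on a 2-D gray-level image by
    repeatedly absorbing 4-connected neighbors whose gray level is within
    `tol` of the current region mean. Simplified, pixel-level sketch."""
    h, w = image.shape
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    total, count = float(image[seed]), 1      # running sum and size of the region
    frontier = deque([seed])
    while frontier:
        y, x = frontier.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not region[ny, nx]:
                if abs(float(image[ny, nx]) - total / count) <= tol:
                    region[ny, nx] = True
                    total += float(image[ny, nx])
                    count += 1
                    frontier.append((ny, nx))
    return region

# Hypothetical toy image: dark left half (10) and bright right half (200).
toy_image = np.full((4, 4), 10.0)
toy_image[:, 2:] = 200.0
toy_region = grow_region(toy_image, seed=(0, 0), tol=10.0)
```

Growth stops once no neighboring pixel is sufficiently similar to the region, which mirrors the stopping rule described above and also shows why a purely gray-level criterion fails on textured regions.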
Some other methods generate multiple segmentations from an image, either independently or organized in a hierarchy. Russell et al. [44] proposed to discover objects by computing Normalized Cuts for different numbers of segments and image sizes. Malisiewicz [45] tried to merge pairs and triplets of segments obtained from Mean Shift [42] or Normalized Cuts [43]. Rabinovich et al. [46] selected reoccurring segments, as they are potentially more stable. For hierarchical segmentation, Sharon et al. first generated and combined multiscale measurements of intensity contrast, texture difference and boundary integrity [47], and later [48] proposed a multi-grid normalized cut technique to generate segmentation results at multiple levels of granularity. Arbeláez et al. [49] produced a segment hierarchy, called the ultrametric contour map (UCM), by iteratively merging superpixels based on the learned globalPb boundary detector [50]. The hierarchy is a very robust representation since image contents are intrinsically compositional; however, such a single hierarchy structure is prone to errors, because errors at one level tend to propagate to all the coarser levels.
In order to obtain more spatially coherent results, more statistical information about real-world images should be leveraged. Some salient object detection algorithms [51, 52] generate segmentations based on human attention. Ren et al. [53] used a classifier to distinguish good segmentations that are generated by combining different superpixels. Endres et al. [54] applied a learned affinity measure to group superpixels into more meaningful regions based on a structured learning approach. Levinshtein et al. [55] developed a figure-ground segmentation based on parametric max-flow principles. Similarly, Carreira et al. [56] proposed a non-parametric min-cut algorithm to generate a large set of object hypotheses.
1.1.4 Semantic Segmentation
While image segmentation tries to generate meaningful regions, semantic segmentation further assigns a category label to each region. Generally, semantic segmentation methods are either bottom-up or top-down.
Bottom-up Approaches
Bottom-up approaches extract various low- and mid-level image features (e.g. SIFT [13], HOG [14], texton [57]) and try to find homogeneous segments based on the image cues. A common approach is to use a graphical representation [47, 58], where the nodes represent pixels or super-pixels, and the graph is partitioned into several subgraphs corresponding to different regions. In [11], a large pool of figure-ground hypotheses is generated by solving constrained parametric min-cut (CPMC) [56] problems with various choices of the parameter. The hypotheses are ranked and classified by support vector regressors (SVR) based on their “objectness”. Analogous to average and max-pooling, second order pooling (O2P) is proposed in [9] to encode the second order statistics of local descriptors. By employing this pooling technique, a significant improvement can be achieved. In [59], a composite statistical inference method (CSI) is applied to the original CPMC method. Generally, CPMC-based methods [9, 11, 56] alleviate the problem by exploiting object-level segments that have high overlap with ground truth objects. However, the inter-segment relationships and the segment background information are generally not very well modelled, especially for visually confusing categories. Hence, they still cannot guarantee perfect classification and ranking of the segments. Küttel et al. [60] proposed a figure-ground segmentation framework, in which the training masks are transferred to object windows on the test image based on visual similarity. Then, these masks are used to derive appearance and location information for graph-cut-based minimization. In [61], a similar idea is proposed and a class-independent shape prior is introduced to transfer object shapes from an exemplar database to the test image. This prior information is enforced in a graph-cut formulation to obtain figure-ground segmentation. Generally, bottom-up methods provide visually coherent segments; nevertheless, the main difficulty of these methods is that segmentation from local information alone is usually ill-posed. Therefore, local methods that do not model objects globally tend to generate visually consistent segmentations rather than semantically meaningful ones.
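To make the idea of pooling second order statistics concrete, the sketch below computes the average of the outer products of the local descriptors inside a region, in analogy to first-order average pooling. It is only an assumption-laden illustration of the general idea: any additional mapping or normalization steps used in [9] are not reproduced here, and the function name `second_order_avg_pool` is hypothetical.

```python
import numpy as np

def second_order_avg_pool(descriptors):
    """Pool an (N, D) array of local descriptors into a (D, D) matrix
    (1/N) * sum_i x_i x_i^T, i.e. the average of their outer products.
    Illustrative sketch; not the full method of [9]."""
    x = np.asarray(descriptors, dtype=float)
    return x.T @ x / x.shape[0]

# Hypothetical toy region with two 2-D descriptors.
toy_region_descriptors = np.array([[1.0, 0.0], [0.0, 2.0]])
toy_pooled = second_order_avg_pool(toy_region_descriptors)
first_order = toy_region_descriptors.mean(axis=0)   # ordinary average pooling, for comparison
```

Whereas average pooling summarizes the region by a D-dimensional mean, the second-order descriptor keeps the pairwise interactions between feature dimensions in a symmetric D x D matrix, of which only the upper triangle needs to be stored.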
Top-down Approaches
Top-down approaches generally rely on acquired class-specific information, e.g. class label, object bounding box [60, 62] and shape model [63]. Brox et al. [64] applied so-called poselets to predict masks for numerous parts of an object. The poselets are aligned to the object contours, and then they are aggregated into an object. Arbeláez et al. [65] proposed region-based object detectors that integrate a top-down poselet detector and global appearance cues. This method [65] produces class-specific scores for the regions and aggregates multiple overlapping candidates through pixel classification in order to get the final segmentation results. A unified probabilistic framework based on supervised learning is presented in [66]. For the local appearance model the Fisher kernel is used, and the segmentation process is guided by low-level segmentations which enforce local and global image-level consistency. Another semantic segmentation method was presented in [67], where sparse coding is introduced as a high-level descriptor of the regions, which contributes to less quantization error than the traditional bag-of-words (BoW) method. A shape prior can also be a useful top-down guidance. The work in [68] models a category of shapes as a finite-dimensional manifold, the shape prior manifold, which is approximated from the shape samples using the Laplacian eigenmap technique. Rathi [69] proposed to use a projection method from the shape space to the manifold, while Walder [70] constrained the mapping between the shape space and the lower-dimensional manifold space to be a diffeomorphism, which could ensure that the pairwise distances are preserved. The works in [63, 71] proposed frameworks for object segmentation using a shape prior with graph cut. In the case of top-down methods the main challenge is to obtain the object templates, especially for objects with relatively large intra-class appearance and pose variations.
Ladický et al. [72] proposed a multilevel hierarchical Conditional Random Field (CRF) model to incorporate information from different scales, which is combined with top-down detectors and global occurrence information [12] to guide the graphical inference. Boix et al. [16] introduced the so-called “harmony potential”, which integrates global category label information as well as object detectors in order to better fuse global and local information. Although CRF-based models have a strong generalization ability to integrate different cues, the modelling and training of these kinds of methods are relatively difficult due to the expensive estimation of the partition function.
1.1.5 Obtaining Contextual Information
The importance of context for good classification and detection has been witnessed in many papers [17, 73]. In practice, context can be any information not directly produced by the appearance of an object. In many cases, local appearance is not enough to correctly classify a local image region, while context is able to provide useful guidance for disambiguation. The works in [17, 73] introduced contextualization techniques between detection and classification. Numerous contextual cues, like the global scene layout [74], object detection [64, 65, 75, 76], as well as the interaction between objects and regions [77–80], are integrated in a variety of visual recognition models. The method proposed by Cinbis and Sclaroff [80] makes use of relative locations and scores between pairs of detections. Tu and Bai proposed a learning algorithm for high-level vision problems (e.g. medical image segmentation), called auto-context [81]. It first learns a classifier on local image patches, and then the discriminative probability maps are fed into a new classifier as contextual information to boost the performance. Bulò et al. [82] introduced structured local predictors to exploit contextual relations from complex interactions between labels and intermediate representations of the image data in a convex structured learning problem. The above contextualization methods are mostly based on detection and classification. In this work we will comprehensively explore how to utilize different context cues to help the object segmentation task.
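The staged idea behind auto-context described above can be sketched on a toy 1-D signal as follows. This is only an illustration under stated assumptions, not the implementation of [81]. The plain gradient-descent logistic regression, the fixed context offsets, the uniform initial probability map and all function names are choices made here for brevity.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=300, lr=0.5):
    """Gradient-descent logistic regression; the last weight is the bias."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        w -= lr * Xb.T @ (_sigmoid(Xb @ w) - y) / len(y)
    return w

def predict_proba(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return _sigmoid(Xb @ w)

def context_features(prob, offsets=(-2, -1, 1, 2)):
    """Sample the current probability map at fixed offsets around each position."""
    pad = max(abs(o) for o in offsets)
    padded = np.pad(prob, pad, mode="edge")
    return np.stack([padded[pad + o: pad + o + len(prob)] for o in offsets], axis=1)

def auto_context_train(X, y, n_stages=2):
    """Each stage sees the local features plus the previous stage's probabilities."""
    stages = []
    prob = np.full(len(y), 0.5)            # assumed uniform initial probability map
    for _ in range(n_stages):
        Z = np.hstack([X, context_features(prob)])
        w = fit_logistic(Z, y)
        prob = predict_proba(Z, w)
        stages.append(w)
    return stages, prob

# Toy 1-D "image": a noisy step edge with one local feature per position.
rng = np.random.default_rng(0)
y = (np.arange(40) >= 20).astype(float)
X = (y + rng.normal(scale=0.8, size=40)).reshape(-1, 1)
stages, prob = auto_context_train(X, y)
```

In the first stage the context columns are constant, so the classifier effectively relies on local appearance only. Later stages can additionally exploit the labels predicted at neighbouring positions.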
1.2 Thesis Focus and Contributions
The topic of this thesis is semantic object segmentation. It has been shown that image classification, object detection and semantic segmentation are highly correlated; however, most contextualization methods operate between detection and classification. Currently, the research gaps in context-based segmentation are as follows:
• Most of the current methods are still based on local features, without considering the contextual information around the regions or at a higher scale level. Usually the relations between the object of interest and other regions (e.g. background) or the higher-level image are not well modeled.
• Detection provides a very important top-down guidance in segmentation but is not fully utilized.
• Since segmentation is pixel-wise labeling, the annotation burden is much heavier than for classification and detection; thus how to train a good enough model based on the limited training data is a great challenge. Furthermore, most of the benchmarks, e.g. [1], contain large portions of unlabeled background regions without explicit labeling; how to exploit such clutter to boost the segmentation of the object of interest is also very important.
To narrow the above research gaps, this thesis proposes a context-based object segmentation system, aiming to investigate the effects of different contexts, especially detection and unlabeled background regions. The main contributions of the thesis are as follows:
• Detection based sparse reconstruction: we first present one of the main contributions of the thesis: a unified framework for detection-based segmentation via coupled global and local sparse representations. The chapter begins with the formulation of this coupled sparse reconstruction framework based on the predicted bounding boxes from object detectors. We then provide an efficient alternating optimization scheme to jointly pursue the optimal latent mask and the sparse reconstruction coefficients, and some post-processing techniques to generate the final segmentation result. The chapter concludes with extensive results on various benchmark datasets, such as the PASCAL VOC segmentation dataset [1] and the Weizmann Horses Dataset [2], with detailed comparison with the state-of-the-art methods.
• Segmentation without annotating segments: after introducing the supervised segmentation framework based on sparse reconstruction, and motivated by the fact that segment annotations are usually much more difficult to obtain than object bounding boxes, in Chapter 3 we propose another detection-based segmentation approach without any additional segmentation annotation from either the training set or user interaction. We first introduce a shape-guided Markov Random Field (MRF) model, and then show how to obtain such shape guidance from a set of object segment hypotheses by a simple voting scheme. We then apply the derived shape guidance in the subsequent graph-cut-based figure-ground segmentation. The final segmentation result is obtained by merging the segmentation results in the bounding boxes.
• Background context augmented hypothesis graph: besides global classification and detection, we further explore how to obtain contextual cues from the unlabeled background regions that are usually ignored. We first introduce a fully connected CRF model that allows the integration of different contextual cues. The CRF model is considered over a set of overlapping CPMC object hypotheses [56]. Then the background contextual cues are learned in a weakly-supervised manner and applied in the unary term of the corresponding foreground regions. The final segmentation result is obtained via maximum-a-posteriori (MAP) inference, where the segments are merged in a sequential aggregation manner.
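As an illustration of the "simple voting scheme" mentioned in the second contribution above, the sketch below averages a set of binary segment hypotheses inside one bounding box into a soft shape-guidance map. It assumes unweighted votes and hypotheses already resized to a common resolution. The exact scheme used in Chapter 3 may differ, and the function name is hypothetical.

```python
import numpy as np

def vote_shape_guidance(hypothesis_masks):
    """Per-pixel fraction of hypotheses that label the pixel as foreground."""
    masks = np.asarray(hypothesis_masks, dtype=float)   # shape (k, h, w), values in {0, 1}
    return masks.mean(axis=0)                           # shape (h, w), values in [0, 1]

# Four hypothetical 2x2 segment hypotheses for the same bounding box.
masks = np.array([
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [1, 0]],
    [[1, 0], [0, 0]],
])
guidance = vote_shape_guidance(masks)
```

Pixels on which most hypotheses agree receive values close to 1 or 0, and such a map could then serve as a soft prior in a subsequent figure-ground labeling step.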
1.3 Organization of the thesis
The remaining parts of the thesis are organized as follows: Chapter 2 introduces the detection-based segmentation method via optimal sparse reconstruction, Chapter 3 presents the learning-free pipeline of segmentation without annotating segments, while Chapter 4 introduces an integrative hypothesis graph augmented by background context and other contextual information. Finally, in Chapter 5, we discuss some promising future directions and conclude the thesis.
1.3.1 Relevant Publications
The main material in this thesis comes from journal articles and conference proceedings. The relevant publications of each chapter are listed as follows:
Chapter 2:
• Wei Xia, Zheng Song, Jiashi Feng, Shuicheng Yan, Loong Fah Cheong. Segmentation over Detection by Coupled Global and Local Sparse Representations. In ECCV, Firenze, Italy, Oct 7-13, 2012 [83].
• Wei Xia, Csaba Domokos, Loong Fah Cheong, Shuicheng Yan. Segmentation over Detection by Optimal Sparse Representations. In IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2014 [84].
Chapter 3:
• Wei Xia, Csaba Domokos, Jian Dong, Loong Fah Cheong, Shuicheng Yan. Semantic Segmentation without Annotating Segments. In ICCV, 2013 [85].
Chapter 4:
• Wei Xia, Csaba Domokos, Loong Fah Cheong, Shuicheng Yan. Background Context Augmented Hypothesis Graph for Object Segmentation. In IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2014 [86].
Chapter 2
Segmentation over Detection via Optimal Sparse Reconstructions
In this chapter, we address the problem of semantic segmentation, where the possible class labels are from a pre-defined set. We exploit top-down guidance, i.e. the coarse localization of the objects and their class labels, provided by object detectors. For each detected bounding box, figure-ground segmentation is performed, and the final result is achieved by merging the figure-ground segmentations. The main idea of the proposed approach is to reformulate the figure-ground segmentation problem as sparse reconstruction, pursuing the object mask in a non-parametric manner. The latent segmentation mask should be coherent subject to sparse error caused by intra-category diversity; thus the object masks are inferred by making use of sparse representations over the training set. In order to handle local spatial deformations, local patch-level masks are also considered and inferred by sparse representations over the spatially nearby patches. The sparse reconstruction coefficients and the latent mask are alternately optimized by applying the Lasso algorithm and the Accelerated Proximal Gradient method. The proposed formulation results in a convex optimization problem, thus the globally optimal solution is achieved. In this chapter we provide a theoretical analysis of the convergence and optimality. We also give an extended numerical analysis of the proposed algorithm and a comprehensive comparison with the related semantic segmentation methods on the challenging PASCAL VOC object segmentation datasets and the Weizmann Horses Dataset. The experimental results demonstrate that the proposed algorithm achieves competitive performance compared to the state of the art.
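The sparse reconstruction idea summarized above can be illustrated with a deliberately simplified sketch: a test box feature is reconstructed over training box features with an L1 penalty, and the resulting coefficients are reused to combine the training masks. This is not the coupled formulation of this chapter. It assumes fixed features and a single Lasso step with no alternating update of the latent mask. Iterative soft-thresholding (ISTA) stands in as a simple Lasso solver, and the linear mask transfer and all names are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(D, x, lam, n_iter=500):
    """Minimise 0.5 * ||x - D a||^2 + lam * ||a||_1 by iterative soft-thresholding."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)    # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = soft_threshold(a - step * grad, step * lam)
    return a

def transfer_mask(train_masks, coeffs):
    """Combine training masks with the sparse coefficients and clip to [0, 1]."""
    soft = np.tensordot(coeffs, train_masks, axes=1)
    return np.clip(soft, 0.0, 1.0)

# Toy data: three training boxes with 4-D features (columns of D) and 2x2 masks.
D = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
train_masks = np.array([[[1, 1], [0, 0]],
                        [[0, 0], [1, 1]],
                        [[1, 0], [1, 0]]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])    # test feature identical to training sample 0
a = lasso_ista(D, x, lam=0.1)
mask = transfer_mask(train_masks, a)
```

In this toy case only the coefficient of the matching training sample is non-zero, so the transferred soft mask is a scaled copy of that sample's mask. The L1 penalty is what suppresses the contribution of unrelated training samples.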
2.1 Introduction
Localizing and recognizing objects efficiently in a complex visual scene is one of the amazing capabilities of the human cognitive system. Investigating how such capabilities can be emulated by computers has thus attracted a lot of interest in the field of computer vision and machine learning [11, 66, 81, 87]. The core sub-tasks of this area are object classification, detection and segmentation [10, 11, 87]. Classification tells whether an image contains a certain object or not, detection localizes the object by providing its bounding box, while segmentation aims to assign a class label to each pixel. A special case of segmentation is called semantic segmentation, where the possible class labels are from a pre-defined set. In this chapter we focus on semantic segmentation.
The recent segmentation methods mainly fall into two groups. The first one involves bottom-up methods, which first extract various low- and mid-level image features (e.g. texton, SIFT, HOG) and try to find homogeneous segments based on those image cues [11]. A common approach is to construct a graphical representation of the problem [7], where the nodes represent pixels or super-pixels and the edges connect neighbouring nodes, and then the graph is partitioned into several sub-graphs corresponding to different object regions. In general, bottom-up approaches provide visually coherent segments, but the main difficulty is usually due to ill-posed local information. The other group of segmentation methods includes top-down approaches, which generally rely on acquired class-specific information, e.g. class label, object bounding box [60, 62] and shape model [63]. In the case of top-down methods the main challenge is to obtain the object templates, especially for objects with relatively large intra-class appearance and pose variations.
Both bottom-up and top-down approaches suffer from some intrinsic drawbacks, thus some methods [12, 16, 72] utilize a mixture of those approaches. Ladický et al. [72] proposed a multilevel hierarchical Conditional Random Field (CRF) model to incorporate information from different scales, which is combined with top-down detectors and global occurrence information [12] to guide the graphical inference. Boix et al. [16] introduced the so-called “harmony potential”, which integrates global category label information as well as object detectors in order to better fuse global and local information. Although CRF-based models have a strong generalization ability to integrate different cues, the modelling and training of these kinds of methods are relatively difficult due to the expensive estimation of the partition function.
2.1.1 Motivation and Contributions
Due to the intrinsic drawbacks of the purely bottom-up methods, we aim to exploit top-down guidance as well. Motivated by the rising performance of object detectors, detection is performed in many applications, which gives the coarse localization of the object by a bounding box. Thus object detectors [35, 36] provide a useful top-down guidance. However, they lack the accuracy to precisely identify the object at pixel level. Intuitively, for each detected bounding box figure-ground segmentation can be performed.
Sparse representation has had significant impact in several computer vision problems, for example in face recognition [88], image classification and segmentation [28, 67], as well as in motion and data segmentation [89]. Furthermore, it has previously been verified [90] that sparse coding provides better results in finding related samples for a given image than other reconstruction methods. In practice, due to the large variance of poses, colors and texture, the training dataset should be very large in order to handle such variations. However, in our case, the object bounding box is available and it is normalized, hence the variance in scales and positions is alleviated to some extent. Motivated by the efficiency of sparse representation, the figure-ground segmentation problem is reformulated as sparse reconstruction pursuing the latent object mask in a non-parametric manner. In this chapter, we