by morphological hole filling and super-pixel refinement serving as post-processing. Moreover, by incorporating other kinds of contextual cues, like global image classification and object detection cues, new state-of-the-art performance is achieved by our proposed solution, as experimentally verified on the challenging PASCAL VOC 2012 and MSRC-21 object segmentation datasets.
Over the past few years, various approaches have been proposed to solve this problem. The bottom-up segment ranking approaches generate a large pool of object hypotheses. The regions are then scored and ranked based on their "objectness", and finally they are combined to obtain the final segmentation [9, 11, 65]. However, the inter-segment relationships and the segment background information are generally not very well modelled, especially for visually confusing categories. Hence these approaches still cannot guarantee the perfect classification and ranking of the segments. The detection-based methods [64, 83, 85] utilize top-down guidance obtained from object detectors and refine the coarse object localization within the predicted bounding boxes. The main shortcoming of those methods is that poor detection results or mis-detections will deteriorate the segmentation performance, especially in the case of interacting objects. Besides, there are also some methods which consider a graphical representation of the problem, where the nodes represent pixels or super-pixels, and the graph is partitioned into several sub-graphs corresponding to different object regions. These models, which mainly rely on a conditional random field (CRF) [16, 72, 104, 106], are very popular in semantic object segmentation. Although these methods have a great generalization ability, the main bottleneck, as shown in [107], is the lack of rich features that can discriminate the local patches of similar categories. Since the breakthrough of deep learning in image classification [20], great progress has also been witnessed in other visual recognition tasks, like semantic segmentation [39, 108, 109]. Recently, it has become a popular approach to use a convolutional neural network (CNN) trained from raw pixels in order to extract feature vectors with great representational power.
In this chapter, we propose a CRF model based on a fully connected segment hypothesis graph in order to incorporate the interaction between the segments. It is motivated by the observation that when the hypotheses are classified independently, without considering the inter-segment relationships as well as other high-level cues, it is hard to distinguish some confusing classes [11]. On the other hand, using the semantically more meaningful segment hypotheses as the nodes of the CRF model will in turn alleviate its low discriminating power among local patches. Furthermore, the unlabeled regions (i.e. the background) are often discarded, but they may contain a large portion of the pixels. For example, in the PASCAL VOC 2012 TrainVal segmentation dataset [1], 69.3% of the pixels belong to the background. Li et al. [77] showed that the background regions actually contain useful contextual information for accurate recognition. For instance, a plane and a bird are more likely to occur in the presence of a sky background (see Fig. 4.1). Intuitively, the contextual information obtained by learning the relationship between the foreground objects and their background regions of interest can augment the CRF model. The main contributions of this work are summarized as follows:
• We propose a CRF-based solution over a hypothesis graph that utilizes the relationships between the overlapping object-level segments, which are more semantically informative than disjoint local regions, like pixels and super-pixels (see Section 4.3). A fully connected graph is also employed to enhance the interaction between the segments.
• Obviously, with more annotated data, the learned model will be more accurate. Nevertheless, in many situations, a large set of training data does not exist. Therefore, we also propose a novel background-aware approach to help re-score the unary term of the segment hypotheses by extracting contextual cues from the background regions without explicit labelling of the background categories (see Section 4.3.2), which alleviates the annotation burden for the cluttered background categories. Moreover, due to the great generalization ability of the proposed CRF model, various contextual cues, like global image classification and object detection cues, can easily be integrated into our method to further boost the performance.

Figure 4.1: Illustration of the role of background context information (e.g. sky or indoor). In many cases it can help recognize the objects (e.g. the bird instead of a boat, or the potted plant instead of a tree).
• We conduct a comprehensive analysis to verify the roles of different contextual cues and the improvement provided by the proposed background context (see Section 4.5.1), and we demonstrate the superiority of the proposed method over the state-of-the-art on benchmark datasets like PASCAL VOC [1] and MSRC-21 [4] (see Section 4.5.2).
where figure-ground hypotheses are generated by solving the constrained parametric min-cut (CPMC) [56] problem with various choices of a parameter. The hypotheses are then ranked and classified using Support Vector Regression (SVR) [11]. In [59] a generative model is introduced, which maximizes the composite likelihood of the underlying statistical model by applying the expectation-maximization algorithm. In [9], second-order pooling is proposed to encode the second-order statistics of local descriptors inside a region. The segments generated by CPMC [56] generally have a very high overlap ratio with the ground-truth objects¹, achieving the state-of-the-art performance. Compared with our proposed work, however, those methods mainly focus on the foreground hypothesis segments and ignore the background regions that might be informative; furthermore, no inter-segment relation is modelled.

Figure 4.2: Overview of the proposed solution. First, a pool of object hypotheses is generated. A fully connected hypothesis graph is then built to model the relationship between the possibly overlapping segments. A novel background contextual cue is predicted for the segments via sub-category classifiers. The scores are fed into the CRF model together with other cues like image classification and object detection. Finally, the coarse segmentations obtained via MAP inference are merged and post-processed to achieve the final segmentation result.
CRF-based models have also been extended to incorporate information from different scales, like object detectors in [12] and object occurrence in [104]. Boix et al. [16] also incorporated the global classification as an extra higher-order potential, called the harmony potential, in the CRF formulation. Yadollahpour et al. [107] introduced a two-stage approach by discriminatively re-ranking the M-best diverse segmentations obtained by the CRF model. Generally, those approaches utilize different contextual cues to help classify the local patches, which are intrinsically not as discriminative as the object-level hypotheses used in our framework. If the local classification is not accurate enough, many mis-labellings cannot be recovered even with carefully designed optimization algorithms. Ion et al. [99, 110, 111] also considered a CRF model over the set of possibly overlapping figure-ground hypotheses. Given a bag of figure-ground segmentations, a joint probability distribution is provided over the compatible image interpretations as well as the labellings of the composite tilings, which are cast as sets of maximal cliques. Some contextual information, like the pairwise compatibilities among the spatially neighboring segments, is also modeled. However, in contrast to our method, in [99, 110] only the valid compositions of hypotheses are used (i.e. the maximal clique tilings have no spatial overlap), whereas we consider all the generated segments in a fully connected CRF model. Furthermore, neither contextual modeling from the unlabeled background regions nor the global classification and detection cues are used, as opposed to the proposed model.

¹The maximum overlap with the ground-truth is 81.2% on average on the PASCAL VOC 2012 TrainVal dataset [1].
One approach applies a multi-scale CNN to extract dense feature vectors that capture texture, shape and contextual information. By making use of such a representation, multiple post-processing methods (e.g. a CRF model) are applied to produce the final labelling from a pool of segmentation components. In [109] a recurrent CNN is proposed for scene labelling, allowing for a larger input context while limiting the capacity of the model. Trained in an end-to-end manner over raw pixels, and yet not dependent on any segmentation technique or task-specific features, the system can identify and correct its own errors, leading to state-of-the-art performance in several scene labelling benchmarks. Girshick et al. [39] applied a CNN framework pre-trained on the ImageNet classification dataset [31] to extract features on the object segment proposals, and used linear SVMs to classify the segment proposals. Although the CNN-based features have a great representational power compared to hand-crafted features, those models usually have millions of parameters to learn; thus they require a tremendous quantity of annotated training data, which is quite difficult to obtain in some cases.
Another line of work uses a label transfer model for scene labelling, to transfer and warp the annotations in the training set to a test image by matching dense SIFT flow between the training and test samples. Tighe and Lazebnik [113] presented another non-parametric approach by matching a test image against the training set, followed by super-pixel level matching and Markov random field (MRF) optimization to incorporate the neighborhood context. Myeong and Lee [105] applied higher-order semantic contextual relationships between the objects in a non-parametric manner. In [114] the relevance of individual feature channels is learned by using a locally adaptive feature metric based on small patches and simple gradient, color and location features.
Different kinds of contextual cues, like the interaction between objects and regions [77–80], were successfully integrated into object recognition frameworks. In [74], a holistic CRF model is presented, which integrates different levels of contextual cues like scene labelling and detection. Li et al. [77] extract contextual cues from the unlabeled regions in order to boost traditional object detection. Heitz and Koller [78] proposed a probabilistic "things and stuff" model to consider the contextual relationship between the regions and detected objects to boost the detection performance. The method proposed by Cinbis and Sclaroff [80] makes use of relative locations and scores between pairs of detections. In [115], stacked sequential scale-space Taylor coefficients are proposed to gather contextual information by sampling the posterior label field sequentially, which achieved the state-of-the-art performance on the MSRC-21 benchmark [4]. In [79], the context information is obtained in a supervised manner. However, the background annotations are generally very difficult and time-consuming to obtain in practice due to the heavy clutter. Although employing the background information to help classify the foreground objects is not a new idea, the novelty of our approach mainly lies in how the background context (BC) information is obtained: in contrast to the previous methods, it is extracted from the background regions without knowing the exact labelling of the background categories.
For an input image I, a pool of segment hypotheses is first generated by applying the method proposed in [11, 56], which provides visually coherent segments. A fully connected CRF model is then built over the segments, where each segment is assigned a label from a finite predefined label set. The contextual information of the background regions for each class, called the background context (see Section 4.3.2), as well as other kinds of cues, is also extracted and applied to augment the unary term. After calculating the optimal labelling for the segments, they are projected back onto I and are merged into the final segmentation result, followed by some simple post-processing techniques (see Section 4.3.3). The proposed pipeline is shown in Fig. 4.2.
4.3.1 CRF-based Formulation
Here, the graphical representation of the labelling problem is briefly introduced. We build a fully connected graph over the segment hypotheses and consider the energy

$$E(\mathbf{x}) = \sum_{u} \psi_u(x_u) + \sum_{u<v} \psi_{uv}(x_u, x_v),$$

where $\psi_u(x_u)$ expresses the local confidence of the label $x_u \in \mathcal{L}$ for the segment $S_u$, and $\psi_{uv}(x_u, x_v)$ is the pairwise term defined between the nodes. The goal is to find the optimal labelling:

$$\mathbf{x}^* = \arg\min_{\mathbf{x}} E(\mathbf{x}).$$
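To make the formulation concrete, the following minimal sketch evaluates the energy of a candidate labelling over a fully connected hypothesis graph; the array layout and all names are our own illustration, not the original implementation.

```python
import numpy as np

def crf_energy(labels, unary, pairwise_kernel, compatibility):
    """Evaluate E(x) = sum_u psi_u(x_u) + sum_{u<v} psi_uv(x_u, x_v)
    over a fully connected graph of n segment hypotheses.

    labels          : (n,) int array, one label per segment
    unary           : (n, L) array, unary[u, l] = psi_u(l)
    pairwise_kernel : (n, n) array, kernel value for every segment pair
    compatibility   : (L, L) array, label compatibility mu(l, l')
    """
    n = len(labels)
    energy = unary[np.arange(n), labels].sum()       # sum of unary terms
    for u in range(n):
        for v in range(u + 1, n):                    # every unordered pair
            energy += compatibility[labels[u], labels[v]] * pairwise_kernel[u, v]
    return energy
```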
Unary term. In this term we incorporate different kinds of cues. For this sake, the negative log-likelihood is applied [16]:

$$\psi_u(x_u) = -\log w(x_u), \tag{4.3}$$

where $w(x_u)$ is the combined score of the different cues for the label $x_u$, with combination weights learned on the validation set (for more details please refer to Section 4.4).

Pairwise term. We adopt the Gaussian-kernel form of the fully connected CRF [104, 106]:

$$\psi_{uv}(x_u, x_v) = \mu(x_u, x_v) \sum_{j=1}^{\tilde{n}} w^{(j)} k^{(j)}(u, v), \tag{4.4}$$

where $\mu$ is the label compatibility function and $\tilde{n}$ is the number of the involved kernels, defined as Gaussian kernels

$$k^{(j)}(u, v) = \exp\Big(-\frac{\|f^{(j)}_u - f^{(j)}_v\|^2}{2\sigma_j^2}\Big)$$

over per-segment feature vectors $f^{(j)}$ (see Section 4.4 for the concrete choices).
Figure 4.3: Exemplar sub-category clusters for the horse category from the PASCAL VOC 2012 TrainVal dataset [1]. Each row shows some images with a certain sub-category. It is observed that each cluster shares significant consistency among both the foreground horse objects and the background regions.
Given such a Gaussian pairwise term, a very efficient inference can be performed, as shown in [106]. A simple Potts compatibility penalizes the segments that are assigned different labels, but it is insensitive to the compatibility between particular label pairs; a general function $\mu(x_u, x_v)$ also considers the interactions between labels.
4.3.2 Background Context Modeling
In this section, we introduce how we model and obtain the contextual information from unlabeled background regions in a weakly-supervised manner, which is used in the unary term in (4.3). Assume that we are given a set of training images with ground-truth object annotations; for notational convenience, we use a single indexing by i for all the ground-truth regions as $\{R_i\}_{i=1}^{\tilde{m}}$.
Motivated by the large pose, scale and location variance of the objects, we search for sub-category clusters containing objects of interest with similar appearance and pose. To this end, we follow the sub-category learning method proposed in [116], where the latent sub-category classifier is formulated as the following optimization problem that minimizes the trade-off between the regularization term and the hinge loss [116]:

$$\arg\min_{w_1, \dots, w_t} \; \frac{1}{2} \sum_{k=1}^{t} \|w_k\|^2 + C \sum_{i=1}^{\tilde{m}} \max\Big(0,\, 1 - y_i \max_{k} w_k^{\top} \phi(R_i)\Big), \tag{4.6}$$

where $\phi(R_i)$ denotes the feature vector of the region $R_i$ and $y_i$ its binary class label.
After training, the sub-category assignment for each ground-truth region is calculated as the index of the maximally responding classifier:

$$k_i^{*} = \arg\max_{k \in \{1, \dots, t\}} w_k^{\top} \phi(R_i).$$

As a consequence of the sub-category learning, for all images sharing the same sub-category label relating to a given class, there is a significant appearance consistency among both the objects of interest and the background regions (see Fig. 4.3).
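A schematic of the alternating latent sub-category mining could look as follows; the random initialization, the per-cluster retraining loop, and all names (mine_subcategories, X_pos, etc.) are our assumptions in the spirit of [116], not the original code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mine_subcategories(X_pos, X_neg, t, n_iter=10, seed=0):
    """Alternating latent sub-category mining (sketch): re-assign each
    positive region to its best-scoring sub-category classifier, then
    retrain one linear SVM per sub-category."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, t, size=len(X_pos))      # random initial clusters
    clfs = [LinearSVC(C=1.0) for _ in range(t)]
    fitted = [False] * t
    y_neg = -np.ones(len(X_neg))
    for _ in range(n_iter):
        for k in range(t):                            # retrain each sub-SVM
            mask = assign == k
            if mask.sum() < 2:                        # skip near-empty clusters
                continue
            X = np.vstack([X_pos[mask], X_neg])
            y = np.hstack([np.ones(mask.sum()), y_neg])
            clfs[k].fit(X, y)
            fitted[k] = True
        scores = np.stack(
            [c.decision_function(X_pos) if ok else np.full(len(X_pos), -np.inf)
             for c, ok in zip(clfs, fitted)], axis=1)
        assign = scores.argmax(axis=1)                # latent re-assignment
    return clfs, assign
```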
Our main purpose is to learn the background context (BC) for a given class. Assume that the class and sub-category labels (denoted by ℓ and k, respectively) are fixed. We can train any off-the-shelf classifier based on the feature vectors extracted from the training images over the corresponding background regions; in practice, we use the second-order pooling representation that is proposed in [9] and a simple linear SVM classifier. The background regions relating to ℓ and k are considered as the positive examples, while all the images that do not contain any region with the given pair of class and sub-category labels are considered as negative examples. In order to achieve balanced positive/negative sample ratios, random sub-sampling is applied to the negative examples of a certain sub-category. Since t sub-category classifiers are trained for every category, we need a technique to combine the scores across the t sub-categories. Simply, we can take the average or the maximum of the t scores (i.e. average- or max-pooling; see Section 4.5.1 for a comparison).
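A minimal sketch of this pooling step, assuming per-class lists of fitted linear classifiers (the function and argument names are hypothetical):

```python
import numpy as np

def bc_score(feats, bc_classifiers, pooling="max"):
    """Combine background-context scores of one class across its t
    sub-category SVMs by max- or average-pooling.

    feats          : (d,) feature vector of the background region
    bc_classifiers : list of t fitted linear classifiers for one class
    """
    scores = np.array([c.decision_function(feats[None, :])[0]
                       for c in bc_classifiers])
    return scores.max() if pooling == "max" else scores.mean()
```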
4.3.3 Merging and Post-processing
After the segments are labeled by the CRF framework via maximum-a-posteriori (MAP) inference, we need to merge them in order to generate the final segmentation result by labelling each pixel of I. We project the segments back onto the original image I and aggregate them by adopting the procedure described in [11]. More precisely, the labeled segments are sorted by their scores for the subsequent aggregation. First, the segment with the maximum score is selected as the seed. Other segments that have a sufficiently large intersection over union measure (see (4.9)) with the seed are aggregated with it, i.e. the scores for every pixel are calculated as the weighted sum of the scores corresponding to the segments it belongs to. The figure-ground mask for the class is then obtained by thresholding the pixel-level scores. Similarly to non-maximum suppression, all the segments selected in the above aggregation stage are dropped. Then the aggregation process is continued by selecting the next seed that has the maximum score among the remaining segments. Finally, the masks for all the different categories are obtained. For interacting masks from different categories, the interacting pixels are assigned the class labels that have the maximum scores.
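The seed-based aggregation can be sketched as follows; the IoU threshold and all names are illustrative, not the exact values or code used in the experiments. Thresholding the returned map yields the figure-ground mask described above.

```python
import numpy as np

def aggregate_segments(masks, scores, iou_thr=0.5):
    """Greedy seed-based aggregation of labeled segment masks for one class.

    masks  : (n, H, W) boolean segment masks projected back onto I
    scores : (n,) per-segment confidence from the CRF
    Returns an (H, W) accumulated score map.
    """
    def iou(a, b):
        return np.logical_and(a, b).sum() / max(np.logical_or(a, b).sum(), 1)

    remaining = list(np.argsort(scores)[::-1])        # high score first
    score_map = np.zeros(masks.shape[1:], dtype=float)
    while remaining:
        seed = remaining.pop(0)                       # next best seed
        group = [seed] + [i for i in remaining
                          if iou(masks[seed], masks[i]) > iou_thr]
        for i in group:                               # weighted score sum per pixel
            score_map[masks[i]] += scores[i]
        remaining = [i for i in remaining if i not in group]  # NMS-like drop
    return score_map
```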
In order to eliminate some artifacts in the obtained "coarse" segmentation result, morphological hole filling is applied as post-processing. In general, one can improve the result by using super-pixels to refine the object boundaries. Therefore, for the test image I, on average 400 super-pixels are extracted using the method of [6]. We remark that they generally align quite well with the actual object boundaries. If a sufficiently large portion of a super-pixel falls in the coarse segmentation of a given category, then its label is set to the same class label as the underlying coarse region. Throughout the chapter we use these threshold values, which were set experimentally (see Section 4.5.1).
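A minimal sketch of this refinement, assuming an integer super-pixel id map and treating the overlap threshold as a free parameter (cf. Section 4.5.1):

```python
import numpy as np

def superpixel_refine(coarse_labels, superpixels, min_overlap=0.5):
    """Snap the coarse segmentation to super-pixel boundaries: if a large
    enough fraction of a super-pixel lies inside a coarse region, the
    whole super-pixel inherits that region's class label.

    coarse_labels : (H, W) int label map (0 = background)
    superpixels   : (H, W) int super-pixel ids
    """
    refined = np.zeros_like(coarse_labels)
    for sp in np.unique(superpixels):
        region = superpixels == sp
        labels, counts = np.unique(coarse_labels[region], return_counts=True)
        best = counts.argmax()                         # dominant coarse label
        if labels[best] != 0 and counts[best] / region.sum() >= min_overlap:
            refined[region] = labels[best]
    return refined
```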
In contrast to the methods proposed in Chapter 2 and Chapter 3, for which the label of the interacting regions mainly depends on the detection scores, the sequential aggregation algorithm based on the optimal unary score is more robust. Since most of the CPMC segments share a large portion with the ground-truth objects, and the MAP labeling from the CRF model has taken into consideration all the different contextual information, including detection, the scores for the segments are more representative and accurate than those obtained by simply relying on object detection. Therefore, images with multiple interacting objects can be better handled.
The unary term of our model incorporates different levels of information:

• The segment score is the object-level hypothesis score computed by linear SVR, as described in [9].

• The BC cues are extracted as described in Section 4.3.2; according to the experiments in Section 4.5.1, we choose t = 15 sub-categories for every class.

• We also use the global image classification cue. In order to obtain it, we apply the method presented by Chen et al. in [18], which provides a set of class-level confidence scores.

• Another quite informative cue is provided by object detection. Based on the method presented in [33], we generate the confidence map over I for every category using the filter responses, and then the detection cue for a segment is obtained by computing the average of the pixel-level confidence scores.
defined in (4.4). In fact, the extracted features and the spatial location of the segments are used in the kernels: each segment is described by its second-order pooled representation, which contains various low-level features (e.g. SIFT, HOG), for each contextual cue. The applied kernels are

$$k^{(1)}(u, v) = \exp\Big(-\frac{\|f_u - f_v\|^2}{2\sigma_f^2}\Big), \qquad k^{(2)}(u, v) = \exp\Big(-\frac{\|c_u - c_v\|^2}{2\sigma_c^2}\Big),$$

where $f_u$ denotes the feature vector and $c_u$ the centroid of the segment $S_u$, and $\sigma_f$, $\sigma_c$ are the kernel bandwidths. This pairwise term enforces the nearby overlapping segments with similar appearances to take consistent labels.
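Evaluating these two kernels densely for all segment pairs is straightforward; a sketch with placeholder bandwidths:

```python
import numpy as np

def gaussian_kernels(feats, centroids, sigma_f=1.0, sigma_c=50.0):
    """Appearance and location kernels of the pairwise term (4.4) for all
    segment pairs; the bandwidth values here are placeholders.

    feats     : (n, d) per-segment appearance features
    centroids : (n, 2) segment centroid coordinates
    """
    df = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    dc = ((centroids[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    k_app = np.exp(-df / (2.0 * sigma_f ** 2))   # appearance kernel
    k_loc = np.exp(-dc / (2.0 * sigma_c ** 2))   # location kernel
    return k_app, k_loc
```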
The training follows a stage-wise procedure, as in [116]. First, the latent SVM for sub-category mining (see (4.6)) is trained over the ground-truth regions of each class. The weights of the unary term are then learned by maximizing the log-likelihood of the model on the training set [16]. For the L-BFGS optimization, the gradient of the partition function Z also needs to be estimated [16]. For the inference, since our fully connected CRF model is defined on a graph with Gaussian edge potentials, we can apply the very efficient technique proposed in [106], which is based on the mean-field approximation.
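For illustration, a naive O(n²) mean-field loop is sketched below; note that [106] obtains its speed from high-dimensional filtering, which this sketch deliberately omits, and all names are our own.

```python
import numpy as np

def mean_field(unary, kernel, compatibility, n_iter=10):
    """Naive mean-field updates for a fully connected CRF with Gaussian
    edge potentials.

    unary         : (n, L) unary potentials psi_u
    kernel        : (n, n) pairwise kernel with zero diagonal
    compatibility : (L, L) label compatibility mu
    Returns Q : (n, L) approximate marginals; Q.argmax(1) gives the MAP labels.
    """
    Q = np.exp(-unary)
    Q /= Q.sum(axis=1, keepdims=True)             # initialize from unaries
    for _ in range(n_iter):
        msg = kernel @ Q                          # aggregate neighbor beliefs
        pairwise = msg @ compatibility.T          # apply label compatibility
        Q = np.exp(-unary - pairwise)
        Q /= Q.sum(axis=1, keepdims=True)         # normalize per segment
    return Q
```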
On our test machine with 8 GB RAM, it takes on average 5 minutes on a single thread to generate the large pool of CPMC segments, which is the most time-consuming part of our pipeline. The BC cue extraction takes less than 0.2 s, since it only requires a linear SVM inference. The CRF inference for segment labelling is also very fast, with an average time of 0.2 s, due to the efficient inference method proposed in [106]. Finally, the extraction of the global classification and detection cues takes around 1 s and 2 s per image, respectively. Note that most parts of the algorithm can be computed in parallel and off-line, hence further speed-up can easily be achieved.

Table 4.1: The effects of different pooling methods with various numbers of sub-categories (t) for BC modeling, in terms of the average IoU accuracy, defined in (4.9), on the PASCAL VOC 2012 TrainVal dataset [1].

                    t = 1   t = 5   t = 10   t = 15   t = 20
  Average-pooling    35.5    45.1    45.9     46.1     45.2
  Max-pooling        35.5    45.6    46.1     46.5     46.1
In order to evaluate the performance of the proposed method, we conduct experiments on the latest PASCAL VOC 2012 object segmentation dataset [1], which consists of 20 object classes. Due to the large intra-class variability and object interaction, this dataset is among the most challenging ones in semantic segmentation; often several objects are contained per image. For quantitative evaluation, the intersection over union (IoU) measure is used.
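For reference, the IoU measure (4.9), assuming the standard PASCAL VOC definition for a predicted region P and ground-truth region G of a class, is

$$\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|} = \frac{TP}{TP + FP + FN}, \tag{4.9}$$

where TP, FP, and FN count the true positive, false positive, and false negative pixels, respectively.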
Here, we analyze the impact of the different parts of the proposed algorithm. To this end, a series of experiments is conducted on the TrainVal dataset [1], which contains 1464 images for training and 1449 validation images for testing.
Effects of the sub-category number t
Obviously, the quality of sub-category clustering affects the effectiveness of the proposed solution. We conduct experiments where only the BC cues are used in the CRF model. The value of t, i.e. the number of the sub-categories, is chosen from {1, 5, 10, 15, 20}, and both pooling strategies are evaluated for combining the BC scores across the sub-categories belonging to the same class. The obtained results are reported in Table 4.1. In the case of t = 1, i.e. no sub-category mining at all, the performance is quite low. This shows that a single classification scheme is unable to model the background regions, since they are usually quite cluttered in this dataset. It can be observed in Table 4.1 that max-pooling works better than average-pooling, since it gives a higher response for the most distinctive context regions with respect to the object of interest, whereas average-pooling also takes noisy and less-informative sub-category regions into consideration. We found that the best result occurs at t = 15. In the subsequent experiments, therefore, we fix t = 15 and apply max-pooling to combine the BC scores.
Effects of the post-processing parameters
The post-processing threshold values are set empirically via grid-search on the validation set. The estimation of these three parameters is illustrated in Fig. 4.4. If the threshold is too small, then most of the slightly overlapping segments will be suppressed. Based on these experiments, the best-performing values are fixed for the sequel.
Figure 4.4: The effects of the post-processing threshold values, evaluated via grid search (τ2 over 0.05–0.45 and a second threshold over 0.45–0.95), in terms of IoU accuracy on the PASCAL VOC 2012 TrainVal dataset [1].
Table 4.2: The effects of different CRF models in terms of the average IoU accuracy, defined in (4.9), on the PASCAL VOC 2012 TrainVal dataset [1].

  Without pairwise connection   Partially connected   Fully connected
Effects of the CRF model
Since one of the novelties of the proposed method lies in the fully connected CRF model over overlapping object-level segments, here we conduct experiments to validate the superiority of the proposed CRF model against the traditional one. The comparison is shown in Table 4.2. Our baseline method is a CRF model over non-overlapping segments without considering the neighborhood relations (obtained by setting the pairwise term to zero). For the non-overlapping segments, we generate mid-level segments based on the method proposed in [8], while for the partially connected CRF model we consider the K = 8 nearest neighbors based on the distance between the segment centroids. From Table 4.2, it is easily observed that the overlapping models achieve better performance than the non-overlapping ones. Adding connections to the CRF model also increases the segmentation accuracy. Compared to the baseline method, the fully connected CRF model over overlapping segments achieves an improvement of 2.32%. Furthermore, comparing the effects of the above two factors, it is observed that using overlapping segments gives a greater performance boost than simply increasing the connectivity of the CRF model. The reason might be that the generated segments, constituting object-level hypotheses, are at least partially aligned with the real object boundaries, and the error or mis-labelling of a certain individual segment can be suppressed by merging multiple overlapping segments.
Effects of different contextual cues
We also evaluate the effects of the different contextual cues (denoted by CLS, DET and BC for the global image classification cues, object detection cues and the background context cues, respectively), which are applied in our hypothesis graph.
Trang 20bi ke (2 3.85)
bi rd (43 72)
bo
at (42.06)
bo ttle (41.41)
bu
s (6 4.73)
ca
r (6 5.01)
ca
t (5 6.34)
ch air (11.89)
co
w (39.11)
ta ble (19.25)
do
g (42 62)
ho rse (38.37)
m/
bik
e (52.24)
pe rso
n (48.75)
pl ant (3 3.41)
sh ee
p (45.05)
so
fa (27.29)
tra
in (5 8.79)
tv (4 8.69)
av
g (45.43 )
CLS,DET BC Full W/o pairwise term
Figure 4.5: The improvement of the IoU accuracy on the PASCAL VOC 2012 TrainVal dataset [1] after separately applying the supervised cues (i.e. CLS and DET), the weakly-supervised BC cues, and their combination, referred to as the Full model. The results obtained by using the unary term exclusively are also shown. The baseline accuracy is shown in parentheses.
Figure 4.6: Qualitative illustration of the effects of the different contextual cues, like image classification (CLS), object detection (DET) and the background context (BC). When all three cues are applied, we refer to it as the Full model. The results corresponding to the last column are obtained by using the unary terms exclusively in the CRF model. The results are overlaid with white boundaries and different categories are highlighted with different colors.
The CRF model without using any contextual cues is considered as the baseline here; compared to independent segment classification, it already provides on average 0.38% of improvement. This shows that the hypothesis graph modeling the relationship between the segments is able to increase the performance. We first apply the supervised CLS and DET cues only in the CRF model, and then we test our method by employing the background-aware BC cues exclusively. Moreover, we combine all the three different cues, referred to as the Full model. The detailed results are presented in Fig. 4.5, and Fig. 4.6 illustrates each case qualitatively. The CLS and DET cues improve the performance on average by 0.82% compared to the baseline. Although the CLS and DET cues are quite informative, for some categories (e.g. motorbike and tv) the performance is worse than the baseline due to inaccurate image classification and object detection. We obtain a slightly better average improvement of 1.03% by applying the BC cues exclusively.

Table 4.3: State-of-the-art comparison in terms of IoU accuracy, defined in (4.9), obtained on the PASCAL VOC 2012 Test dataset [1]. The results in parentheses are evaluated on an extended training dataset containing extra annotations provided in [3].

  tv    47.8   53.5   46.8   44.7 (44.5)   43.3 (43.1)   37.6 (39.2)   49.3 (47.5)
  avg   47.3   48.0   48.1   44.8 (46.7)   45.4 (46.8)   47.0 (47.5)   48.6 (49.0)
Comparing the different cues, the proposed BC cue is the most significant one. The Full model achieves the highest improvement for each category, providing an improvement of 1.72% on average. We also run our approach without using the pairwise term (setting it to zero) to check the impact of the fully connected CRF model (see Fig. 4.6); this results in a 0.47% decrease on average against the Full model (see Fig. 4.5), which validates that the pairwise relationships have a substantial contribution.
Note that generally the BC cues can improve the classification accuracy of the foreground objects, since the background cues can model not only the correlation between the background and the foreground object, but also the inter-relations between other labeled foreground objects and the object of interest. In most cases, the t = 15 sub-categories can model the correlation between the background and the foreground object well. However, in some other cases, when the background is very unusual and has very low correlation with the foreground objects, the extracted BC cue will not be informative enough. Note that all the contextual cues are integrated into the CRF model in a soft-constrained manner, so other contextual cues like the CLS and DET cues will somewhat alleviate the bias of the BC cue, preventing the contextual model from failing completely. Two failure cases are shown in Fig. 4.7. The first failure image shows the bottom of a car, which is not modeled by any of the background sub-categories, and the classification and detection also fail in such an unusual pose. In the second image the contrast between the foreground boat and the background is quite low, resulting in the same difficulties. Furthermore, the background class water looks very similar to grass, making the BC cue extremely inaccurate. Indeed, most of the current methods, including the state-of-the-art [11, 85], also fail in such cases.

Figure 4.7: Exemplar cases where the BC cue and the other contextual cues fail on the PASCAL VOC 2012 Test dataset [1].
4.5.2 Comparison with the State-of-the-art
PASCAL VOC 2012
We train our model on the PASCAL VOC 2012 TrainVal dataset and evaluate it on the Test set, which contains 1456 images [1]. We compare our method with the state-of-the-art approaches; the detailed comparison of the proposed method with other top-performing algorithms can be found in Table 4.3. Three of the competing methods build on second-order pooling [9]: the method referred to as CMBR-O2P-CPMC-LIN uses linear SVR with second-order pooling [9]; the method referred to as O2P-CPMC-CSI is mainly based on [59]; and we refer to the combined method of [99] and [9] as O2P-CPMC-FGT-SEGM. It is worth noting that almost all methods in this challenge are combinations of some previous methods. The winner of the PASCAL VOC 2012 Segmentation Challenge is a detection-based approach, which first computes the optimal sparse representation of the training objects [83] and provides an initial segmentation mask for each bounding box. These masks are then used in an MRF formulation to obtain the final result. The method proposed by Xia et al. [85] is also detection-based: it estimates a shape guidance for each object bounding box based on a set of generated CPMC segments [56], and these shape guidances are applied in the subsequent MRF formulation. From Table 4.3, it can be seen that the proposed method achieves the best IoU accuracy in 6 out of 21 categories (including the background class), with the highest average performance of 48.6% among the competing methods. It also performs better in the majority of the 21 categories than the competing methods, which verifies that inter-segment relationships and the other contextual cues can further boost the discriminative ability for the generated object hypotheses.
We also evaluate our approach using an extended training dataset, which is obtained from the PASCAL VOC object dataset by adding the extra ground-truth annotations provided in [3]. As shown in Table 4.3, when our method is trained on the extended set, it achieves 49.0% average segmentation accuracy, which is higher than that of the other competing methods. Moreover, our proposed method provides the best average accuracy in 14 out of 21 categories, which further validates the effectiveness of the proposed solution.

Comparing with Xia et al. [85] and the method DET-3 proposed in Chapter 3, although the computational cost of the proposed method is a little bit higher than that of DET-3, the average performance across the 21 categories (including the background) is increased from 48.0% to 48.6%. Considering that DET-3 is the combination of several detectors, while the proposed approach only uses a single state-of-the-art detector, the performance gain is reasonable.

Figure 4.8: Exemplar segmentation results on the PASCAL VOC 2012 Test dataset [1]. The results are overlaid with white boundaries and different categories are highlighted with different colors. Some failure cases due to wrong labelling and/or missed prediction are shown in the last column. For instance, the dog body is wrongly labeled as cat in the first row, and the cloth is labelled as human due to the low scores. The bird is missed, since a large portion of it is occluded. The bus and the bottle in the last two rows are heavily occluded and the contrast is also too low.
Some qualitative results are shown in Fig. 4.8. Based on the obtained results, it is fair to say that the proposed method works well on images with a single object and with multiple objects from the same category, and it can handle interacting objects from different classes in most cases. However, there still exist some failure cases (see the last column of Fig. 4.8), mainly due to the wrong classification of the segments or the missed prediction of the object.
Table 4.4: State-of-the-art comparison on the MSRC-21 dataset [4] in terms of per-class accuracy, defined in (4.10). The Prop-WeaklySup method uses no annotation data from the 6 background categories and is evaluated only on the 15 foreground objects, while the Prop-FullySup method uses the annotation data across all the 21 categories. FgAvg is the average per-class performance of the 15 foreground objects, while FullAvg is the average performance of all the 21 categories.
Since one of the main contributions of this chapter lies in its ability to obtain the background context in a weakly-supervised manner and in how it is used to augment the foreground categories, we first evaluate our proposed method focusing only on the 15 foreground objects, without using the annotated background ground-truth.
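For reference, the per-class accuracy (4.10), assuming its standard definition, is, for a class c,

$$\mathrm{Acc}(c) = \frac{n_{cc}}{\sum_{c'} n_{cc'}}, \tag{4.10}$$

where $n_{cc'}$ denotes the number of pixels with ground-truth label c that are predicted as c'.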