Improving Semantic Texton Forests with a Markov Random Field for Image Segmentation

Dinh Viet Sang
Hanoi University of Science and Technology
sangdv@soict.hust.edu.vn

Mai Dinh Loi
Hanoi University of Science and Technology
csmloi89@gmail.com

Nguyen Tien Quang
Hanoi University of Science and Technology
octagon9x@gmail.com

Huynh Thi Thanh Binh
Hanoi University of Science and Technology
binhht@soict.hust.edu.vn

Nguyen Thi Thuy
Vietnam National University of Agriculture
ntthuy@vnua.edu.vn
ABSTRACT
Semantic image segmentation is a major and challenging problem in computer vision that has been widely researched over decades. Recent approaches attempt to exploit contextual information at different levels to improve segmentation results. In this paper, we propose a new approach that combines semantic texton forests (STFs) and Markov random fields (MRFs) to improve segmentation. STFs allow fast computation of texton codebooks for powerful low-level image feature description. MRFs, optimized with one of the most effective message passing algorithms, smooth out the segmentation results of STFs using pairwise coherence between neighboring pixels. We evaluate the performance of the proposed method on two well-known benchmark datasets: the 21-class MSRC dataset and the VOC 2007 dataset. The experimental results show that our method markedly improves the segmentation results of STFs. In particular, our method successfully recognizes many challenging image regions that STFs fail to handle.
Keywords
Semantic image segmentation, semantic texton forests, random
forest, Markov random field, energy minimization
1 INTRODUCTION
Semantic image segmentation is the problem of partitioning an image into multiple semantically meaningful regions corresponding to different object classes or parts of an object. For example, given a photo taken in a city, the segmentation algorithm assigns to each pixel a label such as building, human, car or bike. It is one of the central problems in computer vision and image processing.
This problem has drawn the attention of researchers over decades, and a large number of works has been published [6, 7, 12, 15, 16, 30, 31]. Despite advances in feature extraction and object modeling, and the introduction of standard benchmark image datasets, semantic segmentation is still one of the most challenging problems in computer vision.

The performance of an image segmentation system mainly depends on three processes: extracting image features, learning a model of object classes, and inferring class labels for image pixels. In the first process, the challenge is to extract informative features for representing various object classes. Consequently, the second process, based on machine learning techniques, has to be robust enough to separate the object classes in the feature space. Recent research has focused on combining contextual information with local visual features to resolve regional ambiguities [6, 7, 16, 25], resorting to techniques capable of exploiting contextual information to represent object classes. In [32], the authors developed an efficient framework that explores novel texton-based features and combines appearance, shape and context of object classes in a unified model. For the second process, state-of-the-art machine learning techniques such as Bayes classifiers, SVMs, boosting and random forests are usually used to learn a classifier that assigns objects to specific classes. However, with such techniques the image pixels (or super-pixels, or image patches) are labeled independently, without regard to the interrelations between them. Therefore, in the last process, we can further improve the segmentation results by employing an efficient inference model that exploits these interrelations. Typically, random field models such as Markov random fields (MRFs) and conditional random fields (CRFs) are used for this purpose.
In [32], Shotton et al. proposed semantic texton forests (STFs), which use many local region features and build a second randomized decision forest that is a crucial component of their robust segmentation system. The use of random forests has several advantages: computational efficiency in both training and classification, probabilistic output, seamless handling of a large variety of visual features, and the inherent feature sharing of a multi-class classifier. The STFs model, which exploits a superpixel-based approach and acts on image patches, allows very fast computation of image features and learning of the model.
In this paper, we propose two schemes to embed the probabilistic outputs of STFs in a MRF model. In the first scheme, the MRF model works on the pixel-level results of STFs to smooth out the segmentation. In the second scheme, in order to reduce the
computational time, we directly apply the MRF model to the superpixel-level results of STFs. These proposed schemes, which combine a strong classifier with an appropriate contextual model for inference, are expected to form an effective framework for semantic image segmentation.
This paper is organized as follows. In Section 2 we briefly review related work on semantic image segmentation. In Section 3, we briefly revise the STF and MRF models and, in particular, a group of effective algorithms that minimize the Gibbs energy on MRFs; we then present our combining schemes for semantic image segmentation in detail. Our experiments and evaluation on real-life benchmark datasets are presented in Section 4. Section 5 concludes the paper and discusses future work.
2 RELATED WORK
Semantic image segmentation has been an active research topic in recent years. Many methods have been developed, employing techniques from various related fields over three decades. In this section, we give an overview of the semantic image segmentation methods most relevant to our work.

Beginning with [4, 5, 23, 26], the authors used a top-down approach, in which parts of the object are detected as object fragments or patches, and the detections are then used to infer the segmentation via a template. These methods focused on segmenting a single object class (e.g., a person) from the background.
Shotton et al. [31] introduced a new approach to learn a discriminative model, which exploits texture-layout filters, a novel feature type based on textons. The learned model uses shared boosting to give an efficient multi-class classifier. Moreover, the segmentation accuracy is improved by incorporating these classifiers in a simple version of a conditional random field model. This approach can handle a large dataset with up to 21 classes. Despite impressive segmentation results, its average segmentation accuracy is still low and far from satisfactory. Therefore, the works in [20, 22, 35] have focused on improving the inference model, in the hope that a better inference model will improve the segmentation accuracy.
The authors in [3, 27] investigated the application of evolutionary techniques to semantic image segmentation. They employed a version of the genetic algorithm to optimize the parameters of weak classifiers in order to build a strong classifier for learning object classes. Moreover, they exploited informative features such as location, color and HOG to improve the performance of the segmentation process. Experimental results showed that genetic algorithms can effectively find optimal parameters of weak classifiers and improve the performance. However, genetic algorithms make the learning process very complicated, and the achieved performance is not as high as expected.
In [29, 32], the authors investigated the use of random forests for semantic image segmentation. Schroff et al. [29] showed that dissimilar classifiers can be mapped onto a random forest architecture. The accuracy of image segmentation can be improved by incorporating the spatial context and discriminative learning that arise naturally in the random forest framework. Besides that, combining multiple image features leads to a further increase in performance.
In [32], Shotton et al. introduced semantic texton forests (STFs) and demonstrated their use for semantic image segmentation. A semantic texton forest is an ensemble of decision trees that works directly on image pixels; STFs do not require the expensive computation of filter banks or local descriptors. The final semantic segmentation is obtained by applying the bag of semantic textons locally with a sliding-window approach. This efficient method is extremely fast to both train and test, making it suitable for real-time applications. However, the segmentation accuracy of STFs is still low.
Markov random fields are popular models for the image segmentation problem [7, 18, 19, 33]. One of the most popular MRFs is the pairwise-interaction model, which has been used extensively because it allows efficient inference by finding its maximum a posteriori (MAP) solution. The pairwise MRF allows the incorporation of statistical relationships between pairs of random variables. The use of MRFs helps to improve the segmentation accuracy and to smooth out the segmentation results.
In this paper, we use random forests to build multi-class classifiers, with the image pixel labels inferred by MRFs. This approach is expected to improve the image segmentation accuracy of STFs.
3 OUR PROPOSED APPROACH
3.1 Semantic texton forests
Semantic texton forests (STFs) are randomized decision forests that work directly on image pixels within small image patches, for both clustering and classification [29, 32]. In this section, we briefly present the main techniques in STFs that we use in our framework. In the following, we describe the structure and the decision nodes of the decision trees (Fig. 1).
Figure 1. Decision tree. A binary decision tree with its node functions and a threshold.
For a pixel at position $t$, the node function $\phi_t$ can be described as
$$\phi_t = \sum_{r \in S} \mathbf{w}_r^{\top} \mathbf{f}_r,$$
where $r$ indexes one or two rectangles (i.e., $S = \{1\}$ or $S = \{1, 2\}$), and $\mathbf{w}_r$ describes both a filter selecting the pixels in the rectangle $R_r$ and a weighting for each dimension of the feature vector $\mathbf{f}_r$ (a concatenation of all feature channels and pixels in $R_r$; e.g., $\mathbf{f}_1 = [G_1\ G_2\ \dots\ G_n]$ if $R_1$ accumulates over the green channel $G$).
Each tree is trained using a different subset of the training data. When training a tree, there are two steps for each node:
1. Randomly generate a few decision rules.
2. Choose the one that maximally improves the ability of the tree to separate classes, i.e., the one that maximizes the expected information gain
$$\Delta E = E(I) - \frac{|I_l|}{|I|} E(I_l) - \frac{|I_r|}{|I|} E(I_r),$$
where $E(I)$ is the entropy of the classes in the set of examples $I$, $I_l$ is the subset of examples whose split function value $f(v_i)$ is less than the threshold, and $I_r$ is the subset of the remaining examples.
This process stops when the tree reaches a pre-defined depth, or when no further improvement in classification can be achieved. Random forests are composed of multiple independently learned random decision trees.
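To make this training step concrete, the following Python sketch (our illustration, not the paper's C# implementation; the array layout, helper names and candidate counts are assumptions) chooses among randomly generated decision rules by the expected information gain above.

    import numpy as np

    def entropy(labels, n_classes):
        """Shannon entropy E(I) of the class labels in a set of examples I."""
        counts = np.bincount(labels, minlength=n_classes)
        p = counts[counts > 0] / counts.sum()
        return -np.sum(p * np.log2(p))

    def best_split(features, labels, n_classes, n_candidates=500, n_thresholds=5, rng=None):
        """Pick the (feature, threshold) pair with the largest information gain.

        `features` is an (n_examples, n_features) array of precomputed split-function
        responses; in STFs these would be the raw-pixel functions of Fig. 2b.
        """
        rng = np.random.default_rng() if rng is None else rng
        parent = entropy(labels, n_classes)
        best = (None, None, -np.inf)
        for f in rng.integers(0, features.shape[1], size=n_candidates):
            values = features[:, f]
            for theta in rng.uniform(values.min(), values.max(), size=n_thresholds):
                left, right = labels[values < theta], labels[values >= theta]
                if len(left) == 0 or len(right) == 0:
                    continue
                gain = parent - (len(left) * entropy(left, n_classes)
                                 + len(right) * entropy(right, n_classes)) / len(labels)
                if gain > best[2]:
                    best = (f, theta, gain)
        return best   # (feature index, threshold, information gain)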
Figure 2. (a) Decision forest. A forest is an ensemble of decision trees; a feature vector is classified by descending each tree. This gives, for each tree, a path from root to leaf, and a class distribution at the leaf. (b) Semantic texton forest features. The split nodes in semantic texton forests use simple functions of raw image pixels within a $d \times d$ patch: either the raw value of a single pixel, or the sum, the difference, or the absolute difference of a pair of pixels (red).
The split functions in STFs act on small image patches $p$ of size $d \times d$ pixels, as illustrated in Fig. 2b. These functions can be (i) the value $p(x, y, b)$ of a single pixel at location $(x, y)$ in color channel $b$, (ii) the sum $p(x_1, y_1, b_1) + p(x_2, y_2, b_2)$, (iii) the difference $p(x_1, y_1, b_1) - p(x_2, y_2, b_2)$, or (iv) the absolute difference $|p(x_1, y_1, b_1) - p(x_2, y_2, b_2)|$ of a pair of pixels $(x_1, y_1)$ and $(x_2, y_2)$, possibly from different color channels $b_1$ and $b_2$.
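As an illustrative sketch of these four split types (the patch layout and function names are our assumptions, not code from the paper):

    import numpy as np

    def split_response(patch, kind, p1, p2=None):
        """Evaluate one STF split function on a d x d x channels image patch.

        `p1` and `p2` are (x, y, channel) index triples inside the patch; `kind`
        selects one of the four function types described above.
        """
        a = patch[p1]
        if kind == "value":
            return a
        b = patch[p2]
        if kind == "sum":
            return a + b
        if kind == "difference":
            return a - b
        if kind == "abs_difference":
            return abs(a - b)
        raise ValueError("unknown split kind: " + kind)

    # The split node then compares the response against a learned threshold.
    patch = np.random.randint(0, 256, size=(8, 8, 3))    # one d x d color patch
    go_left = split_response(patch, "sum", (1, 2, 0), (5, 6, 2)) < 170.0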
Figure 3. Semantic textons.
Some learned semantic textons are visualized in Fig. 3. The figure shows leaf nodes from one tree (distance 21 pixels). Each patch is the average of all patches in the training images assigned to a particular leaf node $l$. The evident features include color; horizontal, vertical and diagonal edges; blobs; ridges; and corners.
To textonize an image, the $d \times d$ patch centered at each pixel is passed down the STF, resulting in a set of semantic texton leaf nodes $L = (l_1, l_2, \dots, l_T)$ and the averaged class distribution $p(c \mid L)$.
For each pixel in a test image, we apply the segmentation forest, i.e., we trace a path in each tree (yellow nodes in Fig. 2a). Each leaf is associated with a histogram of classes. By averaging the histograms from all trees, we obtain a vector of probabilities (Fig. 4) of this pixel belonging to each class.
Figure 4. An example of a vector of 21 probability values corresponding to the 21 classes.
The probability vectors derived from the random forests can be used to classify pixels into classes by assigning to each pixel the most likely label. In our framework, to improve performance, we use these vectors as input to the MRF model.
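A minimal sketch of this per-pixel inference step, under the assumption that each tree stores a normalized class histogram at its leaf (all identifiers below are ours):

    import numpy as np

    def pixel_class_distribution(leaf_histograms):
        """Average the per-tree leaf histograms for one pixel.

        `leaf_histograms` has one entry per tree: the normalized class histogram
        (length m) stored at the leaf this pixel reached. Returns p(c | L).
        """
        return np.mean(np.stack(leaf_histograms), axis=0)

    # Hypothetical example with T = 5 trees and m = 21 classes.
    hists = [np.random.dirichlet(np.ones(21)) for _ in range(5)]
    p = pixel_class_distribution(hists)
    hard_label = int(np.argmax(p))          # independent per-pixel decision
    unary_potential = -np.log(p + 1e-12)    # what the MRF below uses as theta_t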
3.2 Markov random fields
In the classical pattern recognition setting, objects are classified independently. However, in the modern theory of pattern recognition the set of objects is usually treated as an array of interrelated data. The interrelations between objects of such a data array are often represented by an undirected adjacency graph $G = (\mathcal{T}, \mathcal{E})$, where $\mathcal{T}$ is the set of objects $t$ and $\mathcal{E}$ is the set of edges $(s, t)$ connecting two neighboring objects $s, t \in \mathcal{T}$. In linearly ordered arrays the adjacency graph is a chain.
Hidden Markov models have proved to be very efficient for processing data arrays with a chain-type adjacency graph, e.g., speech signals [28]. However, for arbitrary adjacency graphs with cycles, e.g., the 4-connected grid of image pixels, finding the maximum a posteriori (MAP) estimate of a MRF is an NP-hard problem. The standard way to deal with this problem is to specify the posterior distribution of the MRF by using clique potentials instead of local characteristics, and then to solve the problem in terms of Gibbs energy [14]. Hereby, finding a MAP estimate corresponds to minimizing the Gibbs energy $E$ over all cliques of the graph $G$.

Image segmentation involves assigning each pixel $t$ a label $x_t \in \{1, 2, \dots, m\}$, where $m$ is the number of classes. The interrelations between image pixels are naturally represented by a 4-connected grid that contains only two types of cliques: single cliques (i.e., individual pixels $t$) and binary cliques (i.e., graph edges $(s, t)$ connecting two neighboring pixels). The energy function $E$ is composed of a data energy and a smoothness energy:
$$E = E_{data} + E_{smooth} = \sum_{t} \theta_t(x_t) + \sum_{(s,t)} \theta_{st}(x_s, x_t). \qquad (3)$$
The data energy $E_{data}$ is simply the sum of the potentials $\theta_t(x_t)$ on single cliques, each of which measures the disagreement between a label $x_t$ and the observed data. In a MRF framework, the potential on a single clique is often specified as the negative log of the a posteriori marginal probability obtained by an independent classifier such as a Gaussian mixture model (GMM). The smoothness energy $E_{smooth}$ is the sum of the pairwise interaction potentials $\theta_{st}(x_s, x_t)$ on binary cliques $(s, t)$. These potentials are often specified using the Potts model [14]:
$$\theta_{st}(x_s, x_t) = \begin{cases} 0, & x_s = x_t; \\ \lambda_{st}, & x_s \neq x_t. \end{cases}$$
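For concreteness, the following sketch (our own illustration) evaluates this energy for a given labeling on a 4-connected grid; the single scalar Potts weight and the array shapes are assumptions.

    import numpy as np

    def gibbs_energy(labels, unary, lam):
        """Evaluate the energy of Eq. (3) on a 4-connected grid with Potts pairwise terms.

        `labels` is an (H, W) integer label image, `unary` an (H, W, m) array of
        single-clique potentials theta_t (e.g. -log of the STF probabilities),
        and `lam` the Potts penalty (a single scalar here, for simplicity).
        """
        h, w = labels.shape
        e_data = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
        # Potts smoothness: pay `lam` for every horizontally or vertically
        # neighboring pair of pixels that carries two different labels.
        e_smooth = lam * ((labels[:, 1:] != labels[:, :-1]).sum()
                          + (labels[1:, :] != labels[:-1, :]).sum())
        return e_data + e_smooth

    # Toy usage: 21 classes on a 4 x 6 pixel grid.
    unary = -np.log(np.random.dirichlet(np.ones(21), size=(4, 6)) + 1e-12)
    print(gibbs_energy(unary.argmin(axis=-1), unary, lam=2.0))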
In general, minimizing the Gibbs energy is also an NP-hard problem. Therefore, researchers have focused on approximate optimization techniques. The algorithms that were originally used, such as simulated annealing [1] or iterated conditional modes (ICM) [2], proved to be inefficient, because they either converge extremely slowly or easily get stuck in a weak local minimum.
Over the last few years, many powerful energy minimization algorithms have been proposed. The first group of energy minimization algorithms is based on max-flow and move-making methods. The most popular members of this group are graph cuts with expansion moves and graph cuts with swap moves [8, 33]. However, the drawback of graph-cut algorithms is that they can be applied only to a limited class of energy functions.

If an energy function does not belong to this class, one has to use more general algorithms. In this case, the most popular choice is the group of message passing algorithms, such as loopy belief propagation (LBP) [11], tree-reweighted message passing (TRW) [34] or sequential tree-reweighted message passing (TRWS) [19].
In general, LBP may go into an infinite loop. Moreover, even if LBP converges, it does not allow us to estimate the quality of the resulting solution, i.e., how close it is to the global minimum of the energy. The ordinary TRW algorithm in [34] formulates a lower bound on the energy function that can be used to estimate the quality of the resulting solution, and tries to solve dual optimization problems: minimizing the energy function and maximizing the lower bound. However, TRW does not always converge and does not guarantee that the lower bound always increases with time.

To the best of our knowledge, the sequential tree-reweighted message passing (TRWS) algorithm [19, 33], an improved version of TRW, is currently considered the most effective algorithm in the group of message passing algorithms. In TRWS the value of the lower bound is guaranteed not to decrease. Besides that, TRWS requires only half as much memory as other message passing algorithms such as BP, LBP and TRW.
Let $M_{st}^{k}$ be the message that pixel $s$ sends to its neighbor $t$ at iteration $k$. This message is a vector of size $m$, and it is updated as follows:
$$M_{st}^{k}(x_t) = \min_{x_s}\Big(\gamma_{st}\big(\theta_s(x_s) + \sum_{u \in N(s)} M_{us}^{k-1}(x_s)\big) - M_{ts}^{k-1}(x_s) + \theta_{st}(x_s, x_t)\Big),$$
where $N(s)$ denotes the neighbors of $s$ and $\gamma_{st}$ is a weighting coefficient.
In TRWS, we first pick an arbitrary ordering $i(t)$ of the pixels $t$. During the forward pass, pixels are processed in order of increasing $i(t)$; the messages from pixel $t$ are sent to all its forward neighbors $s$ (i.e., pixels $s$ with $i(s) > i(t)$). In the backward pass, a similar procedure of message passing is performed in the reverse order; the messages from each pixel $s$ are sent to all its backward neighbors $t$ with $i(t) < i(s)$.

Given all messages $M_{st}$, labels are assigned to the pixels in the order $i(t)$, as described in [19]. Each image pixel $t$ is assigned the label $x_t$ that minimizes
$$\theta_t(x_t) + \sum_{s:\, i(s) < i(t)} \theta_{st}(x_s, x_t) + \sum_{s:\, i(s) > i(t)} M_{st}(x_t). \qquad (5)$$
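The following simplified sketch performs one such message update for a Potts edge; it is a didactic illustration under our reading of the update rule above, not the TRWS implementation used in the experiments.

    import numpy as np

    def update_message(theta_s, msgs_into_s, msg_ts, lam, gamma_st):
        """One message update M_st(x_t) for a Potts edge, following the formula above.

        `theta_s` is the length-m unary vector of pixel s, `msgs_into_s` the list of
        current messages M_us from all neighbors u of s (including t), `msg_ts` the
        reverse message from t to s, `lam` the Potts weight and `gamma_st` the edge
        weighting coefficient. A real TRWS implementation additionally follows the
        forward/backward pixel ordering and reuses partial sums for efficiency.
        """
        a = gamma_st * (theta_s + np.sum(msgs_into_s, axis=0)) - msg_ts  # function of x_s
        # With Potts potentials the minimum over x_s has a closed form: either keep
        # the same label (cost a[x_t]) or switch labels (cost min(a) + lam).
        m = np.minimum(a, a.min() + lam)
        return m - m.min()   # normalization for numerical stability

    # Toy usage with m = 3 labels and two neighbors of s.
    theta_s = np.array([0.2, 1.5, 0.9])
    msgs = [np.zeros(3), np.zeros(3)]
    print(update_message(theta_s, msgs, np.zeros(3), lam=1.0, gamma_st=0.5))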
3.3 Combining STF outputs using MRFs
STFs have been shown to be extremely fast in computing features for image representation, as well as in learning and testing the model. However, the quality of the segmentation results obtained by STFs is not very high and still far from expectation. In this paper, we propose a new method to improve the results of STFs using MRFs.

The result of STFs is a three-dimensional matrix of probabilities that indicate how likely an image pixel is to belong to a certain class. This result can be treated as a "noisy" labeling and can be denoised by embedding it in a MRF model. The negative log of the probabilities obtained by STFs is used to specify the potentials on single cliques in the MRF model, i.e., the data energy term in Eq. (3).
STFs exploit a superpixel-based approach that acts on small image patches $p$ of size $d \times d$. All pixels that lie in the same patch are constrained to have the same class distribution. The superpixel-level result $S_{sp}$ obtained by STFs is an array of size $\lfloor h/d \rfloor \times \lfloor w/d \rfloor$, where $\lfloor \cdot \rfloor$ is the floor function, and $h$, $w$ are the height and width of the original image, respectively. Each superpixel of $S_{sp}$, representing a patch of size $d \times d$, has a class distribution, which is a vector of size $m$.

In order to generate the pixel-level result $S_p$ of size $h \times w$ from the superpixel-level result $S_{sp}$, we just need to assign to each pixel $(i, j)$ in $S_p$ the class distribution of the superpixel $(\lfloor i/d \rfloor, \lfloor j/d \rfloor)$ in $S_{sp}$. This operation can be formally expressed as follows:
$$S_p(i, j) = S_{sp}(\lfloor i/d \rfloor, \lfloor j/d \rfloor). \qquad (6)$$
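A small sketch of Eq. (6) in NumPy (the array shapes are our assumptions; the border clamp handles images whose size is not a multiple of $d$):

    import numpy as np

    def superpixel_to_pixel(s_sp, d, h, w):
        """Eq. (6): S_p(i, j) = S_sp(floor(i / d), floor(j / d)).

        `s_sp` has shape (floor(h/d), floor(w/d), m); the result has shape (h, w, m).
        """
        rows = np.minimum(np.arange(h) // d, s_sp.shape[0] - 1)   # clamp at the border
        cols = np.minimum(np.arange(w) // d, s_sp.shape[1] - 1)
        return s_sp[rows[:, None], cols[None, :]]

    # Example: upsample a 30 x 40 grid of 21-class distributions to 240 x 320 pixels.
    s_sp = np.random.dirichlet(np.ones(21), size=(30, 40))
    s_p = superpixel_to_pixel(s_sp, d=8, h=240, w=320)   # shape (240, 320, 21)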
Below, we describe the two schemes for embedding the outputs of STFs in a MRF model. In the first scheme, the MRF model is applied directly to the results of STFs at the pixel level. In the second scheme, the results of STFs at the superpixel level are improved using the MRF model.
The first scheme, which combines at the pixel level, is described as follows:
1. Apply STFs to obtain the superpixel-level result $S_{sp}$.
2. Generate the pixel-level result $S_1$ from $S_{sp}$ using Eq. (6).
3. Apply the TRWS algorithm described in Section 3.2 to $S_1$ to get the improved result $S_2$.
4. Perform pixel-labeling on $S_2$ using Eq. (5) to get $S_p$.
5. Return the segmentation result $S_p$.
The second scheme, which combines at the superpixel level, is described as follows:
1. Apply STFs to obtain the superpixel-level result $S_{sp}$.
2. Apply the TRWS algorithm described in Section 3.2 to $S_{sp}$ to get the improved result $S_{sp}^{1}$.
3. Generate the pixel-level result $S_1$ from $S_{sp}^{1}$ using Eq. (6).
4. Perform pixel-labeling on $S_1$ using Eq. (5) to get $S_p$.
5. Return the segmentation result $S_p$.
In these schemes we use the TRWS algorithm described in the previous section to minimize the energy of the MRF model. The reason is that, according to all criteria including the quality of the solution, the computational time and the memory usage, TRWS is almost always the winner among general energy minimization algorithms [17, 33]. Compared to the first scheme, the second one is an accelerated version because it reduces the number of variables in the model. Since TRWS has linear computational complexity, the second scheme performs approximately $d^2$ times faster than the first one.
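The wiring of the second scheme can be sketched as follows (our illustration); the STF probabilities and the TRWS solver are supplied by the caller, and step 4 is approximated here by an argmin over the solver's output rather than the exact rule of Eq. (5).

    import numpy as np

    def scheme2(s_sp_probs, d, h, w, trws_solver):
        """Sketch of the second (superpixel-level) combination scheme.

        `s_sp_probs` is the STF superpixel-level class-probability array of shape
        (H', W', m); `trws_solver` maps an (H', W', m) array of unary potentials
        to smoothed potentials of the same shape. This function only wires the
        steps together.
        """
        unary = -np.log(s_sp_probs + 1e-12)          # STF output as the MRF data term
        smoothed = trws_solver(unary)                # step 2: TRWS on the small grid
        rows = np.minimum(np.arange(h) // d, smoothed.shape[0] - 1)
        cols = np.minimum(np.arange(w) // d, smoothed.shape[1] - 1)
        pixel_level = smoothed[rows[:, None], cols[None, :]]   # step 3: Eq. (6)
        return np.argmin(pixel_level, axis=-1)       # step 4 (approximated)

    # Toy usage with an identity "solver", just to exercise the plumbing.
    probs = np.random.dirichlet(np.ones(21), size=(30, 40))   # 30 x 40 superpixels
    labels = scheme2(probs, d=8, h=240, w=320, trws_solver=lambda u: u)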
4 EXPERIMENTS AND EVALUATION
4.1 Datasets
We conducted experiments on two well-known benchmark datasets for image segmentation: the MSRC dataset [31] and the challenging VOC 2007 segmentation dataset [9].
The 21-class MSRC dataset consists of 591 images (at a resolution of 320x240) of the following 21 classes of objects: building, grass, tree, cow, sheep, sky, aeroplane, water, face, car, bike, flower, sign, bird, book, chair, road, cat, dog, body, boat. They can be divided into 5 groups: environment (grass, sky, water, road), animals (cow, sheep, bird, cat, dog), plants (tree, flower), items (building, aeroplane, car, bicycle, sign, book, chair, boat) and people (face, body). Each image comes with a pre-labeled (ground-truth) image with a color index, in which each color corresponds to an object class. Note that the ground-truth images contain some pixels labeled as "void" (black). These "void" pixels do not belong to any of the above listed classes and are ignored during training and testing.
The VOC 2007 segmentation dataset consists of 422 images containing a total of 1215 objects, collected from the flickr photo-sharing website. The images are manually segmented with respect to the following 20 classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, TV/monitor. The pixels that do not belong to any of the above classes are classified as background pixels, which are colored black in the pre-labeled (ground-truth) images. In contrast to the MSRC dataset, the black pixels are still used for training and testing as the data of an additional class "background". Besides that, the pixels colored white are treated as a "void" class and are ignored during training and testing.
Experiment setting
In the experiments, the system was run 20 times for each split of the MSRC dataset into training, validation and testing data. All the programs were run on a machine with a Core i7-4770 CPU at 3.40 GHz (8 logical CPUs), 32 GB DDR3 RAM at 1333 MHz and Windows 7 Ultimate, and were implemented in C#.

For these experiments, the data is split into roughly 45% for training, 10% for validation and 45% for testing. The split ensures an approximately proportional contribution of each class. For the STF experiments, we test a variety of different parameter settings (see Table 1).
Table 1. Parameters of semantic texton forests in the tests on the MSRC dataset.

                  Test 1   Test 2   Test 3   Test 4
Maximum depth
Threshold tests
Data per tree     0.5      0.5      0.5      0.5
Patch size        8x8      8x8      4x4      2x2
We found that the following STF parameters give the best performance for our system: distance 21, $T = 5$ trees, maximum depth $D = 15$, 500 feature tests and 5 threshold tests per split, 0.5 of the data per tree, and a patch size of 2x2 pixels.
4.2 Evaluation
We compare the overall accuracy of segmentation. We use two measures to evaluate the segmentation results on the MSRC dataset, as in [3, 29, 31, 32], and one measure for the VOC 2007 segmentation dataset, as in [9]. The global accuracy on the MSRC dataset is the percentage of image pixels correctly assigned to their class label out of the total number of image pixels, calculated as follows:
$$global = \frac{\sum_{i} N_{ii}}{\sum_{i,j} N_{ij}}.$$
The average accuracy over all classes on the MSRC dataset is calculated as:
$$average = \frac{1}{m} \sum_{i} \frac{N_{ii}}{\sum_{j} N_{ij}},$$
where $\{1, 2, \dots, m\}$, $m = 21$, is the label set of the 21-class MSRC dataset, and $N_{ij}$ is the number of pixels of label $i$ that are assigned to label $j$.
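Both measures follow directly from the confusion matrix $N$; a small sketch (our own, with a hypothetical 3-class example):

    import numpy as np

    def msrc_accuracies(confusion):
        """Compute the two MSRC measures from an m x m confusion matrix N,
        where N[i, j] counts pixels of true label i predicted as label j."""
        n = np.asarray(confusion, dtype=float)
        global_acc = np.trace(n) / n.sum()
        per_class = np.diag(n) / n.sum(axis=1)   # per-class recall
        return global_acc, per_class.mean()

    # Hypothetical 3-class example.
    conf = np.array([[50, 5, 5],
                     [2, 30, 8],
                     [3, 7, 40]])
    g, a = msrc_accuracies(conf)   # global and class-averaged accuracy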
Table 2. Pixel-wise accuracy (across all folds) for each class (rows) on the MSRC dataset, row-normalized to sum to 100%. Row labels indicate the true class and column labels the predicted class.
building grass tree cow sheep sky aeroplane water face car bike flower sign bird book chair road cat dog body boat
building 39.1 1.5 6.2 0.4 1.1 7.5 10.0 5.4 3.3 2.6 1.4 0.0 3.2 2.6 0.0 0.2 12.5 0.4 0.3 0.9 1.5
grass 0.0 93.7 1.7 1.4 0.9 0.0 0.7 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0.0 0.4 0.3 0.0 0.1 0.4 0.0
tree 3.2 16.6 66.5 0.2 0.3 5.2 3.9 1.3 0.2 0.1 0.1 0.0 0.1 0.6 0.1 0.0 0.2 0.0 0.1 0.7 0.6
cow 1.9 7.5 0.7 74.4 7.8 0.8 0.8 0.6 0.0 0.0 0.0 0.1 0.0 0.1 3.7 0.0 0.0 0.0 1.2 0.3 0.0
sheep 0.0 4.7 0.0 0.6 90.0 1.1 2.8 0.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.4 0.0 0.0 0.0 0.0
sky 0.4 0.2 1.8 0.0 0.0 93.7 0.6 2.7 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.3
aero plane 0.6 1.2 0.9 0.0 0.0 5.3 85.8 1.7 0.0 0.0 0.0 0.0 1.5 0.1 0.0 0.0 2.9 0.0 0.1 0.0 0.0
water 2.0 0.5 4.5 0.0 0.0 16.3 0.0 57.9 0.1 0.0 0.9 0.0 2.1 0.6 0.0 0.2 6.9 2.2 0.9 0.7 4.2
face 0.6 0.0 0.8 0.1 0.1 0.3 0.0 0.0 94.1 0.0 0.0 0.0 0.5 0.0 0.1 0.0 0.0 0.2 0.0 3.2 0.0
car 6.7 0.1 2.8 0.0 0.0 0.0 0.0 6.1 0.1 64.1 8.0 0.0 1.9 1.4 0.0 0.2 6.1 1.4 0.0 0.2 0.9
bike 8.6 0.8 1.9 0.0 0.0 0.0 0.1 0.7 0.1 0.1 72.8 0.0 0.6 0.0 0.0 0.6 13.2 0.5 0.0 0.0 0.1
flower 0.2 14.2 5.4 8.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 61.1 0.0 3.7 5.7 0.6 0.1 0.0 0.0 0.9 0.0
sign 7.7 1.2 7.0 0.0 0.0 0.7 0.0 6.3 0.0 0.3 0.4 1.0 64.3 3.0 3.5 0.1 2.3 0.9 0.1 0.0 1.0
bird 1.8 6.4 7.7 2.7 6.6 3.2 4.3 1.2 0.0 5.0 4.3 0.4 3.5 33.8 0.0 0.2 14.4 2.2 1.9 0.0 0.5
book 2.9 0.0 0.2 0.0 0.0 0.0 0.0 0.1 1.2 0.2 0.1 1.4 3.6 0.0 87.3 0.0 0.4 0.3 0.5 1.7 0.0
chair 4.2 5.3 12.9 7.2 0.0 0.0 0.0 1.0 1.6 2.0 1.6 0.3 0.5 0.4 0.1 46.4 6.5 3.7 5.2 0.4 0.8
road 1.9 1.0 1.3 0.0 0.0 1.5 0.5 3.9 0.5 1.2 0.5 0.0 0.5 0.0 0.0 0.2 83.4 1.5 0.6 1.3 0.0
cat 3.6 0.1 0.0 4.9 0.0 0.0 0.0 0.2 0.1 0.0 0.8 1.5 4.0 0.1 18.2 0.0 4.6 58.5 2.6 0.4 0.2
dog 3.8 2.2 1.1 0.5 0.1 0.6 0.1 0.5 9.9 0.0 0.0 0.0 0.0 0.7 0.0 0.4 7.4 4.3 62.2 6.1 0.1
body 3.9 1.8 1.3 1.3 0.6 0.3 0.0 0.7 9.8 1.8 0.1 0.4 2.5 0.0 0.6 0.0 4.6 0.1 0.2 69.2 0.6
boat 7.3 0.0 4.6 0.0 0.0 4.8 0.0 14.6 0.2 0.6 0.0 0.0 1.9 0.5 0.0 0.0 1.0 0.0 0.2 0.7 63.7
Table 3. Segmentation accuracies (percent) over the whole MSRC dataset: Joint Boost, STFs and our schemes.

building grass tree cow sheep sky aeroplane water face car bike flower sign bird book chair road cat dog body boat Global Average
Joint Boost [31] 62 98 86 58 50 83 60 53 74 63 75 63 35 19 92 15 86 54 19 62 7 71 58
STFs 37.9 93.0 65.5 75.0 89.8 93.1 85.3 57.5 93.3 61.3 71.1 60.8 63.0 33.9 85.4 46.0 81.9 57.6 62.5 68.4 64.2 72.4 68.9
Our scheme 1 39.1 93.6 66.0 74.5 89.8 93.6 85.8 57.8 93.9 63.8 72.5 61.1 64.1 33.8 86.8 46.2 83.2 58.0 62.2 69.3 63.4 73.2 69.5
Our scheme 2 39.1 93.7 66.5 74.4 90.0 93.7 85.8 57.9 94.1 64.1 72.8 61.1 64.3 33.8 87.3 46.4 83.4 58.5 62.2 69.2 63.7 73.4 69.6
Figure 5. MSRC segmentation results. Segmentations of test images using semantic texton forests (STFs) and our schemes.
For the VOC 2007 segmentation dataset, we assess the segmentation performance using a per-class measure based on the intersection of the inferred segmentation and the ground truth, divided by their union, as in [9]:
$$accuracy_i = \frac{N_{ii}}{\sum_{j} N_{ij} + \sum_{j} N_{ji} - N_{ii}},$$
where $\{1, 2, \dots, m\}$, $m = 21$, is the label set of the VOC 2007 segmentation dataset (20 object classes plus background), and $N_{ij}$ is the number of pixels of label $i$ that are assigned to label $j$. Note that pixels marked "void" in the ground truth are excluded from this measure.
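The corresponding computation from a confusion matrix, as a brief sketch (ours, not code from [9]):

    import numpy as np

    def voc_iou(confusion):
        """Per-class intersection over union from an m x m confusion matrix N
        (true labels in rows, predictions in columns), matching the formula above."""
        n = np.asarray(confusion, dtype=float)
        tp = np.diag(n)
        union = n.sum(axis=1) + n.sum(axis=0) - tp   # sum_j N_ij + sum_j N_ji - N_ii
        return tp / union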
The performance of our system in terms of segmentation accuracy on the 21-class MSRC dataset is shown in Table 2. The overall classification accuracy is 73.4%. From Table 2, we can see that the best accuracies are obtained for the classes that have many training samples, e.g., grass, sky, book and road, while the lowest accuracies are obtained for classes with fewer training samples, such as boat, chair, bird and dog.

For the MSRC dataset we also compare with some recently proposed systems, including Joint Boost [31] and STFs [32]. The segmentation accuracy for each class is shown in Table 3. Fig. 5 shows some test images and the segmentation results of our schemes. We can see that our schemes substantially improve the quality of the segmentation by smoothing out the results of STFs. In particular, our schemes successfully remove many small regions that STFs failed to recognize.
For the challenging VOC 2007 segmentation dataset we compare our schemes with some other well-known methods, such as TKK [10] and CRF+N=2 [13]. Table 4 shows the segmentation accuracy for each class. We can see that our schemes outperform all other methods and give an impressive improvement in comparison with STFs. For many classes our schemes achieve the most accurate results. Furthermore, it should be emphasized that our second scheme is better than the first one while performing approximately $d^2$ times faster, where $d \times d$ is the patch size. Segmentation results for some test images from the VOC 2007 dataset are shown in Fig. 6. Our combining schemes successfully remove many small misclassified regions and improve the quality of the segmentation.
5 CONCLUSION
This paper has presented a new approach for improving the image segmentation accuracy of STFs using MRFs. We embedded the segmentation results of STFs in a MRF model in order to smooth them out using pairwise coherence between image pixels. Specifically, we proposed two schemes for combining STFs and MRFs; in both, the TRWS algorithm is used to minimize the energy of the MRF model. The experimental results on benchmark datasets demonstrated the effectiveness of the proposed approach, which substantially improves the quality of the segmentation obtained by STFs. In particular, on the very challenging VOC 2007 dataset our approach gives very impressive results and outperforms many other well-known segmentation methods.

In the future, we will conduct more research on random forests to make them more suitable for the semantic segmentation problem. We also plan to employ more effective inference models such as CRFs in the framework to further improve the segmentation accuracy.
6 REFERENCES
[1] S. Barnard. Stochastic stereo matching over scale. Int'l J. Computer Vision, 3(1):17-32, 1989.
[2] J. Besag. On the statistical analysis of dirty pictures (with discussion). J. Royal Statistical Soc., Series B, 48(3):259-302, 1986.
[3] H. T. Binh, M. D. Loi, and T. T. Nguyen. Improving image segmentation using genetic algorithm. Machine Learning.
[4] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In Proc. ECCV, pages 109-124, 2002.
[5] E. Borenstein and S. Ullman. Learning to segment. In Proc. ECCV, 2004.
[6] E. Borenstein, E. Sharon, and S. Ullman. Combining top-down and bottom-up segmentation. In Proc. CVPRW, 2004.
[7] Y. Y. Boykov and M. P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In Proc. ICCV, volume 2, pages 105-112, 2001.
[8] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE PAMI, 23(11):1222-1239, 2001.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL VOC Challenge 2007. http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html
[11] P. Felzenszwalb and D. Huttenlocher. Efficient belief propagation for early vision. Int'l J. Computer Vision, 70(1):41-54, 2006.
[12] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. IEEE CVPR, volume 2, pages 264-271, June 2003.
[13] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In Proc. IEEE ICCV, pages 670-677, 2009.
[14] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE PAMI, 6:721-741, 1984.
[15] S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection. NIPS, 2009.
[16] X. He, R. S. Zemel, and M. A. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In Proc. IEEE CVPR, volume 2, pages II-695-II-702, 2004.
[17] J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, J. Lellmann, and N. Komodakis. A comparative study of modern inference techniques for discrete energy minimization problems. In Proc. IEEE CVPR, 2013.
Figure 6. VOC 2007 segmentation results. Test images with ground truth and our inferred segmentations.
Table 4. Segmentation accuracies (percent) over the whole VOC 2007 dataset.

background aeroplane bicycle bird boat bottle bus car cat chair cow table dog horse motorbike person plant sheep sofa train tv/monitor Average
Brookes [10] 77.7 5.5 0 0.4 0.4 0 8.6 5.2 9.6 1.4 1.7 10.6 0.3 5.9 6.1 28.8 2.3 2.3 0.3 10.6 0.7 8.5
MPI_ESSOL [10] 2.6 29.7 30.8 9.5 41.4 6.7 8 72.9 55.7 37.1 11.1 19.4 2.2 14.9 23.8 66.8 25.9 8.6 3.2 58.1 55.1 27.8
INRIA_PlusClass [10] 2.9 0.6 44.8 34.4 16.4 19.9 0.4 68 58.1 10.5 0.4 43.5 7.7 0.9 1.7 59.2 37.2 0 5.5 19 63.2 23.5
TKK [10] 22.9 18.8 20.7 5.2 16.1 3.1 1.2 78.3 1.1 2.5 0.8 23.4 69.4 44.4 42.1 0 64.7 30.2 34.6 89.3 70.6 30.4
CRF+N=2 [13] 56 26 29 19 16 3 42 44 56 23 6 11 62 16 68 46 16 10 21 52 40 32
STFs 68.4 42.9 28.1 54.6 34.8 44.8 64.4 47.8 59.4 30.8 43.5 46.3 38.4 48.6 54.8 47.1 27.6 51.6 46.8 67.6 44.3 46.2
Our scheme 1 74.2 45.2 33.6 61.3 37.9 52.6 68.3 53.7 68.0 41.3 48.0 51.5 43.2 53.4 58.8 52.2 34.4 60.0 54.7 72.7 52.0 52.1
Our scheme 2 76.2 46.0 34.5 65.4 38.9 54.4 70.0 56.0 71.5 43.7 48.8 52.6 44.5 55.3 59.4 53.8 37.3 62.6 56.1 74.4 55.5 54.0
[18] Z. Kato and T. C. Pong. A Markov random field image segmentation model for color textured images. Image and Vision Computing, 24(10):1103-1114, 2006.
[19] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE PAMI, 28(10):1568-1583, 2006.
[20] P. Krahenbuhl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. NIPS, 2011.
[21] M. P. Kumar, P. H. S. Torr, and A. Zisserman. OBJ CUT. In Proc. IEEE CVPR, San Diego, volume 1, pages 18-25, 2005.
[22] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In Proc. ICCV, 2009.
[23] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In ECCV Workshop, May 2004.
[24] S. Z. Li. Markov Random Field Modeling in Image Analysis. Springer-Verlag, London, 2009.
[25] J. Malik, S. Belongie, T. Leung, and J. Shi. Contour and texture analysis for image segmentation. IJCV, 43(1):7-27, June 2001.
[26] A. Opelt, A. Pinz, and A. Zisserman. A boundary-fragment-model for object detection. In Proc. ECCV, Graz, Austria, 2006.
[27] N. T. Quang, H. T. Binh, and T. T. Nguyen. Genetic algorithm in boosting for object class image segmentation. SoCPAR, 2013.
[28] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257-286, 1989.
[29] F. Schroff, A. Criminisi, and A. Zisserman. Object class segmentation using random forests. BMVC, 2008.
[30] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8):888-905, 2000.
[31] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proc. ECCV, pages 1-15, 2006.
[32] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In Proc. IEEE CVPR, 2008.
[33] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE PAMI, 30(6):1068-1080, 2008.
[34] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches. IEEE Transactions on Information Theory, 51(11):3697-3717, 2005.
[35] S. Wu, J. Geng, and F. Zhu. Theme-based multi-class object recognition and segmentation. In Proc. ICPR, Istanbul, Turkey, pages 1-4, August 2010.