A new interactive semi-supervised clustering model for large image database indexing
Hien Phuong Laia,b,c,⇑, Muriel Visania, Alain Bouchera,b,c, Jean-Marc Ogiera
a L3I, Université de La Rochelle, Avenue M. Crépeau, 17042 La Rochelle cedex 1, France
b IFI, Equipe MSI; IRD, UMI 209 UMMISCO, Institut de la Francophonie pour l'Informatique, 42 Ta Quang Buu, Hanoi, Vietnam
c Vietnam National University, Hanoi, Vietnam
Article info
Article history:
Available online 27 June 2013
Keywords:
Semi-supervised clustering
Interactive learning
Image indexing
Abstract
Indexing methods play a very important role in finding information in large image databases. They organize the indexed images in order to facilitate, accelerate and improve the results of later retrieval. Alternatively, clustering may be used for structuring the feature space so as to organize the dataset into groups of similar objects, without prior knowledge (unsupervised clustering) or with a limited amount of prior knowledge (semi-supervised clustering).
In this paper, we introduce a new interactive semi-supervised clustering model where prior information is integrated via pairwise constraints between images. The proposed method allows users to provide feedback in order to improve the clustering results according to their wishes. Different strategies for deducing pairwise constraints from user feedback are investigated. Our experiments on different image databases (Wang, PascalVoc2006, Caltech101) show that the proposed method outperforms the semi-supervised HMRF-kmeans (Basu et al., 2004).
© 2013 Elsevier B.V. All rights reserved.
1. Introduction
Content-Based Image Retrieval (CBIR) refers to the process of using visual information (usually encoded using color, shape and texture feature vectors, etc.) to search for images in a database that correspond to the user's query. Traditional CBIR systems generally rely on two phases. The first phase extracts the feature vectors from all the images in the database and organizes them into an efficient index data structure. The second phase efficiently searches the indexed feature space to find the images most similar to the query image.
With the development of many large image databases, an exhaustive search is generally intractable. Feature space structuring methods (usually called indexing methods) are therefore necessary for facilitating and accelerating further retrieval. They can be classified into space partitioning methods and data partitioning methods.

Space partitioning methods divide the feature space into cells (sometimes referred to as "buckets") of fairly similar cardinality (in terms of number of images per cell), without taking into account the distribution of the images in the feature space. Therefore, dissimilar points may be included in a same cell while similar points may end up in different cells. The resulting index is therefore not optimal for retrieval, as the user generally wants to retrieve images similar to the query image. Moreover, these methods are not designed to handle high-dimensional data, while image feature vectors commonly count hundreds of elements.
Data partitioning methods, in contrast, take into account information about the image distribution in the feature space. However, the limitations on the cardinality of the space cells remain, causing the resulting index to be non-optimal for retrieval, especially in the case where groups of similar objects are unbalanced, i.e. composed of different numbers of images.
Our claim is that using clustering instead of traditional indexing to organize feature vectors results in indexes better adapted to high-dimensional and unbalanced data. Indeed, clustering aims to split a collection of data into groups (clusters) so that similar objects belong to the same group and dissimilar objects are in different groups, with no constraints on the cluster size. This makes the resulting index better optimized for retrieval. In fact, while in traditional indexing methods it might be difficult to fix the number of objects in each bucket (especially in the case of unbalanced data),
⇑Corresponding author at: L3I, Université de La Rochelle, Avenue M Crépeau,
17042 La Rochelle cedex 1, France Tel.: +33 6 46 51 12 32; fax: +33 5 46 45 82 42.
E-mail addresses: hien_phuong.lai@univ-lr.fr (H.P Lai), muriel.visani@univ-lr.fr
(M Visani), alainboucher12@gmail.com (A Boucher), jean-marc.ogier@univ-lr.fr
(J.-M Ogier).
Pattern Recognition Letters
clustering methods have no limitation on the cardinality of the clusters; objects can be grouped into clusters of very different sizes. Moreover, using clustering might simplify the relevance feedback task, as the user might interact with a small number of cluster prototypes rather than numerous single images.
Because feature vectors only capture low-level information such as color, shape or texture, there is a semantic gap between the high-level semantic concepts expressed by the user and these low-level features. The clustering results are therefore generally different from the intent of the user. Our work aims to involve users in the clustering phase so that they can interact with the system in order to improve the clustering results. The clustering methods should therefore produce a hierarchical cluster structure where the initial clusters may be easily merged or split. We are also interested in clustering methods which can be incrementally built, in order to facilitate the insertion or deletion of new images by the user. It can be noted that incrementality is also very important in the context of huge image databases, when the whole dataset cannot be stored in the main memory. Another very important point is the computational complexity of the clustering algorithm, especially in an interactive online context where the user is involved.
In the case of large image database indexing, we may use unsupervised clustering (… et al., 2005) or semi-supervised clustering (Basu et al., 2002, 2004; Dubey et al., 2010; Wagstaff et al., 2001). While no information about ground truth is provided in the case of unsupervised clustering, a limited amount of knowledge is available in the case of semi-supervised clustering. The provided knowledge may consist of class labels (for some objects) or pairwise constraints (must-link or cannot-link) between objects.
In Lai et al. (2012a), we proposed a survey of unsupervised clustering techniques and analyzed the advantages and disadvantages of different methods in a context of huge masses of data where incrementality and hierarchical structuring are needed. We also experimentally compared different methods (… et al., 2003; AHC (Lance and Williams, 1967); R-tree (Guttman, …); BIRCH (Zhang et al., 1996)) with different real image databases of increasing sizes (Wang, PascalVoc2006, Caltech101, Corel30k; the number of images ranges from 1000 to 30,000) to study their scalability. In Lai et al. (2012b), we presented an overview of semi-supervised clustering methods and proposed a preliminary experiment of an interactive semi-supervised clustering model using the HMRF-kmeans method on the Wang image database, in order to analyze the improvement in the clustering process when user feedback is provided.
There are three main parts to this paper. Firstly, we propose a new interactive semi-supervised clustering model using pairwise constraints. Secondly, we investigate different methods for deducing pairwise constraints from user feedback. Thirdly, we experimentally compare our proposed semi-supervised method with the widely known semi-supervised HMRF-kmeans method.

This paper is structured as follows. A short review of semi-supervised clustering methods is given in Section 2. Section 3 presents the proposed interactive semi-supervised clustering model. Section 4 reports our experiments.
2. A short review of semi-supervised clustering methods
For unsupervised clustering, only similarity information is used to organize objects; in the case of semi-supervised clustering, a small amount of prior knowledge is available. Prior knowledge comes either in the form of class labels (for some objects) or pairwise constraints between objects. Pairwise constraints specify whether two objects should be in the same cluster (must-link) or in different clusters (cannot-link). As the clusters produced by unsupervised clustering may not be the ones required by the user, this prior knowledge is needed to guide the clustering process towards clusters which are closer to the user's wishes. For instance, when clustering a database with thousands of animal images, a user may want to cluster by animal species or by background landscape type. An unsupervised clustering method may give, as a result, a cluster containing images of elephants with a grass background together with images of horses with a grass background, and another cluster containing images of elephants with a sand background. These results are ideal when the user wants to cluster by background landscape type, but they are poor when the user wants to cluster by animal species. In this case, must-link constraints between images of elephants with a grass background and images of elephants with a sand background, and cannot-link constraints between images of elephants with a grass background and images of horses with a grass background, are needed to guide the clustering process. The objective of our work is to make the user interact with the system so as to define these constraints easily, with only a few clicks. Note that the available knowledge is too poor to be used with supervised learning, as only a very limited ratio of the available images is considered by the user at each step. In general, semi-supervised clustering methods are used to maximize intra-cluster similarity, to minimize inter-cluster similarity and to keep a high consistency between the partitioning and the domain knowledge.

Semi-supervised clustering has been developed over the last decade and a number of methods have been published to date. They can be divided into semi-supervised clustering with labels, where partial information about object labels is given, and semi-supervised clustering with constraints, where a small amount of pairwise constraints between objects is given.
Among the semi-supervised clustering methods using labeled objects, seeded-kmeans and constrained-kmeans (Basu et al., 2002) are based on the k-means algorithm. Prior knowledge for these two methods is a small subset of the input database, called the seed set, containing user-specified labeled objects of k different clusters. Unlike the k-means algorithm, which randomly selects the initial cluster prototypes, these two methods use the labeled objects to initialize the cluster prototypes. Then the re-assignment of each object in the dataset to the nearest prototype and the re-computation of the prototypes from their assigned objects are repeated until convergence. Seeded-kmeans assigns objects to the nearest prototype without considering the prior labels of the objects in the seed set. In contrast, constrained-kmeans keeps the labeled examples in their initial clusters and assigns the other objects to the nearest prototype. An interactive clustering model was proposed by Dubey et al. (2010) for document analysis. In this model, knowledge is progressively provided as assignment feedback and cluster description feedback after each interactive iteration. Using assignment feedback, the user moves an object from one cluster to another cluster. Using cluster description feedback, the user modifies the feature vector of a current cluster (e.g. increases the weighting of some important words). The algorithm learns from all the feedback to re-cluster the dataset, minimizing the average distance between points and their cluster centers while minimizing the violation of the constraints corresponding to the feedback.

Among the semi-supervised clustering methods using pairwise constraints between objects, we can cite COP-kmeans (Wagstaff et al., 2001), HMRF-kmeans (Basu et al., 2004) and semi-supervised kernel-kmeans.
The input of these methods is a data set X, a set of must-link constraints M and a set of cannot-link constraints C. In COP-kmeans, points are assigned to the closest cluster whose assignment violates no constraint; further clusters are tried until a suitable cluster is found. The clustering fails if no solution respecting the constraints is found. While constraint violation is strictly prohibited in COP-kmeans, it is allowed with a violation cost (penalty) in HMRF-kmeans and in semi-supervised kernel-kmeans. The objective function to be minimized in semi-supervised HMRF-kmeans is as follows:
J_HMRF-kmeans = Σ_{x_i ∈ X} D(x_i, μ_{l_i}) + Σ_{(x_i,x_j) ∈ M, l_i ≠ l_j} w_ij + Σ_{(x_i,x_j) ∈ C, l_i = l_j} w̄_ij    (1)

where l_i is the cluster label of x_i, μ_{l_i} is its cluster center and D is the distortion measure.
The penalty w_ij (respectively w̄_ij) for violating a must-link (respectively cannot-link) constraint may be either a constant or a function of the distance between the two points specified in the pairwise constraint, as follows:

w_ij = w · D(x_i, x_j)    (2)
w̄_ij = w̄ · (D_max − D(x_i, x_j))    (3)

where w and w̄ are constants specifying the cost for violating a must-link and a cannot-link constraint, and D_max is the maximum distance between two points in the data set. We can see that, to ensure the most difficult constraints are respected, higher penalties are assigned to violations of must-link constraints between points which are distant and to violations of cannot-link constraints between points which are close. Using D_max makes the cannot-link penalty term sensitive to extreme outliers, but all cannot-link constraints are treated in the same way, so even in the presence of extreme outliers there would be no cannot-link constraint with a negative violation cost. The term D_max is also sensitive to outliers; we can reduce this sensitivity by using the maximum distance between two clusters instead.

HMRF-kmeans first initializes the k cluster centers based on the user-specified constraints, then an iterative relocation approach similar to k-means is applied to minimize the objective function. The iterative algorithm repeats an assignment phase, in which each point is assigned to the cluster which minimizes its contribution to the objective function, and a re-estimation phase, in which the cluster centers are recomputed to minimize the objective function. Semi-supervised kernel-kmeans minimizes its objective function in a transformed space instead of the original space, using a kernel function mapping φ, as follows:
J_SS-kernel-kmeans = Σ_{x_i ∈ X} ‖φ(x_i) − μ_{l_i}‖² − Σ_{(x_i,x_j) ∈ M, l_i = l_j} w_ij + Σ_{(x_i,x_j) ∈ C, l_i = l_j} w̄_ij    (4)
Unlike Eq. (1), this formulation gives a reward for must-link constraint satisfaction when the two points are in the same cluster, by subtracting the corresponding penalty term from the objective function.
3. Proposed interactive semi-supervised clustering model
In this section, we present our proposed interactive semi-supervised clustering model. In our model, the initial clustering is carried out without any prior knowledge, using an unsupervised clustering method. In Lai et al. (2012a), we studied the suitability of different unsupervised clustering methods for our applied context (involving user interactivity) and experimentally compared different unsupervised clustering and indexing methods (AHC, R-tree, BIRCH, etc.); BIRCH appeared the most suitable for our context. BIRCH is less sensitive to variations in its parameters. Moreover, it is incremental, it provides a hierarchical structure of clusters and it outperforms the other methods in the context of a large database (best results and best computational time in our tests). Therefore, BIRCH is chosen for the initial unsupervised clustering in our model. After the initial clustering, the user views the clustering results and provides feedback to the system. Pairwise constraints (must-link, cannot-link) are deduced from the user feedback; the system then re-organizes the clusters by considering these constraints. The re-clustering is done using the proposed semi-supervised clustering method. This interactive loop (the user provides feedback and the system reorganizes the clusters) is repeated until the clustering result satisfies the user. The interactive semi-supervised clustering model contains the following steps:
1. Initial clustering using BIRCH unsupervised clustering.
2. Repeat:
   (a) Receive feedback from the user and deduce pairwise constraints.
   (b) Re-organize the clusters using the proposed semi-supervised clustering method.
   until the clustering result satisfies the user.
3.1 BIRCH unsupervised clustering

Let us briefly describe the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) unsupervised clustering method (Zhang et al., 1996). The idea of BIRCH is to build a Clustering Feature tree (CF-tree).

We define a CF-vector summarizing the information of a cluster of N points {x⃗_i, i = 1..N} as CF = (N, LS⃗, SS), where LS⃗ and SS are respectively the linear sum and the square sum of the points:

LS⃗ = Σ_{i=1}^{N} x⃗_i;   SS = Σ_{i=1}^{N} ‖x⃗_i‖²

From the CF-vectors, we can simply compute the centroid and the radius (average distance from the points to the centroid) of a cluster, and also the distance between two clusters (e.g. the Euclidean distance between their centroids). A CF-tree is a balanced tree having three parameters B, L and T:
Each internal node contains at most B elements of the form (CF_i, child_i), where child_i is a pointer to its ith child node. Each leaf node contains at most L entries and also contains two pointers, prev and next, to link leaf nodes. All entries in a leaf node must have a radius lower than a threshold T (threshold condition).

The CF-tree is created by successively inserting points. Each new point is inserted into the closest leaf entry if the threshold condition is not violated; if this is impossible, a new entry CF_j is created for the new point. The corresponding internal and leaf nodes are split if necessary. After creating the CF-tree, we can use any clustering method (AHC, k-means, etc.) for clustering the leaf entries; we use one that is suitable to be combined with our proposed semi-supervised clustering in the interactive phase.
3.2 Proposed semi-supervised clustering method
At each interactive iteration, our semi-supervised clustering method is applied after receiving feedback from the users, in order to re-organize the clusters according to their wishes. Our semi-supervised clustering method considers the set S_CF of all leaf entries of the CF-tree. The supervision is provided as two sets of pairwise constraints between CF entries: a set M_CF of must-link constraints and a set C_CF of cannot-link constraints. A constraint between two CF entries involves these two entries, and therefore all the points which are included in them. The objective function to be minimized is as follows:
J_obj = Σ_{CF_i ∈ S_CF} D(CF_i, μ_{l_i}) + Σ_{(CF_i,CF_j) ∈ M_CF, l_i ≠ l_j} w · N_CF_i · N_CF_j · D(CF_i, CF_j) + Σ_{(CF_i,CF_j) ∈ C_CF, l_i = l_j} w̄ · N_CF_i · N_CF_j · (D_max − D(CF_i, CF_j))    (5)
where:

- The first term measures the distortion between each leaf entry and its cluster center.
- The second and third terms represent the penalty costs for violating, respectively, the must-link and cannot-link constraints between CF entries. w and w̄ are constants specifying the violation cost of a must-link and a cannot-link between two points, and N_CF_i and N_CF_j are the numbers of points in the two entries, so the violation cost of a pairwise constraint between two entries is weighted by the number of point pairs involved.
- D_max is the maximum distance between two CF entries in the data set. Therefore, higher penalties are assigned to violations of must-link between entries that are distant and of cannot-link between entries which are close. As in HMRF-kmeans, the term D_max is sensitive to extreme outliers, and could be replaced by the maximum distance between two clusters if the database contains extreme outliers.
In our case, we use the squared Euclidean distance, the most frequently used distortion measure. The distance between two entries CF_i = (N_CF_i, LS⃗_CF_i, SS_CF_i) and CF_j = (N_CF_j, LS⃗_CF_j, SS_CF_j) is defined as the distance between their means, as follows:

D(CF_i, CF_j) = Σ_{p=1}^{d} ( LS_CF_i(p)/N_CF_i − LS_CF_j(p)/N_CF_j )²    (6)

where d is the number of dimensions of the feature space.
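Eq. (6) uses only the N and LS⃗ components of the two entries; a direct transcription (with CF entries represented as (N, LS, SS) tuples, a naming assumption of ours) could be:

```python
def cf_distance(cf_i, cf_j):
    """Squared Euclidean distance between the means of two CF entries,
    computed directly from the (N, LS) components as in Eq. (6)."""
    return sum((a / cf_i[0] - b / cf_j[0]) ** 2
               for a, b in zip(cf_i[1], cf_j[1]))
```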
The proposed semi-supervised clustering is as follows:

Method:
1. Set t ← 0.
2. Repeat until convergence:
   (a) Re-assignment: compute the cluster labels {l_i^(t+1)} of the entries {CF_i, i = 1..m} so as to minimize the objective function.
   (b) Re-estimation: re-compute the cluster centers so as to minimize the objective function.
   (c) t ← t + 1.
In the re-assignment step, given the current cluster centers, each CF entry is assigned to the cluster h which minimizes its contribution to the objective function, as follows:

J_obj(CF_i, μ_h) = D(CF_i, μ_h) + Σ_{(CF_i,CF_j) ∈ M_CF, h ≠ l_j} w · N_CF_i · N_CF_j · D(CF_i, CF_j) + Σ_{(CF_i,CF_j) ∈ C_CF, h = l_j} w̄ · N_CF_i · N_CF_j · (D_max − D(CF_i, CF_j))    (7)

We can see that the optimal assignment of each CF entry also depends on the current assignment of the other CF entries, due to the violation cost of pairwise constraints in the second and third terms. Therefore, the CF entries are randomly re-ordered, and the re-assignment process is repeated until no CF entry changes its cluster label between two successive iterations.
In the re-estimation step, the cluster centers are re-computed to minimize the objective function given the current assignment. For simple calculation, each cluster center is also represented in the form of a CF-vector. By using the squared Euclidean measure, the CF-vector of the center μ_h of cluster h is computed from the entries which are assigned to this cluster as follows:

N_μh = Σ_{l_i = h} N_CF_i;   LS⃗_μh = Σ_{l_i = h} LS⃗_CF_i;   SS_μh = Σ_{l_i = h} SS_CF_i

In the re-assignment step, a CF entry changes its cluster label only if the objective function is decreased by this re-assignment; the objective function therefore decreases monotonically during this step. And in each re-estimation step, the mean of the CF-vector of each cluster center is the mean of all the entries (and therefore the points) in this cluster, which minimizes the distortion part of the objective function involved in cluster center re-estimation; the objective function therefore also decreases during this step. By alternating the re-assignment and re-estimation steps, the proposed semi-supervised clustering converges to a (at least local) minimum in each interactive iteration.
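The alternating scheme above can be sketched end to end. The following is a simplified illustration under assumptions of our own (entries reduced to (N, LS⃗) pairs, unit violation costs w = w̄ = 1, naive initialization), not the authors' code:

```python
def semi_supervised_cluster(cfs, k, must, cannot, w=1.0, wbar=1.0, iters=20):
    """Alternate (a) assigning each CF entry to the cluster minimizing its
    contribution to Eq. (7) and (b) re-estimating centers by summing the
    CF components of the members. cfs: list of (N, LS) tuples;
    must/cannot: lists of index pairs into cfs."""
    d = len(cfs[0][1])
    mean = lambda cf: [v / cf[0] for v in cf[1]]
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    dmax = max(dist(mean(a), mean(b)) for a in cfs for b in cfs)
    centers = [mean(cfs[h]) for h in range(k)]      # naive initialization
    labels = [i % k for i in range(len(cfs))]
    for _ in range(iters):
        changed = False
        for i, cf in enumerate(cfs):
            def cost(h):
                j_obj = dist(mean(cf), centers[h])
                for a, b in must:                   # broken must-link penalty
                    if i in (a, b):
                        j = b if a == i else a
                        if labels[j] != h:
                            j_obj += w * cf[0] * cfs[j][0] * dist(mean(cf), mean(cfs[j]))
                for a, b in cannot:                 # broken cannot-link penalty
                    if i in (a, b):
                        j = b if a == i else a
                        if labels[j] == h:
                            j_obj += wbar * cf[0] * cfs[j][0] * (dmax - dist(mean(cf), mean(cfs[j])))
                return j_obj
            best = min(range(k), key=cost)
            if best != labels[i]:
                labels[i], changed = best, True
        for h in range(k):                          # center re-estimation
            members = [cfs[i] for i, l in enumerate(labels) if l == h]
            if members:
                n = sum(m[0] for m in members)
                centers[h] = [sum(m[1][p] for m in members) / n for p in range(d)]
        if not changed:
            break
    return labels
```

A real implementation would re-order the entries randomly between passes, as described above; the fixed order here keeps the sketch deterministic.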
After each interactive iteration, new constraints are given to the system. These new constraints might contradict some of the constraints previously deduced by the system from earlier interactive iterations. For this reason, and also for computational time matters, our system omits at each step some of the constraints deduced at earlier steps. Therefore, the objective function changes from one interactive iteration to the next, and the convergence of the interactive semi-supervised model is thus not guaranteed. But we can verify the convergence of the model in practice by computing, at the end of all the interactive iterations, a global objective function which considers all the feedback given by the user in all the interactive iterations, and then by verifying whether this global objective function has improved after the different interactive steps. This is a part of our current work.
3.3 Interactive interface
In order to allow the user to view the clustering results and to provide feedback to the system, we implemented an interactive interface (Fig. 1). Its main component is a 2D principal plane representing all the presented clusters by their prototype images. In our system, the maximum number of cluster prototypes presented to the user on the principal plane is fixed at 30. The prototype image of each cluster is the most representative image of that cluster, chosen as follows. In our model, we use the internal measure SW to estimate the quality of each image in a cluster. The higher the SW value of an image in a cluster, the more representative this image is for the cluster. The prototype image of a cluster is thus the image with the highest SW value in the cluster. Any other internal measure could be used instead.

The position of the prototype image of each cluster in the principal plane represents the position of the corresponding cluster center. This means that, if two cluster centers are close (or distant) in the n-dimensional feature space, their prototype images are close (or distant) in the 2D principal plane. For representing the cluster centers, which are n-dimensional vectors, in a 2D plane, we use Principal Component Analysis (PCA) and project them onto the two principal axes associated with the highest eigenvalues. The importance of an axis is represented by its inertia (the sum of the squared projections of the data onto this axis) and by the ratio of its inertia to the total inertia of all axes. In general, if the two principal axes explain (cumulatively) 80% or more of the total inertia, the PCA approach can lead to a good 2D representation of the prototype images. In our case, the accumulated inertia explained by the first two principal axes is about 65% for the Wang and PascalVoc2006 databases and about 20% for the Caltech101 and Corel30k image databases. As only a maximum of 30 clusters (and therefore 30 prototype images) can be shown to the user in an interactive iteration, an imperfect 2D representation of the prototype images does not influence the results, as long as the user can distinguish between the prototype images and have a rough idea of the distances between the clusters. When some prototype images overlap each other, a slight modification of the PCA components can help to separate these images.
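For illustration, the PCA projection of the cluster centers onto the first two axes can be sketched without any linear-algebra library, using power iteration with deflation (a simplification of ours; any standard PCA routine would do):

```python
def pca_2d(points, iters=200):
    """Project n-dim points onto the two principal axes of largest
    inertia, via power iteration on the covariance matrix."""
    n, d = len(points), len(points[0])
    m = [sum(p[j] for p in points) / n for j in range(d)]
    x = [[p[j] - m[j] for j in range(d)] for p in points]       # center data
    cov = [[sum(r[i] * r[j] for r in x) / n for j in range(d)] for i in range(d)]

    def top_eigvec(c):
        v = [1.0] * d
        for _ in range(iters):
            w = [sum(c[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(t * t for t in w) ** 0.5
            v = [t / norm for t in w]
        return v

    v1 = top_eigvec(cov)
    lam1 = sum(v1[i] * sum(cov[i][j] * v1[j] for j in range(d)) for i in range(d))
    # Deflate the first axis, then extract the second.
    cov2 = [[cov[i][j] - lam1 * v1[i] * v1[j] for j in range(d)] for i in range(d)]
    v2 = top_eigvec(cov2)
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    return [(dot(p, v1), dot(p, v2)) for p in x]
```

Because the two axes are orthonormal, distances inside the plane they span are preserved, which is what lets the 2D layout reflect inter-cluster distances.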
By clicking on a prototype image in the principal plane, the user can open a detailed view of the corresponding cluster. Each cluster opened by the user is represented by a circle:
The prototype image of this cluster is located at the center of the circle
The 10 most representative images (images with the highest
SW values), which have not received feedback from the user
in the previous iterations, are located in the first circle of images around the prototype image, near the center
Fig. 1. 2D interactive interface. The rectangle at the bottom right corner represents the principal plane consisting of the two first principal axes (obtained by PCA) of the cluster centers.
The 10 least representative images (images with the smallest
SW values), which have not received feedback from the user
in the previous iterations, are located in the second circle of
images around the prototype image, close to the cluster border
By showing, for each iteration, the images which have not received
user feedback in previous iterations, we wish to obtain feedback
for different images
The user can specify positive feedback and negative feedback for the images of a cluster. The user can also change the cluster assignment of a given image by dragging and dropping the image from the original cluster to the new cluster. When an image is moved from cluster A to cluster B, this is considered as negative feedback for cluster A and positive feedback for cluster B. Therefore, after each interactive iteration, the process returns a positive image list and a negative image list for each cluster with which the user has interacted.
3.4 Pairwise constraint deduction
In each interactive iteration, the user feedback is in the form of positive and negative images, while the supervised input information of the proposed semi-supervised clustering method consists of pairwise constraints between CF entries. Therefore, we have to deduce the pairwise constraints between CF entries from the user feedback.
At each interactive iteration and for each interacted cluster, all the positive images should stay in this cluster while the negative images should move to another cluster. We consider that each image in the positive set is linked to each image in the negative set by a cannot-link, while all the images in the positive set are linked by must-links. Assuming that all feedback is coherent between the different interactive iterations, we group the images which should be in the same cluster, according to the user feedback of all the interactive iterations, into a group called a neighborhood. We define:

- Np: the list of neighborhoods, each containing images which should belong to a same cluster.
- CannotNp: for each neighborhood, a list including the labels of the neighborhoods which should not be in the same cluster; two neighborhoods are cannot-link neighborhoods if there is at least one cannot-link between their images.
After receiving the list of feedback in the current iteration, the lists Np and CannotNp are updated as follows. For each cluster which receives interaction from the user:

- If no positive image of the cluster belongs to an existing neighborhood → create a new neighborhood for these positive images.
- If the positive images belong to one or multiple neighborhoods → merge these neighborhoods (in the case of multiple neighborhoods) into one single neighborhood, insert the other positive images which are not included in any neighborhood into this neighborhood, and update the set CannotNp to signify that the neighborhoods that had a cannot-link with one of the merged neighborhoods now have a cannot-link with the new neighborhood.
- Update CannotNp so that the neighborhoods containing negative images of the cluster have a cannot-link with the neighborhood corresponding to the positive images of the cluster.
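One possible realization of the merge step above (data structures and function names are hypothetical, not from the paper):

```python
def update_neighborhoods(neighborhoods, cannot, positives):
    """Update Np/CannotNp for one interacted cluster.
    neighborhoods: dict id -> set of image ids; cannot: dict id -> set of
    neighborhood ids it has a cannot-link with; positives: positive images."""
    hits = {nid for nid, imgs in neighborhoods.items() if imgs & set(positives)}
    if not hits:
        # No positive image belongs to a neighborhood: create a new one.
        nid = max(neighborhoods, default=0) + 1
        neighborhoods[nid] = set(positives)
        cannot[nid] = set()
        return nid
    # Merge all touched neighborhoods into one and add the positives.
    target = min(hits)
    for nid in hits - {target}:
        neighborhoods[target] |= neighborhoods.pop(nid)
        cannot[target] |= cannot.pop(nid)
        # Re-point cannot-links that referenced a merged neighborhood.
        for other in cannot:
            if nid in cannot[other]:
                cannot[other].discard(nid)
                cannot[other].add(target)
    neighborhoods[target] |= set(positives)
    cannot[target].discard(target)
    return target
```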
As we assume that the user feedback is coherent among the different interactive iterations, all the images in a same neighborhood should be in a same cluster and the images of cannot-link neighborhoods should be in different clusters. There may, however, be cannot-links between images inside a same CF entry; such an entry must be split. Inside the entry, we call seeds the subsets of images corresponding to the different neighborhoods, together with the images which are not included in any neighborhood. Cannot-links may or may not exist between the seeds of a CF entry. For each CF entry that should be split, we present the user with each pair of seeds which do not have a cannot-link between them, to ask for more information (for each seed, the image which is closest to the center of the seed is presented):

- If the user indicates that there is a must-link between these two seeds, these seeds and also their corresponding neighborhoods are merged.
- If the user indicates that there is a cannot-link between these two seeds, the corresponding CannotNp lists are updated, specifying that their two corresponding neighborhoods have a cannot-link between them.

A CF entry with p remaining seeds is then split into p different CF entries; each new CF entry contains all the points of one seed, and the points which do not belong to any seed are assigned to the CF entry corresponding to the closest seed. By splitting the necessary CF entries into purer CF entries, we eliminate the cases where a cannot-link exists between images of a same CF entry or where a must-link and a cannot-link exist simultaneously between images of two different CF entries. Subsequently, pairwise constraints between CF entries can be deduced from the pairwise constraints between images as follows: if there is a must-link (respectively cannot-link) between two images of two CF entries, a must-link (respectively cannot-link) is created between these two CF entries.
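The seed-based split can be sketched as follows (a hypothetical helper of ours; points as coordinate lists, seeds as lists of point lists):

```python
def split_cf_entry(points, seeds):
    """Split an impure CF entry: each seed becomes a new sub-entry, and
    every remaining point joins the sub-entry of its closest seed centroid."""
    mean = lambda vs: [sum(c) / len(vs) for c in zip(*vs)]
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    centers = [mean(s) for s in seeds]
    parts = [list(s) for s in seeds]
    seeded = {tuple(p) for s in seeds for p in s}
    for p in points:
        if tuple(p) not in seeded:
            h = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            parts[h].append(p)
    return parts
```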
Concerning pairwise constraints between images, a simple and complete way to deduce them is to create a must-link between each pair of images of a same neighborhood, and to create, for each pair of cannot-link neighborhoods, a cannot-link between each image of the first and each image of the second. However, by deducing constraints between images in this way, the number of constraints between images can be very high, and therefore the number of constraints between CF entries could also be very high. The processing time of the semi-supervised clustering in the next phase could thus be very high due to the high number of constraints. There are different strategies for deducing pairwise constraints between images that could reduce the number of constraints. For instance, must-links may be created only between the positive images of each cluster, while cannot-links are created between the positive and negative images of each cluster (note that a displacement feedback corresponds to both a negative image of the source cluster and a positive image of the destination cluster).¹

¹ For interpretation of color in Fig. 1, the reader is referred to the web version of this article.
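For one interacted cluster, the simple complete scheme reduces to a few set operations (a sketch with hypothetical names):

```python
from itertools import combinations

def deduce_pairs(positive, negative):
    """Deduce image-level constraints from one cluster's feedback:
    must-link between every two positive images, cannot-link between
    each positive and each negative image."""
    must = set(combinations(sorted(positive), 2))
    cannot = {(p, n) for p in positive for n in negative}
    return must, cannot
```

The quadratic growth of both sets is exactly why the reduced strategies discussed above matter for large clusters.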
4. Experiments
In this section, we present some experimental results of our interactive semi-supervised clustering model. We also experimentally compare our semi-supervised clustering model with the semi-supervised HMRF-kmeans. When using the semi-supervised HMRF-kmeans in the re-clustering phase, the initial unsupervised clustering is k-means.
4.1 Experimental protocol
In order to analyze the performance of our interactive semi-supervised clustering model, we use different image databases (Wang, PascalVoc2006 and Caltech101, the latter divided into 101 classes). Note that in our experiments we use the same number of clusters as the number of classes in the ground truth. Cluster prototypes are shown to the user on the principal plane; users can choose to view and interact with any cluster in which they are interested. For databases which have a small number of classes, such as Wang and PascalVoc2006, all the prototype images can be shown on the principal plane. For databases which have a large number of classes, such as Caltech101, only a part of the prototype images can be shown for visualization. In our system, the maximum number of cluster prototypes shown to the user in each iteration is fixed at 30. We use two simple strategies for choosing the clusters to be shown at each iteration: 30 clusters chosen randomly, or iteratively chosen pairs of closest clusters until there are 30 clusters.
External measures compare the clustering results with the ground truth; they are thus appropriate for estimating the quality of an interactive clustering involving user interaction. As different external measures analyze the clustering results in a similar way (see Lai et al. (2012a)), we use, in this paper, the V-measure: the higher the V-measure values are, the better the results (compared to the ground truth).
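Since the evaluation relies on the V-measure, a self-contained sketch of its computation (the harmonic mean of homogeneity and completeness, both defined via conditional entropies) may help; this follows the standard definition, not any code from the paper:

```python
from math import log
from collections import Counter

def v_measure(truth, pred):
    """V-measure = 2hc/(h+c), with homogeneity h = 1 - H(truth|pred)/H(truth)
    and completeness c = 1 - H(pred|truth)/H(pred)."""
    n = len(truth)

    def entropy(labels):
        return -sum(c / n * log(c / n) for c in Counter(labels).values())

    def cond_entropy(a, b):  # H(A | B)
        h = 0.0
        for bv in set(b):
            idx = [i for i in range(n) if b[i] == bv]
            for c in Counter(a[i] for i in idx).values():
                h -= c / n * log(c / len(idx))
        return h

    hc, hk = entropy(truth), entropy(pred)
    homogeneity = 1.0 if hc == 0 else 1.0 - cond_entropy(truth, pred) / hc
    completeness = 1.0 if hk == 0 else 1.0 - cond_entropy(pred, truth) / hk
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

A perfect clustering scores 1.0 even if the cluster ids are permuted with respect to the class labels, which is what makes the measure usable with unsupervised output.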
Concerning feature descriptors, we use the local descriptor rgSIFT, chosen for its high performance. The SIFT descriptor detects interest points in an image and describes the local neighborhood around each interest point by a 128-dimensional histogram of local gradient directions of image intensities. The rgSIFT descriptor of each interest point is computed as the concatenation of the SIFT descriptors calculated for the r and g components of the normalized RGB color model and for the intensity channel, resulting in a 3x128-dimensional vector. A bag-of-visual-words approach is then used to group the local features of each image into a single vector. It consists of two steps. Firstly, k-means clustering is used to group the local features of all the images in the database into a number dictSize of clusters. We then generate a dictionary containing dictSize visual words, which are the centroids of these clusters. The feature vector of each image is a dictSize-dimensional histogram representing the frequency of occurrence of the visual words of the dictionary, obtained by replacing each local descriptor of the image by the nearest visual word. Our previous experiments showed that local descriptors are better than global descriptors regarding the external measures, and that the value dictSize = 200 is a good trade-off between the size of the feature vector and the performance. Therefore, in our experiments, we use the rgSIFT descriptor together with a visual word dictionary of size 200.
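The encoding step of the bag-of-visual-words pipeline described above can be sketched as follows (the dictionary is assumed to be given, e.g. produced by k-means; names are our own):

```python
def bovw_histogram(descriptors, dictionary):
    """Replace each local descriptor by its nearest visual word and
    build a normalized occurrence histogram over the dictionary."""
    hist = [0] * len(dictionary)
    for d in descriptors:
        nearest = min(range(len(dictionary)),
                      key=lambda k: sum((a - b) ** 2
                                        for a, b in zip(d, dictionary[k])))
        hist[nearest] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]
```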
In order to run the interactive tests automatically, we implement a software agent, later referred to as the "user agent", that simulates the behavior of a human user interacting with the system (assuming that the agent knows the whole ground truth, containing the class label of each image). At each interactive iteration, the clustering results are returned to the user agent by the system; the agent then simulates the behavior of the user giving feedback to the system. For simulating the user behavior, we follow these rules:
– At each interactive iteration, the user agent interacts with a fixed number c of clusters.
– The user agent uses two strategies for choosing clusters: c clusters chosen randomly, or iteratively chosen pairs of closest clusters until there are c clusters.
Fig. 2. Example of pairwise constraint deduction between images from the user feedback.
2 http://wang.ist.psu.edu/docs/related/.
3 http://pascallin.ecs.soton.ac.uk/challenges/VOC/.
– The user agent determines the image class (in the ground truth) corresponding to each cluster by the most represented class among the 21 presented images of the cluster. The number of images of this class in the cluster must be greater than a threshold MinImages; if this is not the case, the cluster is considered as a noise cluster. In our experiments, MinImages = 5 for databases having a small number of classes (Wang, PascalVoc2006), and MinImages = 2 for databases having a large number of classes (Caltech101).
– When several clusters (among the chosen clusters) correspond to the same class, the cluster in which the images of this class are the most numerous (among the 21 shown images of the cluster) is chosen as the principal cluster of this class. The classes of the other clusters are redefined in the same way, but ignoring the images of this class.
– In each chosen cluster, all images for which the result of the algorithm corresponds to the ground truth are labeled as positive samples of this cluster, while the others are negative samples of this cluster. All negative samples are moved to the cluster (among the chosen clusters) corresponding to their class in the ground truth.
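The rules above can be sketched as a simple feedback function. The function name `agent_feedback` and the dictionary-based ground truth are illustrative assumptions, not the authors' code:

```python
from collections import Counter

MIN_IMAGES = 5  # MinImages threshold (5 for Wang/PascalVoc2006, 2 for Caltech101)

def agent_feedback(shown_images, ground_truth):
    """Simulated user feedback for one cluster: determine the cluster's
    class as the majority class among the (up to 21) shown images, then
    split the shown images into positive and negative samples. If the
    majority count does not exceed MIN_IMAGES, the cluster is treated as
    a noise cluster (illustrative sketch)."""
    labels = [ground_truth[img] for img in shown_images]
    cls, count = Counter(labels).most_common(1)[0]
    if count <= MIN_IMAGES:
        return None, [], []  # noise cluster: no class, no feedback
    positives = [img for img in shown_images if ground_truth[img] == cls]
    negatives = [img for img in shown_images if ground_truth[img] != cls]
    return cls, positives, negatives
```

In the full protocol, the negatives returned here would then be moved to the chosen cluster matching their ground-truth class.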
Our model deduces pairwise constraints between images based on the user feedback in each iteration and also on the neighborhood information. User feedback is in the form of positive and negative images of each cluster (an image which is displaced from one cluster to another is considered as a negative image of the source cluster and a positive image of the destination cluster). The neighborhood information accumulates the user feedback given during all interactive iterations. Pairwise constraints between images can be used directly by the semi-supervised HMRF-kmeans, while they have to be converted into constraints between CF entries to be used by our proposed semi-supervised clustering. We divide pairwise constraints between images into two kinds: user constraints and deduced constraints. User constraints are created directly, based on the user feedback in each iteration, while deduced constraints are created by deduction rules. For instance, in the first iteration, must-links are created between positive images of a cluster and cannot-links between positive and negative images of the same cluster.
Table 1
Different strategies for deducing pairwise constraints between images, based on user feedback and on neighborhood information.
Strategy 1
Constraints used: all user constraints of all interactive iterations; all deduced constraints of all interactive iterations.
Creation: all constraints are created based on the neighborhood information:
– Must-link between each pair of images of each neighborhood.
– Cannot-link between each image of each neighborhood Np_i ∈ Np and each image of each neighborhood having a cannot-link with Np_i (listed in cannotNp_i).

Strategy 2
Constraints used: all user constraints of all interactive iterations; no deduced constraints.
Creation: in each iteration, all possible user constraints are created:
– Must-link between each pair of positive images of each cluster.
– Cannot-link between each pair of a positive image and a negative image of the same cluster.

Strategy 3
Constraints used: all user constraints of all interactive iterations; deduced constraints of the current iteration only (deduced constraints from the previous iterations are eliminated).
Creation: in each iteration, all possible user constraints are created as in Strategy 2. Deduced constraints of the current iteration are created while updating the neighborhoods as follows:
– If there is a must-link (or cannot-link) (x_i, x_j) with x_j ∈ Np_m, deduced must-links (or cannot-links) (x_i, x_l), x_l ∈ Np_m, are created.
– If there is a must-link (or cannot-link) (x_i, x_j) with x_i ∈ Np_m and x_j ∈ Np_n, deduced must-links (or cannot-links) (x_k, x_l), ∀x_k ∈ Np_m, ∀x_l ∈ Np_n, are created.

Strategy 4
Constraints used: user constraints between images and cluster centers of all interactive iterations; deduced constraints between images and cluster centers of the current iteration (deduced constraints from the previous iterations are eliminated).
Creation: in each iteration, the positive image having the best internal measure (SW) value among all positive images of each cluster is taken as the center of this cluster. Must-link/cannot-link user constraints are created in each iteration between each positive/negative image and the corresponding cluster center. Deduced constraints of the current iteration are created while updating the neighborhoods as follows:
– If x_i and x_j must be in the same (or different) clusters (based on user feedback) and x_j ∈ Np_m, deduced must-links (or cannot-links) are created between x_i and each center image of Np_m.
– If x_i and x_j must be in the same (or different) clusters (based on user feedback), with x_i ∈ Np_m and x_j ∈ Np_n, deduced must-links (or cannot-links) are created between x_i and each center image of Np_n, and between x_j and each center image of Np_m.

Strategy 5
Constraints used: user constraints (must-links between the most distant images and cannot-links between the closest images) of all iterations; deduced constraints (must-links between the most distant images and cannot-links between the closest images) of all iterations.
Creation: user constraints are created for each cluster in each iteration as follows: must-links are successively created between the two positive images that have the longest distance (at least one of them not yet selected by any must-link) until all positive images of the cluster are connected by these must-links; cannot-links are created between each negative image and the nearest positive image of the cluster. Deduced constraints are created in each iteration as follows: must-links are successively created, for each neighborhood, between the two images that have the longest distance until all images of this neighborhood are connected by these must-links; cannot-links are deduced, for each pair of cannot-link neighborhoods (Np_i, Np_j), between each image of Np_i and the nearest image of Np_j, and between each image of Np_j and the nearest image of Np_i.

Strategy 6
Constraints used: same idea as in Strategy 5, but the size of the neighborhoods is considered while creating deduced cannot-links.
Creation: user constraints and deduced must-link constraints are created as in Strategy 5. For each pair of cannot-link neighborhoods, deduced cannot-links are only created between each image of the neighborhood that has the fewest images and the nearest image of the neighborhood that has the most images.
In the example of Fig. 2, a cannot-link (x_3, x_4) is created based on the must-link (x_1, x_4) and the cannot-link involving x_1 and x_3; further constraints can be created based on the neighborhood information. In our experiments, we use the different strategies of Table 1 for deducing pairwise constraints.
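As an illustration of such deduction rules, a Strategy-3-style propagation of a single user constraint through the neighborhoods could look like the following sketch (an assumption-laden illustration, not the authors' code; neighborhoods are modeled as disjoint sets of image ids):

```python
def deduce_constraints(x_i, x_j, neighborhoods):
    """Propagate one user constraint (x_i, x_j) through the neighborhoods,
    in the spirit of Strategy 3 (Table 1). `neighborhoods` is a list of
    disjoint sets of image ids; the constraint type (must-link or
    cannot-link) is tracked by the caller and applies to all deductions."""
    find = lambda x: next((nb for nb in neighborhoods if x in nb), None)
    np_m, np_n = find(x_i), find(x_j)
    deduced = set()
    if np_n is not None:
        # x_j belongs to Np_n: deduce (x_i, x_l) for every x_l in Np_n
        deduced |= {(x_i, x_l) for x_l in np_n if x_l != x_i}
    if np_m is not None and np_n is not None:
        # x_i in Np_m and x_j in Np_n: deduce (x_k, x_l) for all pairs
        deduced |= {(x_k, x_l) for x_k in np_m for x_l in np_n if x_k != x_l}
    deduced.discard((x_i, x_j))  # keep only newly deduced constraints
    return deduced
```

For example, with neighborhoods {x_1, x_2} and {x_3, x_4}, a constraint (x_1, x_3) propagates to (x_1, x_4), (x_2, x_3) and (x_2, x_4).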
4.2 Experimental results
4.2.1 Analysis of different strategies for deducing pairwise constraints between images
The first set of experiments aims at evaluating the performance of our interactive semi-supervised clustering model using different strategies for deducing pairwise constraints between images. Note that constraints between CF entries should be deduced from constraints between images before being used in the re-clustering phase. We use the Wang and the PascalVoc2006 image databases for these experiments. For these two databases, we propose three test scenarios (note that c specifies the number of clusters chosen for interaction in each iteration):
– Scenario 1: the c = 5 closest clusters are chosen.
– Scenario 2: c = 5 clusters are randomly chosen.
– Scenario 3: c = 10, i.e. all clusters are chosen (Wang and PascalVoc2006 both have 10 clusters).
Fig. 3. Results of our proposed interactive semi-supervised clustering model during 50 interactive iterations on (a) the Wang and (b) the PascalVoc2006 image databases, using the 6 strategies for deducing pairwise constraints. The horizontal axis specifies the number of iterations.
Note that our experiments are carried out automatically, i.e. the feedback is given by a software agent simulating the behavior of a human user interacting with the system. In practice, a human user can give feedback by clicking to specify the positive and/or negative images of each cluster, or by dragging and dropping an image from one cluster to another. For each cluster selected by the user, only 21 images of this cluster are displayed (see Fig. 1). Therefore, for interacting with 5 clusters (scenarios 1 and 2) or 10 clusters (scenario 3), the user has to perform at most 105 or 210 mouse clicks, respectively, in each interactive iteration. These upper bounds depend neither on the size of the database nor on the pairwise constraint deduction strategy, and in practice the number of clicks that the user has to provide is far lower. However, the number of deduced constraints may be much greater than the number of user clicks (and this number depends on the database size and on the pairwise constraint deduction strategy). When applying the interactive semi-supervised clustering model in the indexing phase, the user is generally required to provide as much feedback as possible in order to obtain a good indexing structure, which could lead to better results in the subsequent retrieval phase. Therefore, in the case of the indexing phase, the proposed number of clicks seems tractable.
Fig. 3(a) and (b) show, respectively, the results during 50 interactive iterations of our proposed interactive semi-supervised clustering model on the Wang and PascalVoc2006 image databases, with the three proposed scenarios. The results are shown for the different constraint deduction strategies of Table 1. The vertical axis specifies the V-measure values, while the horizontal axis specifies the number of iterations. Note that with each selected cluster, the user agent gives all possible feedback. Therefore, for each scenario, the amounts of user feedback are equivalent between different iterations and between different strategies. As clusters are randomly chosen in scenario 2, we run this scenario 10 times for each database. The curves of scenario 2 represent the mean values of the V-measure over these 10 executions at each iteration. The average standard deviation of each strategy after 50 iterations is given in Table 2, and the processing times are given in Table 3 (for scenario 2, the average execution times over the 10 executions are shown). The experiments are executed on a standard PC with 2 GB of RAM.
We can see that the clustering results generally improve after each interactive iteration, in which the system re-clusters the dataset by considering the constraints deduced from the accumulated user feedback. In most cases, the clustering results converge after only a few iterations; this may be due to the fact that no new knowledge is provided afterwards. Moreover, the clustering results are better and converge more quickly when the number of chosen clusters (and therefore the number of constraints) in each interactive iteration is higher: scenario 3 gives better results and converges more quickly than scenarios 1 and 2. In addition, for both image databases, scenario 2, in which clusters are randomly chosen for interacting, gives better results than scenario 1, in which the closest clusters are chosen. When selecting the closest clusters, only a few clusters may repeatedly receive user feedback; the constraint information is therefore poorer than when the clusters are randomly selected, in which case all clusters may receive user feedback.
As regards the different strategies for deducing pairwise constraints, we can see that for each database, the average standard deviations over the 10 executions of scenario 2 are similar for all strategies. Therefore, we can compare the different strategies based on the mean values:
– Strategy 1 shows, in general, very good performance, but the processing time is huge because it uses all possible user constraints and deduced constraints created during all iterations.
– Strategy 2, the only strategy using solely user constraints, generally gives the worst results; thus deduced constraints are needed for better performance. Its processing time is also high due to the large number of user constraints.
– Strategy 3 shows good or very good performance, but some oscillations exist between different iterations because, by overlooking previously deduced constraints, some important constraints may be omitted. Its processing time is high.
– Strategy 4 gives better results than strategy 2, but the results are unstable because this strategy also overlooks previously deduced constraints. It has a good execution time thanks to the reduced number of constraints.
– Strategy 5 generally gives good or very good results by keeping the important constraints (must-links between the most distant images and cannot-links between the closest images), but its processing time is still high.
– Strategy 6, by reducing the deduced cannot-link constraints of strategy 5, gives in general very good results in a low execution time.
We can conclude from this analysis that strategy 6 offers the best trade-off between performance and processing time. This strategy will be used in further experiments.
4.2.2 Comparison of the proposed semi-supervised clustering model and the semi-supervised HMRF-kmeans
Figs. 4(a) and (b) represent, respectively, the clustering results over 50 interactive iterations on the Wang and the PascalVoc2006 image databases when using our proposed semi-supervised clustering and the semi-supervised HMRF-kmeans in the re-clustering phase. In both cases, strategy 6 for deducing pairwise constraints between images is used. Note that the results of scenario 2 represent the mean values over the 10 executions.
Table 2
Average standard deviation over the 10 executions of scenario 2 after 50 interactive iterations, corresponding to the experiments of our proposed interactive semi-supervised clustering model shown in Fig. 3(a) and (b).
Average standard deviation
Table 3
Processing time after 50 interactive iterations of the experiments of our proposed interactive semi-supervised clustering model shown in Fig. 3(a) and (b).
Wang database
PascalVoc2006 database