Keyword Visual Representation for Image Retrieval
and Image Annotation

Nhu Van Nguyen*,‡, Alain Boucher*,†,§ and Jean-Marc Ogier*,¶

* Lab L3I, University of La Rochelle, La Rochelle, France
† IFI, MSI; IRD, UMI 209 UMMISCO, Vietnam National University, Hanoi, Vietnam
‡ nhu-van.nguyen@univ-lr.fr
§ alain.boucher@univ-lr.fr
¶ Jean-Marc.Ogier@univ-lr.fr
Received 1 December 2013; Accepted 13 April 2015; Published 24 June 2015.
Keyword-based image retrieval is more comfortable for users than content-based image retrieval. Because of the lack of semantic description of images, image annotation is often used a priori, by learning the association between semantic concepts (keywords) and images (or image regions). This association issue is particularly difficult but interesting, because it can be used not only for annotating images but also for multimodal image retrieval. However, most association models are unidirectional, from image to keywords. In addition, existing models rely on a fixed image database and prior knowledge. In this paper, we propose an original association model which provides a bidirectional image-keyword transformation. Built on the state-of-the-art Bag of Words model for image representation and including a strategy of interactive incremental learning, our model works well with an image database that starts with zero or weak knowledge and evolves over time. Objective quantitative and qualitative evaluations of the model are proposed in order to highlight the relevance of the method.
Keywords: Image retrieval; image annotation; incremental learning; user interaction.
1 Introduction
Among existing image retrieval systems, there are two main categories of methods used to search for images: methods based on textual information and those based on visual information. The first category relies on the textual metadata of each image to search by keywords. The second relies on the content of each image to search for images similar to those given by the user.
‡Corresponding author.
International Journal of Pattern Recognition and Artificial Intelligence
Vol. 29, No. 6 (2015) 1555010 (37 pages)
© World Scientific Publishing Company
DOI: 10.1142/S0218001415550101
The text and the visual content correspond to different semantic levels. The text carries more semantic information, while the visual content is more perceptual. These two types of information are complementary and provide very different ways to search for images. Searching by keywords generally provides better results than searching by content in terms of response time and accuracy. Moreover, forming a query by examples is more difficult than forming a query by keywords, since the user must provide example images that are not always available and not always representative of all the user's intentions. This is a major problem in Content-Based Image Retrieval (CBIR) systems. However, to perform query by keywords, annotations must be available for the images. This approach requires a priori annotation of the image database, a very laborious, time-consuming and often subjective task. Search by content is eventually necessary when text annotations are missing or incomplete. Furthermore, content-based retrieval can potentially improve accuracy even when prior textual annotation exists, particularly by exploiting the informational content of the images.
In the context of specialized applications for image retrieval, there are many types of complex systems in which the image database evolves in real time, generally starting with zero or weak knowledge. This hypothesis is true for many applications of supervision or control, e.g. video surveillance, video monitoring, or natural disaster management. In most of these applications, the image volume is not fixed, and the database typically contains old images and new incoming images, images which can be already processed/indexed, and images waiting to be processed and indexed. In this type of application, we have identified three different possible query types for image retrieval: (1) example image, (2) keyword, (3) example image and keyword combined. The last two types raise many problems, especially when all or part of the image database is not annotated, making that part inaccessible through textual queries. Moreover, most automatic knowledge learning methods give poor subjective performance. In this applicative context, the interaction between users, domain experts and the system can be used to improve system knowledge, simply by clicking on a few relevant/irrelevant images. In this work, therefore, we study (1) the interactive learning of associations between visual features and keywords and (2) the use of these associations for two applications. Using these associations, we can propagate annotations to unlabeled images in the database (application: image annotation). The associations also help to represent a textual query by visual features, therefore giving our system the ability to retrieve unlabeled images or new incoming images by textual query (application: image retrieval by textual query or by mixed image-text query).
From a user's perspective, our work focuses on a user-oriented image retrieval system. The main objective of an image retrieval system is to provide effective tools for browsing and searching, and it is essential for the system design to be centered on the human user. We believe that understanding the user's intentions plays a key role in an image retrieval system.
We have identified three interaction levels between users and the image retrieval system:
• Level 0: The user has no well-defined intention at the beginning and just wants to explore the collection of images and select images that he likes.
• Level 1: The user is very clear about what he wants from the system. During an exploration and retrieval session, he provides feedback to the system to quickly reach a satisfying final result.
• Level 2: The user is very clear about what he wants from the system. By providing feedback, he wants the system to learn this knowledge and memorize it for reuse in future sessions.
In our work, we suppose that the user is an expert who has knowledge about the data and the domain of the system. From long-term user relevance feedback, the system knowledge can be learnt from the early life of the system, without prior knowledge.
From a scientific perspective, two main issues are studied in this paper. The first issue is to link content-based and keyword-based image retrieval: low-level content-based retrieval is coupled with high-level keyword-based queries from the user. The second issue concerns non- or poorly-annotated image databases. Usually, an image retrieval system works either without knowledge or with a priori knowledge. With the proposed learning method, the system knowledge can be constructed from zero. The knowledge of the system is based on the annotations (images represented by textual keywords) but also on the visual content, in which keywords are translated into visual features (text represented by images).
We present and discuss related work in Sec. 2. We propose an original model, named BoK ("Bag of KVRs", where KVR stands for Keyword Visual Representation), which represents associations between semantic concepts and visual features on top of the well-known Bag of Words (BoW) model24 (Sec. 3). An interactive and incremental learning method is proposed for building the associations (Sec. 4). This model not only helps to improve the performance of image annotation and image retrieval, but also allows retrieving images using textual queries in non- or poorly-annotated image databases (Sec. 6).
2 Related Work
The co-occurrence model proposed by Mori et al.20 represents the first approach to associations between text and image. First, the images of the training set are divided into regions that inherit all the keywords of the original images from which they come. Visual descriptors are then extracted from each region. All descriptors are clustered into a number of groups, each of which is represented by its center of gravity. Finally, the probability of each keyword for each of the region groups can be measured.
Duygulu et al.5 proposed a translation model to represent the relationships between text and content. In their view, visual features and text are two languages that can be translated from one to the other. Thanks to a translation table holding estimates of the probability of translation between image regions and keywords, an image is annotated by choosing the most probable keyword for each of its regions.

Barnard et al.1 extended the translation model of Duygulu et al.5 to a hierarchical model. It combines the "aspect" model,9 which builds a joint distribution of documents and features, with a soft clustering model,1 which maps documents into clusters. Images and text are generated by nodes arranged in a tree structure. The nodes generate image regions using a Gaussian distribution, and keywords using a multinomial distribution.
Jeon et al.11 suggested improvements to the results of Duygulu et al.5 by introducing a language generation model called the Cross-Media Relevance Model (CMRM). First, they use the same process as Duygulu et al. for computing the representation of images (represented by blobs). Then, whereas Duygulu et al. made the assumption that there is a one-to-one correspondence between regions and words, Jeon et al. assume that a set of blobs is related only to a set of words. Thus, instead of seeking a probabilistic translation table, CMRM simply calculates the probability of observing a set of blobs and keywords in a given image.
Lavrenko et al.13 showed that avoiding the feature quantization step, by using a Continuous-space Relevance Model (CRM), prevents the loss of information caused by the construction of the dictionary in the CMRM model.11 Using continuous probability densities to estimate the probability of observing a particular region in an image, they showed that, on the same dataset, the model is much more efficient than the models proposed by Duygulu et al.5 and Jeon et al.11

Some studies have attempted to use the LSA technique for combining visual and textual features, including Hofmann9 and Monay and Gatica-Perez,19 who applied Probabilistic Latent Semantic Analysis to automatic image annotation. With this approach, text and visual features are considered as "terms". It assumes that each term may come from a number of latent topics, and each image can contain multiple topics.
In the transformation model,15 a text query is automatically converted into visual representations for image retrieval. First, the relationships between text and images are learnt from a set of images annotated with text descriptions; a trans-media dictionary, similar to a bilingual dictionary, is built on the training set. Chang and Chen3 propose to do the opposite, which is to translate an image query into a text query. Based on both textual and visual queries, the authors transform visual queries into textual queries and acquire new textual queries. After that, they apply text retrieval techniques to deal with the initial textual queries and the new textual queries constructed from the visual query. Finally, they merge the results.
Recently, nearest neighbor methods, which treat image annotation as an image retrieval problem, have received more attention. Makadia et al.17 introduce a baseline technique that transfers keywords to an image from its nearest neighbors. A combination of basic distances over low-level image features, called Joint Equal Contribution (JEC), is used to find the nearest neighbors; the keywords are then assigned using a greedy label transfer mechanism. A more complex nearest-neighbor-type model called TagProp is proposed by Guillaumin et al.8 This model combines a weighted nearest-neighbor approach with metric learning capabilities in a discriminative framework, integrating metric learning by directly maximizing the log-likelihood of the tag predictions in the training set. In Ref. 29, the authors propose to use both similar and dissimilar images together with a group-sparsity-based feature selection method. The paper provides an effective way to select features for the image annotation task, which had not been well investigated before.
2.1 Open issues

Dependence on image segmentation. Most of the association models presented above rely on segmenting images into regions, which is an extremely delicate operation in general.
Unavailability of the transformation from text to images. The transformation from text to images is very useful for nonannotated image retrieval with a textual query: we can search an image database which is not annotated, or only partially annotated, by turning the textual query into a visual query. Most methods of text/image association are designed to transform low-level visual features into keywords (image annotation). Only the transformation model15 proposes the conversion from text to images. This model provides a visual representation of keywords using mutual information between text and blobs, offering the ability to annotate images and to search for images by text. However, one disadvantage of this model is the problem of image segmentation.

Constraints on the availability of knowledge. Whatever the text/image association model may be, the existence of a priori knowledge, often represented as annotations, is absolutely essential for the learning phase. This phase is extremely time consuming for the user, and particularly complex for specialized applications. The association learning phase is mostly off-line and the problem is notably difficult, especially when the knowledge evolves, for example, to integrate new knowledge. The performance of these models is not easy to improve. The problem of developing approaches to enrich the system knowledge is therefore particularly crucial, and these approaches should be possible without requiring off-line computations.
2.2 Our proposed approach
As part of this work, we aim at giving answers to the issues raised in the previous section. First, in order to avoid the dependence on the quality of the segmentation phase, we work in a context without segmentation. Second, in order to support image retrieval by textual queries independently of any manual annotation, we propose to add a bidirectional transformation between text and image. Finally, we place ourselves in a system with incremental knowledge learning, which requires no particular knowledge at the beginning of the life of the system. This constraint seems essential, but it is also realistic, because most applications do not have specialized knowledge in their early life.
In our model, text/image associations are learnt by an incremental learning method via relevance feedback, without any knowledge at first. Unlike other models where prior knowledge is available, in our system knowledge comes from user interactions. Therefore, our system knowledge is progressively improved over time through interactions, without requiring any off-line learning stage.
We first summarize here the assumptions on which we rely for the development of our model and its context of use:
• the work is done on a large image database without prior knowledge;
• the volume of images is not fixed, and new incoming images are added over time;
• the system knowledge is based on the annotation of images (images represented by text keywords) plus some learnt representation of keywords (text represented by visual features);
• interactive learning can be done in a reinforcement and/or incremental way;
• the interaction between the users, the domain experts and the system to improve overall system knowledge should be done through simple clicks on relevant/irrelevant images;
• there is very little training data (reinforcement/incremental); the number of images clicked at each interaction must be low (maximum 20);
• image annotation propagation is performed in real time.
Table 1. Comparison of existing text/image association models with our proposed model described in this paper.

System                         Image-to-Text   Text-to-Image   Multimodal Retrieval   Knowledge Source
Latent Semantic Analysis9,19   Yes             No              Yes                    a priori
Transformation model3,15       Yes             Yes             Yes                    a priori + WordNet
Table 1 gives a comparison between all the presented text/image association models and our model.
3 A Bidirectional Association Model between Text and Image
In this section, we propose an association model between text and image, referred to as the "BoK model" (Fig. 3), where a KVR is a possible visual representation for a keyword. To avoid image segmentation, we use the well-known BoW model to represent images. In this model, an image is not represented using regions but using points of interest (detailed in the next section). Another reason for using the BoW model is its effectiveness, confirmed by current trends of research on this model.24,25,27
3.1 Keyword Visual Representation
A KVR is an extended definition of the Bag of Words (Bag of visual Words) representation.14,24 In the BoW representation of an image, visual words are based on image features such as interest points, regions, etc. A dictionary of visual words is built using a clustering method over a big set of features, where each visual word represents a group of similar features. An image is represented as a histogram of visual words (a BoW). In our work, we use the SIFT descriptor16; the dictionary is constructed using the k-means method and the BoW representation is based on the TF-IDF weighting scheme.

Basically, a KVR is a BoW representation of a region (or of a set of similar regions) corresponding to a keyword. For example, let us consider a BoW $V_I$ containing all the $n$ visual words $v_i$ of an image $I$:

$$V_I = (v_1, v_2, \ldots, v_n), \quad v_i \in I.$$

Let us suppose that this image $I$ has $N$ different regions $R_1, \ldots, R_N$ which correspond to $N$ keywords (objects) $K_1, \ldots, K_N$. Then we can divide $V_I$ into $N$ different BoWs $V_{I_1}, \ldots, V_{I_N}$, respectively corresponding to $K_1, \ldots, K_N$:

$$V_{I_1} \cup V_{I_2} \cup \cdots \cup V_{I_N} = V_I.$$

We consider $V_{I_i}$ as a possible visual representation for keyword $K_i$ (a KVR of $K_i$). The KVR construction is easy if regions are available (i.e. with image segmentation, see Fig. 1). However, due to the problem of image segmentation, our approach tries to find the representative visual words (KVR) of each concept in an image without segmenting it. The construction of KVR and BoK in our system is presented in the following sections.
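To make this pipeline concrete, the following minimal sketch builds the visual-word dictionary with k-means over SIFT descriptors and turns each image into a TF-IDF weighted BoW histogram. It assumes OpenCV and scikit-learn are available, and the helper names are ours, not the paper's.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_sift(image_paths):
    """Extract SIFT descriptors for every image (one array of 128-D rows per image)."""
    sift = cv2.SIFT_create()
    all_desc = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        all_desc.append(desc if desc is not None else np.empty((0, 128), np.float32))
    return all_desc

def build_dictionary(descriptors_per_image, n_words=1000):
    """Cluster all descriptors into n_words visual words (the dictionary)."""
    stacked = np.vstack(descriptors_per_image)
    return KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(stacked)

def bow_histograms(descriptors_per_image, kmeans):
    """Raw visual-word counts for each image."""
    n_words = kmeans.n_clusters
    counts = np.zeros((len(descriptors_per_image), n_words))
    for i, desc in enumerate(descriptors_per_image):
        if len(desc) == 0:
            continue
        words = kmeans.predict(desc)
        counts[i] = np.bincount(words, minlength=n_words)
    return counts

def tfidf(counts):
    """TF-IDF weighting of the count matrix (images x visual words)."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = np.count_nonzero(counts, axis=0)
    idf = np.log(len(counts) / np.maximum(df, 1))
    return tf * idf
```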
3.2 BoK model
A KVR is a BoW representation of a region corresponding to a keyword. Considering the fact that there exist several visual representations for one keyword, a keyword can correspond to several regions of an image, as one can see in Fig. 2. In Fig. 2, the word "sky" can be interpreted as three different types of sky: clear sky (blue), sky with clouds (white) and sunset (red sky). In our model, the BoK is created with the assumption that a keyword matches one or more different image regions. A keyword is then represented by a BoK in which a KVR (a bag of visual words) corresponds to a set of similar regions.

Fig. 1. Image with different regions corresponding to different concepts (Sky, Helicopter, Human, Sea). Each region can be represented by a Bag of visual Words, i.e. a possible KVR for the concept.

Fig. 2. The BoK representation. The keyword "Sky" could be one of three types: "Clear", "Cloudy" and "Sunset". The keyword "Sky" is then represented by a bag containing three corresponding KVRs.
To construct a KVR from a set of similar regions $S_r$, we build the BoW which includes the most frequent visual words in $S_r$. In fact, not all visual words have a real meaning, and only a number of them are really significant for characterizing a concept. In order to tackle this problem, we can use two methods. First, we can use a simple threshold to identify the most frequent visual words; this method is used in our experiments. To be more robust, we can rely on the Zipf distribution of visual word frequencies in $S_r$; this method is presented in our contribution in Ref. 22.
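For the simple-threshold variant, a KVR can be obtained by keeping only the visual words whose relative frequency in $S_r$ exceeds a threshold; the sketch below illustrates this (the threshold value and function name are ours, for illustration).

```python
import numpy as np

def kvr_from_regions(bows_of_similar_regions, freq_threshold=0.02):
    """Build a KVR from a set of similar regions: keep only the most frequent
    visual words (simple-threshold variant; the Zipf-based variant is in Ref. 22)."""
    counts = np.asarray(bows_of_similar_regions, dtype=float).sum(axis=0)
    freq = counts / max(counts.sum(), 1.0)
    kvr = np.where(freq >= freq_threshold, counts, 0.0)  # drop rare visual words
    return kvr
```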
The visual similarity between two keywords, or between a keyword and an image, is the visual similarity between two Bags of KVRs. To compare the visual similarity between two Bags of KVRs, we define a similarity function as follows. Consider two Bags of KVRs $B_1$ and $B_2$:

$$B_i = \{\mathrm{KVR}^i_1, \ldots, \mathrm{KVR}^i_{k_i}\},$$

where $\mathrm{KVR}^i_j$ is the $j$th KVR of $B_i$, and $k_1$, $k_2$ correspond to the numbers of KVRs in $B_1$ and $B_2$. The visual similarity between $B_1$ and $B_2$ is defined as:

$$\mathrm{Sim\_visual}(B_1, B_2) = \max_{i,j}\big(\mathrm{Sim\_visual}(\mathrm{KVR}^1_i, \mathrm{KVR}^2_j)\big). \quad (1)$$
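As an illustration, here is a minimal sketch of this bag-level similarity, assuming, as our own choice for the example, that the similarity between two individual KVRs is the cosine similarity of their BoW vectors.

```python
import numpy as np

def kvr_similarity(kvr_a, kvr_b):
    """Similarity between two KVRs, taken here as cosine similarity of BoW vectors
    (the base KVR-to-KVR measure is our assumption for illustration)."""
    a, b = np.asarray(kvr_a, dtype=float), np.asarray(kvr_b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def sim_visual(bag1, bag2):
    """Eq. (1): similarity of two Bags of KVRs = best match over all KVR pairs."""
    return max(kvr_similarity(a, b) for a in bag1 for b in bag2)
```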
As shown in Fig. 3, the BoK representation can be summarized as follows:

(1) Each image is represented by a bag of visual words (BoW model).
(2) Each group of similar regions of the images in a category is represented by a BoW that we call a KVR.
(3) With the assumption that a keyword matches one or more regions in the images, a keyword is represented by a BoK.
The BoK representation is used for image annotation and image retrieval. It is particularly effective when the image annotations in the database are insufficient, since textual queries must then be represented by visual features in order to reach unannotated images (transforming keywords into images is the second issue raised in the state of the art, see Sec. 2.1). The BoK is a transformation model, like the model of Ref. 15. While that model15 uses mutual information and image segmentation to transform a textual query into a visual query, our model takes advantage of the efficiency and simplicity of the BoW model to represent the textual query by visual features, which in this case is the BoW representation. Thus, our model benefits from the efficiency of the BoW model and avoids the problem of image segmentation.
3.3 KVR operators

A KVR is constructed from a set of similar regions, which is updated during the use of the system. Therefore, during the learning of the BoK through interactions with users, KVRs in a bag can be merged into a new KVR, or a KVR can be divided into two KVRs. These actions are based on the similarity between KVRs, which we evaluate with an EQUAL test between two KVRs. For manipulating KVRs, we define four operators: ADD, EQUAL, MERGE and SPLIT. The four operators are used by the incremental/reinforcement learning of the BoK model in the next section. The operators ADD, MERGE and SPLIT are used for the revision of the BoK model, while the EQUAL operator is used for the analysis of the BoK model. We describe the proposed algorithm and the use of the operators in detail in the next section.
ADD. The ADD operator is used to add a KVR into a bag as long as it is different from all existing KVRs in the bag. This condition is verified by the EQUAL operator below.
Fig. 3. The BoK model for the text/image association (our contribution in red, dashed lines), based on the existing BoW representation (blue, continuous lines) (color online).

EQUAL. The EQUAL operator is used to determine whether two KVRs are considered close enough to be combined. We propose to use a standard equivalent of the Ward criterion26: the increase in variance. This criterion26 is widely used for hierarchical clustering, and it is frequently reported to outperform other methods in several comparative studies.10

Supposing $S_1$ and $S_2$ are the sets of similar images that correspond to $\mathrm{KVR}_1$ and $\mathrm{KVR}_2$, $\mathrm{EQUAL}(\mathrm{KVR}_1, \mathrm{KVR}_2) = 1$ if:

$$\mathrm{Variance}(S_1, S_2) \leq \alpha\,\mathrm{Variance}(S_1) + \beta\,\mathrm{Variance}(S_2), \quad (2)$$

where $\mathrm{Variance}(S_1, S_2)$ denotes the variance of the BoW representations of the images in $S_1 \cup S_2$, $\mathrm{Variance}(S_i)$ is the variance of the BoW representations of the images in $S_i$, and $\alpha$, $\beta$ are the weights relative to the two KVRs ($\alpha + \beta = 1$). In our experiments $\alpha = \beta = 0.5$.
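A minimal sketch of this test follows (our illustration: Variance is taken as the mean squared distance of the BoW vectors to their centroid, which is one standard reading of this Ward-style criterion).

```python
import numpy as np

def variance(bows):
    """Mean squared distance of BoW vectors (rows) to their centroid."""
    bows = np.asarray(bows, dtype=float)
    centroid = bows.mean(axis=0)
    return float(((bows - centroid) ** 2).sum(axis=1).mean())

def equal(s1, s2, alpha=0.5, beta=0.5):
    """EQUAL operator, Eq. (2): the two KVRs are 'equal' (mergeable) if pooling
    their supporting image sets does not increase the weighted variance."""
    pooled = np.vstack([s1, s2])
    return variance(pooled) <= alpha * variance(s1) + beta * variance(s2)
```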
MERGE. The MERGE operator is used to merge two similar KVRs. The merger of $\mathrm{KVR}_1$ and $\mathrm{KVR}_2$ is based on the number of elements in $S_1$ and $S_2$, a criterion that defines their importance, with the assumption that a KVR represented by a larger number of image regions is more important than a KVR represented by fewer image regions:

$$\mathrm{KVR}_3 = \mathrm{MERGE}(\mathrm{KVR}_1, \mathrm{KVR}_2), \quad \text{with} \quad \mathrm{KVR}_3 = \frac{n_1\,\mathrm{KVR}_1 + n_2\,\mathrm{KVR}_2}{n_1 + n_2}, \quad (3)$$

where $n_1$ and $n_2$ are the numbers of elements in $S_1$ and $S_2$.
SPLIT. The SPLIT operator is used to split a KVR into two KVRs or more, in order to recalculate a BoK. The division of one KVR, or of the whole BoK, can be done when there are changes. The division of a KVR into two KVRs takes place only after the merger of two KVRs (MERGE operator), according to a given criterion. Using a criterion similar to the merge condition but with different parameters, the condition of division is based on the variance of the KVR:

$$\mathrm{Variance}(\mathrm{KVR}_1) > \gamma\,\mathrm{Variance}(\mathrm{KVR}_2) + (1 - \gamma)\,\mathrm{Variance}(\mathrm{KVR}_3), \quad (4)$$

with $\gamma = n_1/(n_1 + n_2)$.

$\mathrm{KVR}_1$ is divided into two other KVRs using a clustering method. In our experiments, we have compared adaptive k-means12 and competitive agglomeration.7 Competitive agglomeration gives slightly better performance than adaptive k-means; however, we observe that the choice of clustering method does not affect performance much:

$$\mathrm{SPLIT}(\mathrm{KVR}_1) = \mathrm{CompetitiveAgglomeration}(S_1) = \mathrm{KVR}_1 + \mathrm{KVR}_2.$$
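To make the revision operators concrete, here is a brief sketch of MERGE (Eq. (3)) and of one SPLIT step; scikit-learn's KMeans with two clusters stands in for competitive agglomeration, purely for illustration, and the helper names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def merge(kvr1, n1, kvr2, n2):
    """MERGE operator, Eq. (3): weighted average of the two KVR vectors,
    weighted by the sizes of their supporting image sets."""
    return (n1 * np.asarray(kvr1) + n2 * np.asarray(kvr2)) / (n1 + n2)

def split(s1):
    """SPLIT a KVR by re-clustering its supporting BoW vectors into two groups
    (KMeans used here in place of competitive agglomeration, as an illustration)."""
    s1 = np.asarray(s1, dtype=float)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(s1)
    group_a, group_b = s1[labels == 0], s1[labels == 1]
    # Each new KVR is represented by the mean BoW of its group.
    return group_a.mean(axis=0), group_b.mean(axis=0)
```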
4 Interactive and Incremental Knowledge Learning

Before discussing BoK learning, we summarize here the context of our model. Let us recall the assumptions: the image database is big, it evolves over time, and it starts without prior knowledge. We devote this section to knowledge learning from user interactions. Three types of knowledge can be distinguished in our system:
(1) knowledge coming from humans and related to manual annotation: images are assigned keywords by experts during interactions;
(2) knowledge coming from the machine and related to annotation propagation: more images in the database are assigned keywords dynamically using the BoK;
(3) the "image level" knowledge: the BoK itself, learnt from the annotations.
4.1 Incremental knowledge learning
We have chosen a reinforcement incremental learning method to address the problem identified in the literature: the required availability of a priori knowledge. In contrast to other text/image association models that require a priori knowledge, our model learns the BoK of keywords without requiring prior knowledge. Knowledge in our model comes from the interactions of users/experts during image exploration and image retrieval.
BoK learning is illustrated in Fig. 4, including an incremental learning loop and an interaction loop between users/experts and the machine for each learning phase. After an interaction loop, the BoK and the annotations of images are updated incrementally.
Fig. 4. Start-up phase for BoK learning. The big loop in dashed blue lines is BoK learning and annotation refinement. The small loop in continuous red lines is interactions with users (color online).
We identify two phases for the incremental learning: the start-up phase and the update phase.
Start-up phase. Initially, we assume that the database does not contain any annotation. Users only search for images by content. By exploring images, they may find some of them interesting and want to tag these images with keywords, which could be reusable by themselves or by others. The system uses this information to create new BoKs.
Let us consider $S_R$, a set of images relevant to a keyword $K$, given by the user. Manually identifying distinct groups of images would require many interactions from the user, which is difficult to accept. We therefore propose to use a clustering method to cluster the images automatically into $n$ different groups representing the keyword $K$:

$$\mathrm{Clustering}(S_R) \Rightarrow (c_1, c_2, \ldots, c_n).$$

A KVR is then built from each cluster using Rocchio's formula23:

$$\mathrm{KVR}_i = \alpha\,\mathrm{KVR\_existing} + \beta\,\frac{1}{|c_i|}\sum_{j \in c_i} \mathrm{BoW}_j,$$

where $\mathrm{KVR}_i$ is the KVR corresponding to the image cluster $c_i$, $\mathrm{KVR\_existing}$ is the existing KVR (in this case zero), $|c_i|$ is the number of images in cluster $c_i$, and $\mathrm{BoW}_j$ is the Bag of Words corresponding to image $j$ in $c_i$. In this case $\alpha = 0$ and $\beta = 1$.

Figure 5 shows an example of the start-up phase for learning the keyword "sky". First, by exploring images the user can find some images and mark them with the keyword "sky". In Fig. 5, we can see that the user can find images of different types that all remain consistent with the keyword "sky": sunset sky, blue sky, clear sky. A clustering method is used to cluster these images automatically into three different groups representing the keyword "sky", which makes the task easier for the user. Through this clustering strategy, the KVRs in the BoK of "sky" are built using Rocchio's formula.23
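A condensed sketch of this start-up step is given below (illustrative only; here the number of clusters is fixed and scikit-learn's KMeans is used, whereas the system described below selects the number of groups automatically, and the function names are ours).

```python
import numpy as np
from sklearn.cluster import KMeans

def startup_bok(relevant_bows, n_clusters, alpha=0.0, beta=1.0):
    """Start-up phase: cluster the BoWs of images tagged with a keyword and
    build one KVR per cluster with Rocchio's formula (existing KVR is zero here)."""
    bows = np.asarray(relevant_bows, dtype=float)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(bows)
    bok = []
    for c in range(n_clusters):
        cluster = bows[labels == c]
        kvr_existing = np.zeros(bows.shape[1])
        kvr = alpha * kvr_existing + beta * cluster.mean(axis=0)  # Rocchio, alpha=0, beta=1
        bok.append(kvr)
    return bok  # the Bag of KVRs for this keyword
```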
Update phase. For this phase, we assume that manual/automatic annotations and the BoKs of keywords are already available in the database. The user can perform image retrieval by content and by text. Based on the user interaction identifying images that are relevant/irrelevant to the query, the BoK and the image annotations are then updated.

Let us consider $S_R$ and $S_I$, the sets of relevant and irrelevant images for a keyword $K$ given by the user. Using a classification method, we group $S_R$ and $S_I$ into $n$ clusters corresponding to the $n$ existing image clusters of $K$:

$$\mathrm{Classification}(S_R, S_I) \Rightarrow (c_1, c_2, \ldots, c_n).$$

From these $n$ clusters, new KVRs for $K$ are eventually created using Rocchio's formula.23 The BoK for $K$ is then updated from the new KVRs: a merge of two KVRs is made if a new KVR is similar to an existing KVR in the BoK, using the MERGE operator (see Sec. 3.3); otherwise, the new KVR is added to the BoK (ADD operator).
Figures 6 and 7 show examples of the update phase for learning the BoK corresponding to the keyword "sky". Assuming that the keyword "sky" already has a BoK and that we want to update it, the user can modify or validate retrieval results using a textual query or a query mixing text and images. With the relevant/irrelevant images selected by the user during relevance feedback, new KVRs for "sky" are eventually created. In Fig. 6, the new KVR for "sky" is similar to a KVR in the existing BoK, so the two KVRs are merged. In Fig. 7, the new KVR is different from all existing KVRs, so it is added to the BoK.
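The following fragment sketches this decision step; it reuses the equal and merge helpers from the sketch in Sec. 3.3 and all names are ours, so it illustrates the update logic rather than the paper's implementation.

```python
def update_bok(bok, support_sets, new_kvr, new_support):
    """Update phase: merge the new KVR with a similar existing one, otherwise add it.
    bok: list of KVR vectors; support_sets: list of BoW sets backing each KVR."""
    for i, (kvr, s_existing) in enumerate(zip(bok, support_sets)):
        if equal(s_existing, new_support):                     # EQUAL test, Eq. (2)
            bok[i] = merge(kvr, len(s_existing), new_kvr, len(new_support))
            support_sets[i] = list(s_existing) + list(new_support)
            return bok, support_sets
    bok.append(new_kvr)                                        # ADD operator
    support_sets.append(list(new_support))
    return bok, support_sets
```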
Fig. 5. Start-up phase for BoK learning.
Trang 15In our method, an important step is clustering Clustering is used to clusterrelevant/irrelevant images into separate groups These distinct groups are used toconstruct di®erent KVRs The quality of the clustering result determines the quality
of the KVRs In our system, as mentioned above, the number of groups is unknown
We are interested in clustering methods which can determine the number of groupsindependently To do this, we tested two clustering methods: adaptivek-means12and
Fig 6 Update phase for BoK learning: Merge of two KVRs (merge of the two image groups of the two KVRs).
Fig 7 Update phase for BoK learning: Addition of a new KVR.
Trang 16competitive agglomeration.7 These two methods were chosen for their ability toautomatically determine the number of groups In our experiment, better results areobtained using the competitive agglomeration method.
4.2 Algorithm
In our proposed learning, the model is updated whenever a new significant experience (a session of interaction, relevance feedback) becomes available. The model is updated over time (continuously) as the experiences happen in sequence. This feature shows the "incremental" character of the learning. Since the learning uses reinforcement information given by user interactions (the relevant/irrelevant images of the feedback), we consider this learning a form of reinforcement learning.
We have identified two phases for the learning process. In this section, we propose a learning algorithm corresponding to the two phases presented above. In the start-up phase, the initial KVRs are constructed from the user's information, which corresponds to images labeled with keywords provided during user interaction. In the update phase, the information provided (relevant/irrelevant images, together with the previous images used in the start-up phase) is used to update the KVRs. For each new experience, the learning algorithm performs two tasks, model analysis and model revision, using the operators presented in Sec. 3.3: the analysis task checks the conditions for the update operations, and the revision task executes the operators to update the model.
With the assumption that there is no prior knowledge at the beginning, the users can do nothing but explore images or search for images by content. By exploring images, users can find some interesting ones and label them with keywords, reusable by themselves or by other users. The images labeled with a keyword $K$ are clustered into $k$ groups using a clustering method. A query point is created in feature space from each group by using a relevance feedback technique.21 The multi-point query for $K$ is updated until the user is satisfied. We consider each point of the last query as a KVR (New_KVR) of the keyword $K$.

Once Bags of KVRs of keywords have been created, the system can search for images using these keywords. In this case, users can use an image query, a mixed text/image query or a keyword-based query (Fig. 6). In our experiments, we have limited the number of keywords present in a query $Q$ to five. Users provide relevance feedback for $Q$, in the form of relevant/irrelevant images for one or several keywords in $Q$.21 At the end of each interaction session, the query $Q$ has (up to) $k$ points in feature space, and each point of $Q$ is a possible visual representation (New_KVR) for a keyword. At this stage, we assume that a keyword $K$ has $N$ KVRs. The BoK of $K$ is then updated using the KVR operators: the new candidate KVR of the keyword $K$ is compared with all the KVRs existing in the BoK of $K$; if New_KVR is not considerably distinct from an existing KVR, the two KVRs are merged into a single KVR; conversely, if New_KVR is significantly different from all the existing KVRs, it is added as a new KVR of the keyword $K$.
5 Application of BoK
5.1 Automatic annotation propagation
Annotation propagation is used to label nonannotated images with keywords. While manual annotation requires a lot of effort from users, annotation propagation can be performed automatically. In our system, annotation propagation is updated whenever Bags of KVRs are updated. Thus, when a KVR of a keyword $K$ is updated, the similarities between the keyword $K$ and the nonannotated images are recalculated. For annotating images, we make the hypothesis that only a few keywords (up to five) carry significant meaning regarding the content of an image in general. We have therefore limited the automatic annotation of images to a maximum of five keywords per image (it can be fewer, but no more).

So, if the keyword $K$ is among the five keywords most relevant to an image (the closest in the KVR space), then the keyword $K$ is assigned to this image, and the sixth most relevant keyword is dropped. Thus, each image always has at most its five most relevant keywords as annotation, and the annotation improves as the BoKs of keywords evolve.
We have improved the annotation propagation by incorporating the correlations between keywords into the similarity function between a KVR and an image.
Algorithm

Input: A visual query Q_v or a textual query Q_t
Output: Updated BoK

Begin
  Step 1. CBIR
  Step 2. Relevance feedback (until the user is satisfied)
    The user assigns N images to keyword M
    Cluster the N images into k groups
    Compute the k new sub-queries using Rocchio's technique:
      Sub_query_i = Modify_query(Q, Rocchio's technique)
      KVR_new_i = the last sub-query Sub_query_i
  Step 3. Update (analysis and revision)
    For each new KVR KVR_new:
      For all KVRs of keyword M, KVR_existed:
        If EQUAL(KVR_new, KVR_existed) = 1
          KVR_merge = MERGE(KVR_new, KVR_existed)
          If KVR_merge can be divided
            SPLIT(KVR_merge)
        Else
          ADD(KVR_new)
  Step 4. Go back to Step 1
End
In our case, the similarity between a KVR and an image is calculated based on the correlation between keywords. The correlation between two keywords $K_1$ and $K_2$ is the probability $P(K_1, K_2)$ that these two keywords appear as annotations of the same image. These pairwise probabilities can be calculated by learning from a set of training data or by using an ontology such as WordNet.a Based on the integration of correlations between keywords, the similarity between an image and a KVR is calculated as follows:

(1) The nearest keyword $K_1$ has the following similarity with the KVR:
$$\mathrm{Sim}(K_1, \mathrm{KVR}) = \mathrm{Sim\_visual}(K_1, \mathrm{KVR}),$$
where $\mathrm{Sim\_visual}(K_1, \mathrm{KVR})$ is calculated using Eq. (1).

(2) The next keyword $K_n$ has the similarity:
$$\mathrm{Sim}(K_n, \mathrm{KVR}) = \mathrm{Sim\_visual}(K_n, \mathrm{KVR}) \prod_{i=1}^{n-1} P(K_i, K_n).$$
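As an illustration of this propagation step, here is a compact sketch; it reuses the sim_visual helper sketched in Sec. 3.2, treats the image as a one-KVR bag, and assumes the pairwise probabilities P(K_i, K_j) are given in a dictionary (all function and variable names are ours).

```python
def annotate_image(image_bow, boks, pair_prob, max_keywords=5):
    """Greedy annotation of one image with at most five keywords, combining the
    visual similarity of Eq. (1) with keyword co-occurrence probabilities."""
    # Visual similarity of every keyword to the image (the image is a one-KVR bag).
    visual = {kw: sim_visual([image_bow], bag) for kw, bag in boks.items() if bag}
    chosen = []
    while len(chosen) < max_keywords and len(chosen) < len(visual):
        best_kw, best_score = None, -1.0
        for kw, v in visual.items():
            if kw in chosen:
                continue
            corr = 1.0
            for prev in chosen:  # product of P(K_i, K_n) over already chosen keywords
                corr *= pair_prob.get((prev, kw), pair_prob.get((kw, prev), 1.0))
            score = v * corr
            if score > best_score:
                best_kw, best_score = kw, score
        chosen.append(best_kw)
    return chosen
```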
5.2 Image retrieval using textual query
The above-mentioned problems of image retrieval, namely missing annotations and query formation, can be solved by using the BoK model. In our system, we can perform image retrieval using textual queries even when textual information is not initially available in the database. By taking advantage of user interaction, the system builds its knowledge, in other words annotations at different levels, which evolves over time and allows users to keep using the system. We can automatically transform a textual query into a visual query by using the BoK (Fig. 8). Thus, for a partially annotated image database, or a new database without knowledge/annotation, we can use textual queries as soon as the first KVRs are built (that is to say, from the partial annotation obtained in the early interactions). Initially, for an image database without annotation, with few interactions performed and therefore few KVRs built, the results are obviously not the best, but they have the merit of existing and of offering the ability to use the system from scratch. In addition, these results improve gradually along with the use of the system (that is to say, with the construction and refinement of KVRs). As more interaction sessions are done, more images are manually annotated, more annotations are propagated to the image database, and better results are obtained from textual querying.
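To illustrate the text-to-visual transformation, here is a minimal sketch (again reusing the sim_visual helper sketched earlier; names and ranking details are ours): the textual query is replaced by the union of the KVRs learnt for its keywords, and images, including unannotated ones, are ranked by their visual similarity to this visual query.

```python
def retrieve_by_text(keywords, boks, database_bows, top_k=20):
    """Turn a textual query into a visual query via the BoK model and rank images.
    keywords: query keywords; boks: {keyword: bag of KVRs};
    database_bows: {image_id: BoW vector} for (possibly unannotated) images."""
    # The textual query becomes the union of the KVRs of its keywords.
    visual_query = [kvr for kw in keywords if kw in boks for kvr in boks[kw]]
    if not visual_query:
        return []  # no KVR learnt yet for these keywords
    # Score each image by its bag-level similarity (Eq. (1)) to the visual query.
    scores = {img: sim_visual([bow], visual_query) for img, bow in database_bows.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```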
a WordNet: http://wordnet.princeton.edu/.
Fig. 8. CBIR by using the BoK model.