Adaptive Dual Control of Topic-Based Information Retrieval
A dissertation submitted in partial fulfilment of the requirement for the degree of Doctor of Philosophy
by
Vitaliy Vitsentiy
M.Sc., Ternopil Academy of National Economy, Ukraine
School of Information Technology, Faculty of Science and Technology, Queensland University of Technology
Brisbane, Australia
2009
Keywords
topic-based information retrieval, dual control, stochastic programming
Abstract
This thesis presents an adaptive IR system based on the theory of adaptive dual control. The aim of the approach is the optimization of retrieval precision after all feedback has been issued. This is done by increasing the diversity of retrieved documents. This study shows that the value of recall reflects this diversity.
The Probability Ranking Principle is viewed in the literature as the “bedrock” of current probabilistic Information Retrieval theory. Neither the proposed approach nor other methods of diversification of retrieved documents from the literature conform to this principle. This study shows by counterexample that the Probability Ranking Principle does not in general lead to optimal precision in a search session with feedback (for which it may not have been designed but is actively used).
Retrieval precision of the search session should be optimized with a multistage stochastic programming model to accomplish the aim. However, such models are computationally intractable. Therefore, approximate linear multistage stochastic programming models are derived in this study, where the multistage improvement of the probability distribution is modelled using the proposed feedback correctness method. The proposed optimization models are based on several assumptions, starting with the assumption that Information Retrieval is conducted in units of topics.
The use of clusters is the primary reason why a new method of probability estimation is proposed.
The adaptive dual control of topic-based IR system was evaluated in a series of experiments conducted on the Reuters, Wikipedia and TREC collections of documents. The Wikipedia experiment revealed that the dual control feedback mechanism improves precision and S-recall when all the underlying assumptions are satisfied. In the TREC experiment, this feedback mechanism was compared to a state-of-the-art adaptive IR system based on BM-25 term weighting and the Rocchio relevance feedback algorithm. The baseline system exhibited better effectiveness than the cluster-based optimization model of ADTIR. The main reason for this was the insufficient quality of the generated clusters in the TREC collection, which violated the underlying assumption.
Table of Contents
Keywords
Abstract
Table of Contents
List of Tables
List of Figures
List of Acronyms and Abbreviations
Basic Mathematical Notation
Statement of Original Authorship
Acknowledgements
I Introduction
1 Motivation
Importance of IR
Empirical Evidence of Problems in IR
Uncertainty as the Main Cause of the Problems
Feedback in IR and its Problems
Summary
2 Definition of Adaptive Dual Topic-Based IR
Adaptive Dual IR
Topic-Based IR
The Proposed Vision of IR
Summary
3 A Counterexample to PRP
Relevance for a Minority User
Expected Relevance across all Users
Summary
II Design of the Research
1 Methodology of the Research
Guidelines
Hypotheses of the Research
Contributions
Summary
2 Taxonomy of the Research Problem
Information Retrieval
Adaptive Dual Control and Stochastic Programming
Theory of Algorithms
Artificial Intelligence
Machine Learning
Summary
3 Outline of the Further Narrative
III Review of Probabilistic and Topic-Based IR
1 Probabilistic Approaches to IR
Probability Ranking Principle
Probabilistic Models
Language models
Summary
2 Topic-Based IR
Latent Semantic Analysis
Cluster model
Probabilistic Latent Semantic Analysis
Latent Dirichlet Allocation
Summary
3 Feedback in Adaptive IR
Feedback for Vector Space Models
Feedback for Probabilistic Models
Feedback for Language Models
Feedback for LSA model
Summary
IV Review of Uncertainty-Related Methods
1 Problems of Uncertainty and Diversity in IR
Uncertainty in IR
Diversity in IR
Evaluation of Diversity in IR
Summary
2 Approaches to Tackle Uncertainty and Diversity in IR
Diversity Stimulation
Multicriterion Matching Scores
Active Learning
Reinforcement Learning
Summary
3 Adaptive Dual Control and Stochastic Programming
Adaptive Dual Control
Direct Methods
Indirect Methods
Stochastic Programming
Summary
V Relevance Estimation
1 Probability Estimation
Modelling Probabilities Based on Searched Features
The Language Modelling Approach
The Document Sampling Approach
Probabilistic User-based Model
Summary
2 Expected Relevance
The General Approach
Smoothing
Learning Topic-Relevance and Bias Coefficients
Feedback
Summary
VI Decision Optimization
1 Two-stage Stochastic Program
Optimization in Space of Documents
Optimization in the Space of Clusters
Approximate Formulation
Relaxed Approximate Formulation
Linear Approximate Formulation
Summary
2 Multistage Stochastic Program
Feedback Correctness Approach
Linear Equivalent
Summary
VII Experiments
1 Experimental Design
The Goals
Document Collections
Queries and Relevance Judgments
Baseline Systems
ADTIR Systems
Evaluation Measures
Reuters experiment
Software
Summary
2 Results and Discussion
Reuters Experiment
Wikipedia Experiment
TREC Experiment
Summary
VIII Conclusions
Addressed Problems
Findings
Further Research
A Derivation of Expected Relevance in the Example
B Retrieved Documents for Query “Apple”
C Query Set for Reuters Experiment
D Candidate User-based Model Functions
E An Example of Program Output
Bibliography
List of Tables
Table 1 Estimated relevance of one document
Table 2 Actual relevance of one document
Table 3 PRP-based approach. Case of the user’s given relevant topic
Table 4 Not-PRP-based approach. Case of the user’s given relevant topic
Table 5 Comparison of the results for the user’s given relevant topic
Table 6 Second page PRP-based approach
Table 7 Second page Non-PRP-based approach
Table 8 Document Collections
Table 9 Parameters of the TREC’s experiment baseline system
Table 10 Parameters of the ADTIR systems
Table 11 Topic-relevance and bias coefficients
Table 12 Bias coefficients only
Table 13 Retrieval effectiveness in Wikipedia experiment
Table 14 Retrieval effectiveness in TREC experiment
Table 15 PRP-based approach
Table 16 Not-PRP-based approach
List of Figures
Figure 1 Dynamics of adaptive IR with feedback
Figure 2 IR system as a control system
Figure 3 Adaptive Dual Information Retrieval System as a control system
Figure 4 Algorithm of ADTIR
Figure 5 Comparison of the results for an average user
Figure 6 Graphical model representation of PLSA
Figure 7 Graphical model representation of LDA
Figure 8 Precision in Wikipedia experiment
Figure 9 S-Recall in Wikipedia experiment
Figure 10 Precision on iterations in Wikipedia experiment
Figure 11 S-Recall on iterations in Wikipedia experiment
Figure 12 Precision in TREC experiment
Figure 13 Recall in TREC experiment
Figure 14 Precision on iterations in TREC experiment
Figure 15 Recall on iterations in TREC experiment
List of Acronyms and Abbreviations
ADIR Adaptive Dual Information Retrieval
ADTIR Adaptive Dual Topic-Based Information Retrieval
LSA Latent Semantic Analysis
LSI Latent Semantic Indexing
PDF Probability Density Function
PMF Probability Mass Function
PRP Probability Ranking Principle
SVD Singular Value Decomposition
Basic Mathematical Notation
f a component of function of x i g with coefficient b that models v selection of query terms by the user;
a ′ topic-relevance between topics z and z′ multiplied by the sample bias coefficient and used in the optimization algorithm; usually found from the learning query set;
N set of natural numbers
g ′, y zt z′ number of documents to retrieve from cluster z if, according to feedback, the topic of cluster z′ is relevant;
ρ_u probability of correct feedback on a topic if the number of retrieved pages with a document on a topic is u;
c_u a combination of numbers of retrieved documents on the iterations up to u
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.
Signed: Date
Acknowledgements
We sincerely thank the supervisor of this research, Prof. Peter Bruza, for his continuous support throughout the dissertation work, and especially at those times when the preliminary research results were difficult to convey to other people.
We also thank the associate supervisor, Prof. Amanda Spink, for her encouragement to work on this research subject at QUT. We are grateful to Prof. Anatoly Sachenko and Prof. George Markowsky, with whom we worked previously on this research subject. Furthermore, we are indebted to our proofreaders.
This research has been supported by NICTA and QUT with the PhD scholarship, equipment and an excellent working environment.
For Olya
it is also important but often even more difficult to search for necessary articles. For example, patents are difficult to search because their owners often wish to obfuscate claims in the pursuit of their commercial interests. In any case, it is sometimes difficult for the user to optimally represent his/her Information Need (IN) as a query formulated for the IR system. In such conditions a manual search is inefficient, and so the role of support from the IR system is very important.
Empirical Evidence of Problems in IR
The following empirical evidence of problems in the current level of support from web IR systems provides motivation for this research and illustrates its research problems. Although these familiar issues are evidenced in the literature review as well, these research problems have not yet been solved and are still active research topics within the IR research community.
The first IR problem considered is that of “insufficient diversity”. Take the query “apple” as an example. With such a query on the web, there are usually only documents about Apple computers in the first pages of the retrieved documents and none about the fruit, tree, and other “apple” concepts (see appendix B). The query “apple” is equally suitable for all of these topics, but only documents about Apple computers are retrieved in the top part of the ranking. Either a more precise re-formulation of the query or browsing through further pages down the ranking is necessary to encounter topics related to “apple” other than the computer topic. The resulting web search is more suitable for users interested in Apple computers than for users interested in other topics related to “apple”. There are many web pages about apple fruits, apple trees, apple dishes, and even a political party called “apple”. Their existence indicates that they also have users who may want to search for them. These users are left unsatisfied with the initial query result and must continue the search. So subjective assessment suggests that, when ranking documents for a given query, IR systems retrieve documents suitable for some users and do not take into account all users.
Recent IR research papers [1], as well as sub-chapters IV.1 and IV.2, cover the problem of insufficient diversity and recognize the need for more diverse documents.
Other evidence of the above problem comes from the fact that almost all current document collections for evaluation of IR systems have one number as a relevance judgment for each query and document [1]. That is, each judgment cannot represent all the topics of the query that cover the desired outcomes for different users. The aim of traditional IR systems is to get the best performance for only one topic per query. In order to improve this, document collections for IR evaluation must include graded relevance judgments that would cover a wider selection of relevant topics.
In the case of multitasking IR [2], there are a few relevant topics. In this situation the topics are not alternatives for a given user, as considered in the example above, but are often mutually complementary. Therefore more diverse retrieved documents better satisfy such information searches, even for individual users. For example, in submitting the query “apple”, an individual user may be interested in both fruits and apple recipes.
Often many documents are selected by the IR system as relevant to the query and are retrieved for the user. For example, there are 586,000,000 retrieved documents for the query “apple” (see appendix B). However, there are no documents about the fruit in the first and second pages of retrieved documents and only a few documents on the third page. Hence, despite a large number of retrieved documents, the user, who is given only a page of documents at a time, does not know if he/she can find documents about the fruit topic of “apple” in subsequent pages. Often, even after the query is modified, there are too many retrieved documents and the relevant ones are scattered too sparsely to be conveniently reviewed, so that the user must continue modifying the query or browsing through many further pages of results. Furthermore, if the user provides feedback after viewing a page then the IR system will retrieve another set of documents, and so he/she may no longer be able to retrieve documents that were in subsequent pages down the ranking before the feedback was provided. So, besides the possible poor relevance estimation by the IR system, the pages presented to the user do not give him/her a full picture of all retrieved documents. A better way would be, for example, to provide topics that can give a better overall picture of the documents present in the ranking.
The above examples also reveal another problem, known as “vocabulary mismatch” [3]. Some words in the query that seem appropriate from the user’s point of view may not retrieve relevant documents; however, if the specific terminology that occurs in the relevant documents is used in the query, then the relevant documents can be retrieved. Therefore the retrieval result is very sensitive to the vocabulary used; in order to get the desired outcome, one has to use the correct keywords. The problem is that the user may not know those keywords, so he/she submits a query that only approximately describes the IN. IR systems designed to achieve maximal relevance for a correct query will retrieve non-relevant documents for a not-so-correct query. Furthermore, the user will not be shown the correct keywords because there may not have been many relevant documents retrieved. This research proposes that the degree of sensitivity to the vocabulary used depends on the degree of diversity of the retrieved documents, as was shown above, and also on both the degree of consideration of dependencies between different terms and the dependencies between the semantic content of different documents.
Uncertainty as the Main Cause of the Problems
Regardless of the above problems, IR systems can provide the necessary result under some particular conditions. For example, if the user knows the title of a document precisely, then the document can generally be found easily.
However, the conditions surrounding retrieval decision making in the IR system are often uncertain, as in the examples above. The uncertainty is caused by these main factors: 1) the IN is fuzzy; 2) the query is an imprecise formulation of the corresponding IN, because of such problems as natural language ambiguity, as in the examples above, lack of knowledge of the correct keywords, inexperience in the domain, or ignorance of English or spelling; 3) there are changes of IN during the search process.
When considering uncertainty in retrieval decision making, researchers mean only one of its components: natural language ambiguity (see sub-chapter IV.1).
Ambiguous queries may arise even for certain INs as a result of the inherent characteristics of natural language. In this case, even a human, in place of a machine, would be unable to make a correct retrieval decision. Graded relevance values are necessary for IR effectiveness evaluation under conditions of ambiguity.
However, as shown above, the uncertainty in retrieval decision making is actually more complicated, because of its additional aspects. The component of uncertainty that differs from natural language ambiguity is “endogenous uncertainty”. Endogenous uncertainty can be reduced by improving IR algorithms; however, it cannot be completely eliminated. IR effectiveness evaluation in conditions of endogenous uncertainty can be conducted with traditional evaluation methods.
An IR system that tries to eliminate the problems described in the previous section does not have to be based on fuzzy queries, minorities of users and ignorance of the relevant keywords. It must find the best average solution based on the probabilistic approach, where probabilities describe the uncertainty in IR decision making. Thus, the less uncertainty there is, the smaller the required changes to the solution provided by the traditional IR approach.
Feedback in IR and its Problems
Related to uncertainty is the concept of feedback. An adaptive IR system with feedback retrieves documents in iterations, retrieving a page of documents and receiving feedback with each iteration (see Figure 1 and Figure 2). The user can guide the IR system regarding his/her IN by providing feedback on the relevance or non-relevance of individual documents on a page. Thus feedback can reduce the IR system’s uncertainty about the user’s IN and consequently can improve retrieval effectiveness. When it is difficult to provide an accurate query, as in some examples of image retrieval and in the case of a highly dynamic IN during the search session, the importance of feedback increases.
[Figure 1: diagram with the labels User, IRS and Observable]
Figure 1 Dynamics of adaptive IR with feedback
[Figure 2: block diagram spanning IRS and User with the labels Query, Querying, Information need, Relevance estimation, Retrieved documents, Database, Sorting, Info Need Feedback, Info Need Correction and Representation of Info Need]
Figure 2 IR system as a control system
However, users can appear to be reluctant to offer feedback, and so this facility is often not used in practice. Feedback can become more useful if the following problems are taken into account: 1) problems in the design of the feedback user interface; 2) the users’ mistakes in feedback, as it is provided by a person with attendant human errors; 3) poor feedback algorithms; 4) the fact that feedback in IR is evaluative, not instructional, so it does not allow determination of whether there are more relevant documents among the not-retrieved documents than among the retrieved documents; 5) the fact that if the retrieved documents are not diverse enough and are not relevant, feedback usually is negative and therefore has a narrowing effect.
The last four of these five problems with feedback in IR, as well as the other general IR problems previously described, underline the motivation for using a dual control approach to IR, as described in the next sub-chapter.
Summary
Although IR is important as a fundamental part of a broad spectrum of information technologies, it has drawbacks that are revealed especially under conditions of an uncertain search.
2 Definition of Adaptive Dual Topic-Based IR
In order to eliminate the problems previously described, it is proposed to make the IR process both adaptive dual, based on adaptive dual control theory [4], and topic-based: Adaptive Dual Topic-Based IR (ADTIR).
Adaptive Dual IR
The aim of Adaptive Dual Information Retrieval (ADIR) is to provide maximum relevance of the retrieved documents for a given query in a search session with feedback [5; 6; 7; 8; 9].
Conventional relevance feedback IR, which is aimed at maximal relevance of the current ranking of retrieved documents, is in general not optimal for the whole session. It was noted in the previous sub-chapter that IR is conducted under conditions of uncertainty; the search session with feedback comprises a few iterations with separate pages of retrieved documents, and feedback is provided for every page. The uncertainty of retrieval decision making has an impact on the quality of relevance estimation and consequently on the relevance of the retrieved documents. These documents have an impact on the quality of the feedback that reduces the uncertainty. By retrieving documents with the highest estimated relevance, traditional relevance feedback IR generally does not provide good quality feedback. Therefore, uncertainty is not reduced as much and the relevance estimate based on such feedback is generally worse.
The mechanism of providing optimal relevance in a search session by the adaptive dual approach can be explained by using two concepts of dual control theory: the dual control goals of caution and probing. In the IR case, the goal of caution is achieved by retrieving more relevant documents in the current page of results; the goal of probing is achieved by retrieving more diverse documents, thus encouraging better user feedback. In ADIR, the decision to retrieve a document is not made solely on its estimated relevance but is also based on its usefulness for clarifying the user’s IN through the feedback that will be obtained for the retrieved document. Note that the two dual control goals may not be used explicitly in the controller. They are used here for the purposes of explanation only.
Instead of choosing documents that give the best estimated relevance at the current iteration (that is, by sorting them by relevance and choosing the most relevant), the aim of ADTIR is to choose documents that give the best combination of the two goals of dual control, such that the total relevance in the search session is maximized (see Figure 3). Note that though the total relevance is maximized, relevance at a given iteration may be smaller in ADTIR than that produced by an approach based on the Probability Ranking Principle (PRP) [10]. Such an optimal combination is found by solving an optimization model that has to take into account possible user feedback for every variant of retrieved documents over all future iterations of the search. Thus the optimization model also models the future process of interaction with the user. The adjustment in dual IR is made a priori, to the possible future values of feedback, not just to the feedback already obtained, as in conventional relevance feedback IR.
[Figure 3: block diagram spanning IRS and User with the labels Query, Querying, Information need, Relevance estimation, Retrieved documents, Database, Info Need Feedback, Info Need Correction, Representation of Info Need, Caution, Probing and Optimization]
Figure 3 Adaptive Dual Information Retrieval System as a control system
The optimal value of relevance in a search session is achieved as a combination of the dual goals, caution (relevance of documents in the current ranking) and probing (diversity of the retrieved documents), and not by their separate extreme values, for the following reason. Documents that are less relevant may provide better possibilities for probing than more relevant ones. Therefore probing may lead to a lower relevance of retrieved documents at the current iteration. However, it also may provide better feedback, leading to better IN estimation and consequently to greater relevance at a later iteration in the search session. Conversely, caution may increase relevance of the current iteration through an increase in the number of retrieved similar relevant documents but also may decrease feedback quality and accordingly decrease relevance at subsequent iterations. The sorting of documents with respect to some measure of estimated relevance, as in traditional ad hoc retrieval, does not take into account these aspects of the dual goals.
Topic-Based IR
To lower computational complexity, improve IR effectiveness and reduce the other previously mentioned drawbacks of traditional IR systems, documents can be organized into clusters that are assumed to represent topics. Because the assumption that documents precisely belong to a topic may not be entirely valid, a decision to retrieve a cluster may be coarser and less precise than a decision to retrieve an individual document. However, the introduction of clusters has distinct advantages.
In addition to the advantages detailed below, clusters reduce computational complexity and therefore enable the implementation of more precise but computationally feasible optimization models, as is shown in chapter VI. This possibility rests not only on the reduced dimensionality but also on the ability to infer the relevance of a whole cluster from a document retrieved from it.
Therefore, for the purposes of this thesis, the developed ADTIR models assume that the documents within a ranking are divided into topic-based clusters. The solution of the optimization model thus defines not the individual documents that are retrieved but the quantities of documents that are retrieved from every cluster. Constrained by the number of documents retrieved from each cluster, individual documents are retrieved separately (see Figure 4).
build probabilities for the given query
estimate number of stages T
i = 1
while (i <> T)
    solve optimization model for stages [i,…,T]
    for each cluster in stage i
        retrieve # of documents according to the solution
    get feedback
    correct probabilities for the given feedback
    i = i + 1
Figure 4 Algorithm of ADTIR
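The control flow of Figure 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: `allocate` is a hypothetical myopic stand-in for the optimization model (it assigns page slots in proportion to the current topic probabilities and does not look ahead over stages i to T), `correct` is a hypothetical multiplicative update standing in for the probability correction, and the topic names, probabilities, `boost` and `damp` values are invented for the example. The loop runs through stage T inclusive, which is one reading of the condition in the figure.

```python
from typing import Callable, Dict, List, Set, Tuple

Probs = Dict[str, float]
Page = List[Tuple[str, str]]  # (topic, document id)


def normalise(probs: Probs) -> Probs:
    """Rescale topic probabilities so that they sum to 1."""
    total = sum(probs.values())
    return {z: p / total for z, p in probs.items()}


def allocate(probs: Probs, page_size: int) -> Dict[str, int]:
    """Hypothetical stand-in for 'solve optimization model': proportional
    per-cluster quotas with largest-remainder rounding."""
    raw = {z: p * page_size for z, p in probs.items()}
    quota = {z: int(r) for z, r in raw.items()}
    leftover = page_size - sum(quota.values())
    by_remainder = sorted(raw, key=lambda z: raw[z] - quota[z], reverse=True)
    for z in by_remainder[:leftover]:
        quota[z] += 1
    return quota


def correct(probs: Probs, shown: Set[str], relevant: Set[str],
            boost: float = 4.0, damp: float = 0.25) -> Probs:
    """Hypothetical stand-in for 'correct probabilities': boost topics marked
    relevant, damp topics shown but not marked, then renormalise."""
    updated = {}
    for z, p in probs.items():
        if z in relevant:
            updated[z] = p * boost
        elif z in shown:
            updated[z] = p * damp
        else:
            updated[z] = p
    return normalise(updated)


def adtir_session(probs: Probs, clusters: Dict[str, List[str]], stages: int,
                  page_size: int, get_feedback: Callable[[Page], Set[str]]):
    probs = normalise(probs)                    # build probabilities
    remaining = {z: list(docs) for z, docs in clusters.items()}
    pages: List[Page] = []
    i = 1
    while i <= stages:                          # stages 1..T inclusive (assumption)
        quota = allocate(probs, page_size)      # placeholder for the model
        page: Page = []
        for z, n in quota.items():              # for each cluster in stage i
            for _ in range(min(n, len(remaining[z]))):
                page.append((z, remaining[z].pop(0)))
        pages.append(page)
        relevant = get_feedback(page)           # get feedback
        probs = correct(probs, {z for z, _ in page}, relevant)
        i += 1
    return pages, probs


# Illustrative run: a simulated user who is only interested in "tree".
clusters = {z: [f"{z}-{k}" for k in range(20)] for z in ("computer", "fruit", "tree")}
wants_tree = lambda page: {z for z, _ in page if z == "tree"}
pages, final = adtir_session({"computer": 0.5, "fruit": 0.3, "tree": 0.2},
                             clusters, stages=2, page_size=10,
                             get_feedback=wants_tree)
```

Because the first page includes documents from the less probable "tree" cluster, the simulated user can signal that topic, and the second page is then dominated by it. A real ADTIR system would replace `allocate` with the stochastic programs of chapter VI.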
Use of clusters also has other advantages that reduce the drawbacks of IR systems mentioned in sub-chapter 1. An obvious one is a simpler user interface. It is easier for the user to grasp 10 topics than 1000 documents, and it is easier to provide feedback.
Clusters also allow the retrieved documents to be easily diversified. Documents within a cluster are semantically similar; therefore, by just increasing or decreasing the number of retrieved clusters, an IR system markedly changes the diversity of the retrieved documents. Hence, when an ADTIR system retrieves documents not only from the most relevant cluster but also from other, less relevant clusters, the diversity of the retrieved documents is increased. The retrieval of less relevant clusters is not an end in itself; these clusters are retrieved because they provide the potential for more useful feedback, that is, when the possibility of their relevance is high enough to justify their retrieval for clarification of their relevance through feedback.
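To make the effect of the number of retrieved clusters concrete, the following sketch fills one page of results from the top `n_clusters` clusters in round-robin order. The function `fill_page`, the cluster contents and the round-robin rule are assumptions made for illustration only; the thesis determines per-cluster quantities with an optimization model rather than with this rule.

```python
from typing import List, Sequence, Tuple


def fill_page(ranked_clusters: Sequence[Tuple[str, Sequence[str]]],
              page_size: int, n_clusters: int) -> List[Tuple[str, str]]:
    """Fill a page from the n_clusters most relevant clusters, one document
    per cluster in turn. ranked_clusters is ordered most relevant first."""
    chosen = [(topic, list(docs)) for topic, docs in ranked_clusters[:n_clusters]]
    page: List[Tuple[str, str]] = []
    while len(page) < page_size and any(docs for _, docs in chosen):
        for topic, docs in chosen:
            if docs and len(page) < page_size:
                page.append((topic, docs.pop(0)))
    return page


# Hypothetical clusters for the query "apple", most relevant first.
ranked = [(t, [f"{t}-{k}" for k in range(10)]) for t in ("computer", "fruit", "tree")]

narrow = fill_page(ranked, page_size=6, n_clusters=1)   # only the top cluster
diverse = fill_page(ranked, page_size=6, n_clusters=3)  # three clusters

topics_narrow = {topic for topic, _ in narrow}
topics_diverse = {topic for topic, _ in diverse}
```

With the same page size, `narrow` covers one topic while `diverse` covers three, which is the sense in which changing the number of retrieved clusters changes diversity.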
Related to the above advantage is the possibility of using S-recall [11] as a measure of IR effectiveness. S-recall is used instead of the traditional IR evaluation measure recall because recall cannot be used for clusters (see sub-chapter IV.1); however, S-recall, which is a generalization of recall with its essence preserved, can be used and is designed to be used with topics (or clusters that represent topics).
S-recall is the ratio of the number of retrieved relevant topics to the number of all relevant topics. Taking into account that the higher the number of document topics, the more diverse the documents are, it follows that S-recall reflects the diversity of the retrieved relevant documents. Traditional recall does not consider topics; however, assuming that every single document is equally different from every other document, this measure also reflects the diversity.
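The definition of S-recall given above translates directly into a few lines of Python. The function name `s_recall` and the representation of each retrieved document as a collection of topic labels are assumptions made for this sketch.

```python
from typing import Iterable, Set


def s_recall(retrieved_doc_topics: Iterable[Iterable[str]],
             relevant_topics: Set[str]) -> float:
    """Share of the relevant topics that are covered by at least one
    retrieved document."""
    if not relevant_topics:
        raise ValueError("relevant_topics must not be empty")
    covered: Set[str] = set()
    for topics in retrieved_doc_topics:
        covered.update(t for t in topics if t in relevant_topics)
    return len(covered) / len(relevant_topics)


# Three retrieved documents covering two of the three relevant topics.
page = [["computer"], ["computer"], ["fruit"]]
score = s_recall(page, {"computer", "fruit", "tree"})
```

Retrieving a second document on an already covered topic does not raise the score, whereas a document on a new relevant topic does, which is why the measure rewards diversity.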
Thus evaluation with S-recall is very important because, as is shown in sub-chapter 1, diversity of retrieved relevant documents is a desirable feature.
The Proposed Vision of IR
In ADTIR the documents are divided between semantically coherent clusters that represent topics. Semantic proximity between pairs of topics (“topic-relevance”) can be measured either manually or automatically off-line. The user’s IN typically covers a single topic or a few topics (“searched topics”). For a given query, there is a random event with the corresponding probability that a topic is being searched. Given that a user seeks a specific topic, the usefulness of each of the topics to that user is defined by their topic-relevance to the searched topic. That is, the aim of the user is to retrieve documents from the topics that are topic-relevant to the searched topics (“relevant topics”). These also include the searched topics, of course, because topics are topic-relevant to themselves.
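One way to read this vision numerically is as an expectation: the usefulness of a topic, averaged over users, is its topic-relevance to each possible searched topic weighted by the probability that that topic is the one being searched. The sketch below assumes this reading; the function name, the probabilities and the topic-relevance values are invented for illustration and are not taken from the thesis.

```python
from typing import Dict


def expected_topic_usefulness(p_searched: Dict[str, float],
                              topic_relevance: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """For each topic z, sum over searched topics s of
    P(s is searched | query) * topic_relevance[z][s]."""
    return {
        z: sum(p * topic_relevance[z][s] for s, p in p_searched.items())
        for z in topic_relevance
    }


# Illustrative values only: every topic is fully relevant to itself, and
# "fruit" and "tree" are assumed to be partly relevant to each other.
p_searched = {"computer": 0.6, "fruit": 0.3, "tree": 0.1}
topic_relevance = {
    "computer": {"computer": 1.0, "fruit": 0.0, "tree": 0.0},
    "fruit":    {"computer": 0.0, "fruit": 1.0, "tree": 0.5},
    "tree":     {"computer": 0.0, "fruit": 0.5, "tree": 1.0},
}
usefulness = expected_topic_usefulness(p_searched, topic_relevance)
```

Under these assumed numbers the "tree" topic is more useful on average (0.25) than its own search probability (0.1) suggests, because users searching for "fruit" also benefit from it.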
Summary
The argument in this sub-chapter is that the proposed ADTIR may ease some drawbacks of traditional IR; therefore its further development is justified.
3 A Counterexample to PRP
According to the PRP [10], which is the “bedrock of IR research” [12], if retrieved documents are ordered by their probability of relevance, then maximum precision is obtained. However, as ADIR sometimes requires retrieving documents with lower probability in order to increase the effect of the dual goal of probing, it does not conform to the PRP.
With regard to this, the counterexample below shows that the PRP does not provide optimal precision for IR with feedback. The example also illustrates the idea behind ADIR.
Consider an interaction between the IR system and the user, for a query “apple” and the topic-relevance estimated by the IR system, given in Table 1. The IR system has a predisposition to promote documents about “apple computers” higher in the ranking than other apple topics such as apple fruit. Users submitting this query may search for the 3 topics given in the table. The estimated relevance should represent the proportions of these users. Interpreting the data in Table 1 in this way means that most users would search for “apple computer”, fewer users for “apple fruit” and the fewest users for “apple tree”.
Table 1 Estimated relevance of one document
Topic Estimated relevance before
Relevance for a Minority User
Suppose a user is searching for the topic “apple tree”. He/she is then a minority user. The corresponding actual topic-relevance values are given in Table 2.