
Adaptive dual control of topic based information retrieval


Adaptive Dual Control of Topic-Based Information Retrieval

A dissertation submitted in partial fulfilment of the requirement for the degree of Doctor of Philosophy

by

Vitaliy Vitsentiy

M.Sc., Ternopil Academy of National Economy, Ukraine

School of Information Technology, Faculty of Science and Technology, Queensland University of Technology

Brisbane, Australia

2009

Keywords

topic-based information retrieval, dual control, stochastic programming

Abstract

This thesis presents an adaptive IR system based on the theory of adaptive dual control. The aim of the approach is the optimization of retrieval precision after all feedback has been issued. This is done by increasing the diversity of retrieved documents. This study shows that the value of recall reflects this diversity.

The Probability Ranking Principle is viewed in the literature as the “bedrock” of current probabilistic Information Retrieval theory. Neither the proposed approach nor other methods of diversification of retrieved documents from the literature conform to this principle. This study shows by counterexample that the Probability Ranking Principle does not in general lead to optimal precision in a search session with feedback (for which it may not have been designed but is actively used).

Retrieval precision of the search session should be optimized with a multistage stochastic programming model to accomplish the aim. However, such models are computationally intractable. Therefore, approximate linear multistage stochastic programming models are derived in this study, where the multistage improvement of the probability distribution is modelled using the proposed feedback correctness method. The proposed optimization models are based on several assumptions, starting with the assumption that Information Retrieval is conducted in units of topics.

The use of clusters is the primary reason why a new method of probability estimation is proposed.

The adaptive dual control of topic-based IR system was evaluated in a series of experiments conducted on the Reuters, Wikipedia and TREC collections of documents. The Wikipedia experiment revealed that the dual control feedback mechanism improves precision and S-recall when all the underlying assumptions are satisfied. In the TREC experiment, this feedback mechanism was compared to a state-of-the-art adaptive IR system based on BM-25 term weighting and the Rocchio relevance feedback algorithm. The baseline system exhibited better effectiveness than the cluster-based optimization model of ADTIR. The main reason for this was the insufficient quality of the generated clusters in the TREC collection, which violated the underlying assumption.

Table of Contents

Keywords

Abstract

Table of Contents

List of Tables

List of Figures

List of Acronyms and Abbreviations

Basic Mathematical Notation

Statement of Original Authorship

Acknowledgements

I Introduction

1 Motivation

Importance of IR

Empirical Evidence of Problems in IR

Uncertainty as the Main Cause of the Problems

Feedback in IR and its Problems

Summary

2 Definition of Adaptive Dual Topic-Based IR

Adaptive Dual IR

Topic-Based IR

The Proposed Vision of IR

Summary

3 A Counterexample to PRP

Relevance for a Minority User

Expected Relevance across all Users

Summary

II Design of the Research

1 Methodology of the Research

Guidelines

Hypotheses of the Research

Contributions

Summary

2 Taxonomy of the Research Problem

Information Retrieval

Adaptive Dual Control and Stochastic Programming

Theory of Algorithms

Artificial Intelligence

Machine Learning

Summary

3 Outline of the Further Narrative

III Review of Probabilistic and Topic-Based IR

1 Probabilistic Approaches to IR

Probability Ranking Principle

Probabilistic Models

Language models

Summary

2 Topic-Based IR

Latent Semantic Analysis

Cluster model

Probabilistic Latent Semantic Analysis

Latent Dirichlet Allocation

Summary

3 Feedback in Adaptive IR

Feedback for Vector Space Models

Feedback for Probabilistic Models

Feedback for Language Models

Feedback for LSA model

Summary

IV Review of Uncertainty-Related Methods

1 Problems of Uncertainty and Diversity in IR

Uncertainty in IR

Diversity in IR

Evaluation of Diversity in IR

Summary

2 Approaches to Tackle Uncertainty and Diversity in IR

Diversity Stimulation

Multicriterion Matching Scores

Active Learning

Reinforcement Learning

Summary

3 Adaptive Dual Control and Stochastic Programming

Adaptive Dual Control

Direct Methods

Indirect Methods

Stochastic Programming

Summary

V Relevance Estimation

1 Probability Estimation

Modelling Probabilities Based on Searched Features

The Language Modelling Approach

The Document Sampling Approach

Probabilistic User-based Model

Summary

2 Expected Relevance

The General Approach

Smoothing

Learning Topic-Relevance and Bias Coefficients

Feedback

Summary

VI Decision Optimization

1 Two-stage Stochastic Program

Optimization in Space of Documents

Optimization in the Space of Clusters

Approximate Formulation

Relaxed Approximate Formulation

Linear Approximate Formulation

Summary

2 Multistage Stochastic Program

Feedback Correctness Approach

Linear Equivalent

Summary

VII Experiments

1 Experimental Design

The Goals

Document Collections

Queries and Relevance Judgments

Baseline Systems

ADTIR Systems

Evaluation Measures

Reuters experiment

Software

Summary

2 Results and Discussion

Reuters Experiment

Wikipedia Experiment

TREC Experiment

Summary

VIII Conclusions

Addressed Problems

Findings

Further Research

A Derivation of Expected Relevance in the Example

B Retrieved Documents for Query “Apple”

C Query Set for Reuters Experiment

D Candidate User-based Model Functions

E An Example of Program Output

Bibliography

List of Tables

Table 1 Estimated relevance of one document

Table 2 Actual relevance of one document

Table 3 PRP-based approach. Case of the user’s given relevant topic

Table 4 Not-PRP-based approach. Case of the user’s given relevant topic

Table 5 Comparison of the results for the user’s given relevant topic

Table 6 Second page. PRP-based approach

Table 7 Second page. Non-PRP-based approach

Table 8 Document Collections

Table 9 Parameters of the TREC experiment’s baseline system

Table 10 Parameters of the ADTIR systems

Table 11 Topic-relevance and bias coefficients

Table 12 Bias coefficients only

Table 13 Retrieval effectiveness in Wikipedia experiment

Table 14 Retrieval effectiveness in TREC experiment

Table 15 PRP-based approach

Table 16 Not-PRP-based approach

List of Figures

Figure 1 Dynamics of adaptive IR with feedback

Figure 2 IR system as a control system

Figure 3 Adaptive Dual Information Retrieval System as a control system

Figure 4 Algorithm of ADTIR

Figure 5 Comparison of the results for an average user

Figure 6 Graphical model representation of PLSA

Figure 7 Graphical model representation of LDA

Figure 8 Precision in Wikipedia experiment

Figure 9 S-Recall in Wikipedia experiment

Figure 10 Precision on iterations in Wikipedia experiment

Figure 11 S-Recall on iterations in Wikipedia experiment

Figure 12 Precision in TREC experiment

Figure 13 Recall in TREC experiment

Figure 14 Precision on iterations in TREC experiment

Figure 15 Recall on iterations in TREC experiment

List of Acronyms and Abbreviations

ADIR Adaptive Dual Information Retrieval

ADTIR Adaptive Dual Topic-Based Information Retrieval

LSA Latent Semantic Analysis

LSI Latent Semantic Indexing

PDF Probability Density Function

PMF Probability Mass Function

PRP Probability Ranking Principle

SVD Singular Value Decomposition

Basic Mathematical Notation

f a component of function of x i g with coefficient b that models v selection of query terms by the user;

a ′ topic-relevance between topics z and z′ multiplied by the sample bias coefficient and used in the optimization algorithm, usually found from the learning query set;

N set of natural numbers;

g ′, y zt z′ number of documents to retrieve from cluster z if according to feedback topic of cluster z′ is relevant;

ρ^u probability of correct feedback on a topic if number of retrieved pages with a document on a topic is u;

c^u a combination of numbers of retrieved documents on the iterations up to u

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signed: Date

Acknowledgements

We sincerely thank the supervisor of this research, Prof. Peter Bruza, for his continuous support throughout the dissertation work, and especially at those times when the preliminary research results were difficult to convey to other people.

We also thank the associate supervisor, Prof. Amanda Spink, for her encouragement to work on this research subject at QUT. We are grateful to Prof. Anatoly Sachenko and Prof. George Markowsky, with whom we worked previously on this research subject. Furthermore, we are indebted to our proofreaders.

This research has been supported by NICTA and QUT with the PhD scholarship, equipment and an excellent working environment.

For Olya

it is also important but often even more difficult to search for necessary articles. For example, patents are difficult to search because their owners often wish to obfuscate claims in the pursuit of their commercial interests. In any case, it is sometimes difficult for the user to optimally represent his/her Information Need (IN) as a query formulated for the IR system. In such conditions a manual search is inefficient, and so the role of support from the IR system is very important.

Empirical Evidence of Problems in IR

The following empirical evidence of problems that exist in the current level of support from web IR systems provides motivation for this research and illustrates its research problems. Although these familiar issues are evidenced in the literature review as well, these research problems have not yet been solved and are still active research topics within the IR research community.

The first IR problem considered is that of “insufficient diversity”. Take the query “apple” as an example. With such a query on the web, there are usually only documents about Apple computers in the first pages of the retrieved documents and none about the fruit, tree, and other “apple” concepts (see appendix B). The query “apple” is equally suitable for all of these topics, but only documents about Apple computers are retrieved in the top part of the ranking. Either a more precise re-formulation of the query or browsing through further pages down the ranking is necessary to encounter topics related to “apple” other than the computer topic. The resulting web search is more suitable for users interested in Apple computers than for users interested in other topics related to “apple”. There are many web pages about apple fruits, apple trees, apple dishes, even a political party called “apple”. Their existence indicates that they also have users who may want to search for them. These users are left unsatisfied with the initial query result and must continue the search. So subjective assessment suggests that, when ranking documents for a given query, IR systems retrieve documents suitable for some users and do not take into account all users.

Recent IR research papers [1], as well as sub-chapters IV.1 and IV.2, cover the problem of insufficient diversity and recognize the need for more diverse documents.

Other evidence of the above problem comes from the fact that almost all current document collections for evaluation of IR systems have one number as a relevance judgment for each query and document [1]. That is, each judgment cannot represent all the topics of the query that cover the desired outcomes for different users. The aim of traditional IR systems is to get the best performance for only one topic per query. In order to improve this, document collections for IR evaluation must include graded relevance judgments that would cover a wider selection of relevant topics.

In the case of multitasking IR [2], there are a few relevant topics. In this situation the topics are not alternatives for a given user, as considered in the example above, but are often mutually complementary. Therefore more diverse retrieved documents better satisfy such information searches, even for individual users. For example, in submitting the query “apple”, an individual user may be interested in both fruits and apple recipes.

Often many documents are selected by the IR system as relevant to the query and are retrieved for the user. For example, there are 586,000,000 retrieved documents for the query “apple” (see appendix B). However, there are no documents about the fruit in the first and second pages of retrieved documents and only a few documents on the third page. Hence, despite a large number of retrieved documents, the user, who is given only a page of documents at a time, does not know if he/she can find documents about the fruit topic of “apple” in subsequent pages. Often even after the query is modified there are too many retrieved documents and the relevant ones are scattered too sparsely to be conveniently reviewed, so that the user must continue modifying the query or browsing through many further pages of results. Furthermore, if the user provides feedback after viewing a page then the IR system will retrieve another set of documents, and so he/she may no longer be able to retrieve documents that were in subsequent pages down the ranking before the feedback was provided.

So, besides the possibly poor relevance estimation by the IR system, the pages presented to the user do not give him/her a full picture of all retrieved documents. A better way would be, for example, to provide topics that can give a better overall picture of the documents present in the ranking.

The above examples also reveal another problem, known as “vocabulary mismatch” [3]. Some words in the query that seem appropriate from the user’s point of view may not retrieve relevant documents; however, if the specific terminology that occurs in the relevant documents is used in the query, then the relevant documents can be retrieved. Therefore the retrieval result is very sensitive to the vocabulary used; in order to get the desired outcome, one has to use the correct keywords. The problem is that the user may not know those keywords, so that he/she submits a query that only approximately describes the IN. IR systems designed to achieve maximal relevance for a correct query will retrieve non-relevant documents for a not-so-correct query. Furthermore, the user will not be shown the correct keywords because there may not have been many relevant documents retrieved. This research proposes that the degree of sensitivity to the vocabulary used depends on the degree of diversity of the retrieved documents, as was shown above, and also on both the degree of consideration of dependencies between different terms and the dependencies between the semantic content of different documents.

Uncertainty as the Main Cause of the Problems

Regardless of the above problems, IR systems can provide the necessary result under some particular conditions. For example, if the user knows the title of a document precisely, then the document can generally be found easily.

However, the conditions surrounding retrieval decision making in the IR system are often uncertain, as in the examples above. The uncertainty is caused by these main factors: 1) the IN is fuzzy; 2) the query is an imprecise formulation of the corresponding IN, because of such problems as natural language ambiguity, as in the examples above, lack of knowledge of the correct keywords, inexperience in the domain, or ignorance of English or spelling; 3) there are changes of IN during the search process.

When considering uncertainty in retrieval decision making, researchers mean only one of its components: natural language ambiguity (see sub-chapter IV.1).

Ambiguous queries may arise even for certain INs as a result of the inherent characteristics of natural language. In this case, even a human, in place of a machine, would be unable to make a correct retrieval decision. Graded relevance values are necessary for IR effectiveness evaluation under conditions of ambiguity.

However, as shown above, the uncertainty in retrieval decision making is actually more complicated, because of its additional aspects. The component of uncertainty that differs from natural language ambiguity is “endogenous uncertainty”. Endogenous uncertainty can be reduced by improving IR algorithms; however, it cannot be completely eliminated. IR effectiveness evaluation in conditions of endogenous uncertainty can be conducted with traditional evaluation methods.

An IR system that tries to eliminate the problems described in the previous section does not have to be based on fuzzy queries, minorities of users and ignorance of the relevant keywords. It must find the best average solution based on the probabilistic approach, where probabilities describe the uncertainty in IR decision making. Thus, the less uncertainty there is, the smaller the required changes to the solution provided by the traditional IR approach.

Feedback in IR and its Problems

Related to uncertainty is the concept of feedback. An adaptive IR system with feedback retrieves documents in iterations, retrieving a page of documents and receiving feedback with each iteration (see Figure 1 and Figure 2). The user can guide the IR system regarding his/her IN by providing feedback on the relevance or non-relevance of individual documents on a page. Thus feedback can reduce the IR system’s uncertainty about the user’s IN and consequently can improve retrieval effectiveness. When it is difficult to provide an accurate query, as in some cases of image retrieval, or when the IN is highly dynamic during the search session, the importance of feedback increases.

Figure 1 Dynamics of adaptive IR with feedback [diagram; recoverable labels: User, IRS, Observable]

Figure 2 IR system as a control system [diagram; recoverable labels: Info Need, Feedback, Query, Querying, Information need, Relevance estimation, Retrieved documents, Database, Sorting, Info Need Correction, Representation of Info Need, IRS, User]

However, users can appear to be reluctant to offer feedback and so this facility is often not used in practice. Feedback can become more useful if the following problems are taken into account: 1) problems in the design of the feedback user interface; 2) the users’ mistakes in feedback, as it is provided by a person with attendant human errors; 3) poor feedback algorithms; 4) the fact that feedback in IR is evaluative, not instructional, so it does not allow determination of whether there are more relevant documents among the not-retrieved documents than among the retrieved documents; 5) the fact that if the retrieved documents are not diverse enough and are not relevant, feedback usually is negative and therefore has a narrowing effect.

The last four of these five problems with feedback IR, as well as other general IR problems previously described, underline the motivation for using a dual control approach to IR, such as the one described in the next sub-chapter.

Summary

Although IR is important as a fundamental part of a broad spectrum of information technologies, it has drawbacks that are revealed especially under conditions of an uncertain search.

2 Definition of Adaptive Dual Topic-Based IR

In order to eliminate the problems previously described, it is proposed to make the IR process both adaptive dual, based on adaptive dual control theory [4], and topic-based: Adaptive Dual Topic-Based IR (ADTIR).

Adaptive Dual IR

The aim of Adaptive Dual Information Retrieval (ADIR) is to provide maximum relevance of the retrieved documents for a given query in a search session with feedback [5; 6; 7; 8; 9].

Conventional relevance feedback IR, which is aimed at maximal relevance of the current ranking of retrieved documents, is in general not optimal for the whole session. It was noted in the previous sub-chapter that IR is conducted under conditions of uncertainty; the search session with feedback comprises a few iterations with separate pages of retrieved documents, and feedback is provided for every page. The uncertainty of retrieval decision making has an impact on the quality of relevance estimation and consequently on the relevance of the retrieved documents. These documents have an impact on the quality of feedback that reduces the uncertainty. By retrieving documents with the highest estimated relevance, traditional relevance feedback IR generally does not provide good quality feedback. Therefore, uncertainty is not reduced as much and the relevance estimate based on such feedback is generally worse.

The mechanism of providing optimal relevance in a search session by the adaptive dual approach can be explained by using two concepts of dual control theory: the dual control goals of caution and probing. In the IR case, the goal of caution is achieved by retrieving more relevant documents in the current page of results; the goal of probing is achieved by retrieving more diverse documents, thus encouraging better user feedback. In ADIR, the decision to retrieve a document is not made solely on its estimated relevance but is also based on its usefulness in clarifying the user’s IN through the feedback that will be obtained for the retrieved document. Note that the two dual control goals may not be used explicitly in the controller. They are used here for the purposes of explanation only.

Instead of choosing documents that give the best estimated relevance at the current iteration (that is, by sorting them by relevance and choosing the most relevant), the aim of ADTIR is to choose documents that give the best combination of the two goals of dual control, such that the total relevance in the search session is maximized (see Figure 3). Note that though the total relevance is maximized, relevance at a given iteration may be smaller in ADTIR than if produced by a Probability-Ranking-Principle-based (PRP) approach [10]. Such an optimal combination is found by solving an optimization model that has to take into account possible user feedback for every variant of retrieved documents over all future iterations of the search. Thus the optimization model also models the future process of interaction with the user. The adjustment in dual IR is made a priori, to the future possible values of feedback, not just to the feedback already obtained, as in conventional relevance feedback IR.

Figure 3 Adaptive Dual Information Retrieval System as a control system [diagram; recoverable labels: Info Need, Feedback, Query, Querying, Information need, Relevance estimation, Retrieved documents, Database, Info Need Correction, Representation of Info Need, Caution, Probing, Optimization, IRS, User]

The optimal value of relevance in a search session is achieved as a combination of the dual goals, caution (relevance of documents in the current ranking) and probing (diversity of the retrieved documents), and not by their separate extreme values, for the following reason. Documents that are less relevant may provide better possibilities for probing than more relevant ones. Therefore probing may lead to a lower relevance of retrieved documents at the current iteration. However, it also may provide better feedback, leading to better IN estimation and consequently to greater relevance at a later iteration in the search session. Conversely, caution may increase relevance of the current iteration through an increase in the number of retrieved similar relevant documents but also may decrease feedback quality and accordingly decrease relevance at subsequent iterations. The sorting of documents with respect to some measure of estimated relevance, as in traditional ad hoc retrieval, does not take into account these aspects of the dual goals.
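The trade-off between caution and probing can be illustrated with a small worked sketch in Python. Everything in it is an assumption made for illustration and is not taken from the thesis: the three topic probabilities, a page size of two, a session of two pages, exactly one searched topic per user, error-free feedback, and a greedy second page. Under those assumptions the function computes the expected number of relevant documents over the whole session for any choice of first page.

```python
from itertools import combinations_with_replacement

# Hypothetical prior that each topic is the one the user is searching for.
PRIOR = {"computer": 0.5, "fruit": 0.3, "tree": 0.2}
PAGE_SIZE = 2  # documents per page; the toy session has two pages


def expected_session_relevance(page1):
    """Expected number of relevant documents over two pages.

    page1 is a tuple of topic names, one per retrieved document.
    Assumes one searched topic, error-free feedback and a greedy page 2.
    """
    shown = set(page1)
    total = 0.0
    for topic, prob in PRIOR.items():
        # Page 1: only documents from the searched topic are relevant.
        first = page1.count(topic)
        if topic in shown:
            # Positive feedback identifies the topic; page 2 is all relevant.
            second = PAGE_SIZE
        else:
            # Negative feedback rules out the shown topics; page 2 is filled
            # from the most probable remaining topic.
            remaining = {t: q for t, q in PRIOR.items() if t not in shown}
            guess = max(remaining, key=remaining.get)
            second = PAGE_SIZE if guess == topic else 0
        total += prob * (first + second)
    return total


prp_page = ("computer", "computer")   # caution only: most probable topic
mixed_page = ("computer", "fruit")    # caution combined with probing
best_page = max(combinations_with_replacement(PRIOR, PAGE_SIZE),
                key=expected_session_relevance)
```

With these assumed numbers, the all-computer first page gives about 2.6 expected relevant documents over the session, while the mixed page gives about 2.8, even though the mixed first page on its own is expected to contain fewer relevant documents (0.8 against 1.0). The exhaustive search over first pages is only feasible because the toy problem is tiny.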

Topic-Based IR

To lower computational complexity, improve IR effectiveness and reduce the other previously mentioned drawbacks of traditional IR systems, documents can be organized into clusters that are assumed to represent topics. Because the assumption that documents precisely belong to a topic may not be entirely valid, a decision to retrieve a cluster may be coarser and less precise than a decision to retrieve an individual document. However, the introduction of clusters has distinct advantages.

In addition to the advantages detailed below, clusters reduce computational complexity and therefore enable implementation of more precise but computationally feasible optimization models, as is shown in chapter VI. This possibility rests not only on the reduced dimensionality but also on the ability to infer the relevance of a whole cluster from a document retrieved from it.

Therefore, for the purposes of this thesis, the developed ADTIR models assume that the documents within a ranking are divided into topic-based clusters. The solution of the optimization model thus defines not the individual documents that are retrieved but the quantities of documents that are retrieved from every cluster. Individual documents are then selected separately, constrained by the number of documents to be retrieved from each cluster (see Figure 4).

build probabilities for the given query
estimate number of stages T
i = 1
while (i <> T)
    solve optimization model for stages [i,…,T]
    for each cluster in stage i
        retrieve # of documents according to the solution
    get feedback
    correct probabilities for the given feedback
    i = i + 1

Figure 4 Algorithm of ADTIR
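The control flow of Figure 4 can be sketched in Python as follows. This is a minimal illustration, not the thesis's implementation: the function names, the `solve`, `get_feedback` and `update` callables, and the proportional allocation used in place of the optimization model are all hypothetical stand-ins. The loop condition mirrors the `i <> T` test of the figure.

```python
def proportional_quotas(cluster_probs, stage, num_stages, page_size):
    """Hypothetical stand-in for the optimization model.

    Splits one page among clusters in proportion to their probabilities,
    using largest-remainder rounding. The stage arguments are unused here.
    """
    raw = {c: p * page_size for c, p in cluster_probs.items()}
    quotas = {c: int(r) for c, r in raw.items()}
    leftover = max(0, page_size - sum(quotas.values()))
    by_remainder = sorted(raw, key=lambda c: raw[c] - quotas[c], reverse=True)
    for c in by_remainder[:leftover]:
        quotas[c] += 1
    return quotas


def run_adtir_session(cluster_probs, num_stages, page_size,
                      solve, get_feedback, update):
    """Iterate the loop of Figure 4; all callables are assumptions."""
    pages = []
    stage = 1
    while stage != num_stages:
        # Plan over stages stage..num_stages, act on the current stage only.
        quotas = solve(cluster_probs, stage, num_stages, page_size)
        pages.append(quotas)  # number of documents retrieved per cluster
        feedback = get_feedback(quotas)
        cluster_probs = update(cluster_probs, feedback)
        stage += 1
    return cluster_probs, pages


def no_feedback(quotas):
    """Placeholder user who gives no feedback."""
    return None


def keep_probs(cluster_probs, feedback):
    """Placeholder update that leaves the probabilities unchanged."""
    return cluster_probs


final_probs, pages = run_adtir_session(
    {"computer": 0.5, "fruit": 0.25, "tree": 0.25},
    num_stages=3, page_size=4,
    solve=proportional_quotas, get_feedback=no_feedback, update=keep_probs,
)
```

Passing the three steps in as callables keeps the loop independent of any particular optimization model or feedback method, so the placeholders can be swapped for real components without touching the loop.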

Use of clusters also has other advantages that reduce the drawbacks of IR systems mentioned in sub-chapter 1. An obvious one is a simpler user interface. It is easier for the user to grasp 10 topics than 1000 documents, and it is easier to provide feedback.

Clusters also allow the retrieved documents to be easily diversified. Documents within a cluster are semantically similar; therefore by just increasing or decreasing the number of retrieved clusters, an IR system markedly changes the diversity of the retrieved documents. Hence, when an ADTIR system retrieves documents not only from the most relevant cluster but also from other, less relevant clusters, the diversity of the retrieved documents is increased. The retrieval of less relevant clusters is not an end in itself; it is done because these clusters provide the potential for more useful feedback, that is, when the possibility of their relevance is high enough to justify their retrieval for clarification of their relevance through feedback.

Related to the above advantage is the possibility of using S-recall [11] as a measure of IR effectiveness. S-recall is used instead of the traditional IR evaluation measure recall because recall cannot be used for clusters (see sub-chapter IV.1); however, S-recall, which is a generalization of recall with its essence preserved, can be used and is designed to be used with topics (or clusters that represent topics).

S-recall shows the ratio of the number of retrieved relevant topics to the number of all relevant topics. Taking into account that the higher the number of document topics, the more diverse they are, it follows that S-recall reflects the diversity of the retrieved relevant documents. Traditional recall does not consider topics; however, assuming that every single document is equally different from every other document, this measure also reflects the diversity.
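The ratio described above can be written as a short Python function. This is a sketch under stated assumptions: the function name, the representation of each retrieved document as a set of topic labels, and the example topics are hypothetical and chosen only to illustrate the definition.

```python
def s_recall(retrieved_doc_topics, relevant_topics):
    """Fraction of the relevant topics covered by the retrieved documents.

    retrieved_doc_topics: iterable of topic sets, one per retrieved document.
    relevant_topics: the set of all topics relevant to the query.
    """
    relevant = set(relevant_topics)
    if not relevant:
        raise ValueError("at least one relevant topic is required")
    covered = set()
    for topics in retrieved_doc_topics:
        # Count only topics that are relevant; repeats add nothing.
        covered.update(relevant.intersection(topics))
    return len(covered) / len(relevant)


# Three retrieved documents covering two of three relevant topics.
page = [{"computer"}, {"computer"}, {"fruit"}]
score = s_recall(page, {"computer", "fruit", "tree"})
```

In this example two of the three relevant topics are covered, so the score is 2/3. Retrieving a second document on the computer topic does not raise the score, which is how the measure rewards diversity rather than volume.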

Thus evaluation with S-recall is very important because, as is shown in sub-chapter 1, diversity of retrieved relevant documents is a desirable feature.

The Proposed Vision of IR

In ADTIR the documents are divided between semantically coherent clusters that represent topics. Semantic proximity between pairs of topics (“topic-relevance”) can be measured either manually or automatically off-line. The user’s IN typically covers a single or a few topics (“searched topics”). For a given query, there is a random event with the corresponding probability that a topic is being searched. Given that a user seeks a specific topic, the usefulness of each of the topics to that user is defined by their topic-relevance to the searched topic. That is, the aim of the user is to retrieve documents from the topics that are topic-relevant to the searched topics (“relevant topics”). These also include the searched topics, of course, because topics are topic-relevant to themselves.
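One natural reading of this description is that the expected usefulness of a topic is its topic-relevance averaged over the topics that may be searched, weighted by the probability that each is searched. The Python sketch below shows that reading only; it is an assumption, not the estimator developed later in the thesis. The function name, the probabilities and the topic-relevance values (including the value 1.0 for a topic's relevance to itself) are hypothetical.

```python
def expected_topic_relevance(search_prob, topic_relevance):
    """Expected relevance of each topic across all users of a query.

    search_prob[s]: assumed probability that topic s is the searched topic.
    topic_relevance[z][s]: assumed relevance of topic z to a user searching s.
    """
    return {
        z: sum(search_prob[s] * row[s] for s in search_prob)
        for z, row in topic_relevance.items()
    }


# Hypothetical values for the query "apple".
search_prob = {"computer": 0.5, "fruit": 0.3, "tree": 0.2}
topic_relevance = {
    "computer": {"computer": 1.0, "fruit": 0.0, "tree": 0.0},
    "fruit": {"computer": 0.0, "fruit": 1.0, "tree": 0.5},
    "tree": {"computer": 0.0, "fruit": 0.5, "tree": 1.0},
}
expected = expected_topic_relevance(search_prob, topic_relevance)
```

Under these assumed values the fruit and tree topics gain expected relevance from each other because they are partly topic-relevant, while the computer topic depends only on its own search probability.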

Summary

The argument in this sub-chapter is that the proposed ADTIR may ease some drawbacks of traditional IR; therefore its further development is justified.

3 A Counterexample to PRP

According to the PRP [10], which is the “bedrock of IR research” [12], if retrieved documents are ordered by their probability of relevance, then maximum precision is obtained. However, as ADIR sometimes requires retrieving documents with lower probability in order to increase the effect of the dual goal of probing, it does not conform to the PRP.

With regard to this, the counterexample below shows that the PRP does not provide optimal precision for IR with feedback. The example also illustrates the idea behind dual IR.

Consider an interaction between the IR system and the user, for a query “apple” and the topic-relevance estimated by the IR system, given in Table 1. The IR system has a predisposition to promote documents about “apple computers” higher in the ranking than other apple topics such as apple fruit. Users submitting this query may search for the 3 topics given in the table. The estimated relevance should represent the proportions of these users. Interpreting the data in Table 1 in this way means that most users would search for “apple computer”, fewer users for “apple fruit” and the fewest users for “apple tree”.

Table 1 Estimated relevance of one document

Topic Estimated relevance before

Relevance for a Minority User

Suppose a user is searching for the topic “apple tree”. He/she is then a minority user. The corresponding actual topic-relevance values are given in Table 2.

Date posted: 07/08/2017, 11:41