Probabilistic Search and Query

CHAPTER 3 MODELING AN INFORMATION RETRIEVAL SYSTEM

3.3.3 Probabilistic Search and Query

Ultimately, the query load incurred by a mediator, and by extension any sources beneath it, depends on the number of queries that mediator is asked to service. This value depends on a number of factors, including the mediator’s perceived value, the average number of queries arriving in the system, the number and value of competing mediators, and how many mediators are used to answer each query. To estimate this, we must first determine the relative rank ordering Mr of the mediator in question M, and the number of mediators Rr that share that ranking.

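$$M_r = 1 + \sum_{k \in O.\mathit{mediators}} \left( 0^{\max(M.\mathit{prs} - k.\mathit{prs},\, 0)} - 0^{\operatorname{abs}(M.\mathit{prs} - k.\mathit{prs})} \right) \tag{3.4}$$

$$R_r = \sum_{k \in O.\mathit{mediators}} 0^{\operatorname{abs}(M_r - k_r)} \tag{3.5}$$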

Where prs is the perceived response size of the respective mediator. Taking 0^0 = 1 and 0^x = 0 for x > 0, the summation term in Equation (3.4) equates to 1 when the competing size is higher, and 0 when it is equal or lower. Thus, the highest ranked mediator will have rank 1, followed by 2, and so forth. Mediators with the same value will have the same ranking, and Rr in Equation (3.5) counts the mediators at that ranking, including M itself. Using this information, it is possible to compute the probability P(M|Q) that mediator M will be selected to service query Q.
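The ranking rule in Equations (3.4) and (3.5) can be sketched in a few lines of Python. This is only an illustration: the function name `rank_and_ties` is hypothetical, the topic sizes of Figure 3.3 are used as a stand-in for the perceived response size, and the value 20 given to the three "other" mediators is a placeholder that does not come from the source.

```python
def rank_and_ties(prs_values):
    """Return (ranks, ties) for a list of perceived response sizes."""
    ranks = []
    for m in prs_values:
        # Equation (3.4): each term is 1 only when competitor k is strictly
        # larger than m. Python evaluates 0 ** 0 as 1 and 0 ** x as 0 for x > 0.
        ranks.append(1 + sum(0 ** max(m - k, 0) - 0 ** abs(m - k)
                             for k in prs_values))
    # Equation (3.5): count the mediators sharing each rank (itself included).
    ties = [sum(0 ** abs(r - kr) for kr in ranks) for r in ranks]
    return ranks, ties


# Four relevant mediators from Figure 3.3 plus three "other" mediators.
# The value 20 is a hypothetical placeholder for an insignificant topic size.
prs = [240, 160, 160, 80, 20, 20, 20]
ranks, ties = rank_and_ties(prs)
print(ranks)  # [1, 2, 2, 4, 5, 5, 5]
print(ties)   # [1, 2, 2, 1, 3, 3, 3]
```

The first four ranks agree with the {1,2,2,4} ordering shown in Figure 3.3. Note that a tie consumes the ranks it covers, so no mediator receives rank 3.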

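$$P(M|Q) = \frac{s}{|M|} \cdot \frac{1}{\binom{|M|-1}{s-1}} \sum_{i=0}^{q-1} \; \sum_{j=0}^{\min(s,R_r)-1} \binom{|M|-M_r-R_r+1}{s-i-j-1} \times \binom{M_r-1}{i} \binom{R_r-1}{j} \min\left(1, \frac{q-i}{j+1}\right) \tag{3.6}$$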

Where |M| is the total number of mediators, s is the number of mediators that will be searched and compared, and q is the number of mediators that will be given the query. Equation (3.6) models the search process and subsequent mediator selection that will take place when a query is received by the system. In this particular domain, some subset of the available mediators will be searched, and ranked based on their collection signatures. Using these ranks, a subset of those searched will actually be selected to service the query.

The general problem of finding appropriate partners or peers within a larger population is common in agent systems, so it is worth discussing Equation (3.6) in greater detail. It is helpful to ignore the domain characteristics of the search and selection process and focus first on the underlying counting and probability problem. I will use the familiar ball-and-bag metaphor. You are given a bag containing sets of black, green and red balls of known size, named B, G and R respectively. One green ball g∗ ∈ G is distinguished from the others. Select s balls from the bag to form set S. Choose q balls from S in order of preference red, green, black, i.e., a green ball will only be selected if no reds remain. Call this new set Q. I wish to find the probability that g∗ ∈ Q. In this abstraction, g∗ is the mediator in question M. R represents those mediators ranked higher than M, G the mediators of equal rank, and B those of lower rank. S is the set of mediators that are searched, and Q those that ultimately receive the query.
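The ball-and-bag problem is small enough to check by brute force. The sketch below enumerates every equally likely draw S and accumulates the chance that g∗ ends up in Q. It assumes that all draws of size s are equally likely and that ties among green balls are broken uniformly; the function name and the tuple encoding of the balls are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def prob_distinguished_green_chosen(n_red, n_green, n_black, s, q):
    """Exact P(g* in Q), found by enumerating every draw S of size s."""
    bag = ([("red", i) for i in range(n_red)]
           + [("green", i) for i in range(n_green)]
           + [("black", i) for i in range(n_black)])
    g_star = ("green", 0)  # the distinguished green ball (needs n_green >= 1)
    total = Fraction(0)
    draws = 0
    for S in combinations(bag, s):
        draws += 1
        if g_star not in S:
            continue  # g* was never drawn, so it cannot be chosen
        reds = sum(1 for colour, _ in S if colour == "red")
        greens = sum(1 for colour, _ in S if colour == "green")
        slots = q - reds  # picks left once every red in S has been taken
        if slots <= 0:
            continue
        # Greens are picked uniformly, so g* takes one of the remaining
        # slots with probability slots / greens, capped at 1.
        total += min(Fraction(1), Fraction(slots, greens))
    return total / draws


# One red, two greens (including g*) and four blacks, with s = 5 and q = 2.
print(prob_distinguished_green_chosen(1, 2, 4, s=5, q=2))  # 4/7
```

Enumeration grows combinatorially with the size of the bag, so this is useful only as a check on small instances, not as a replacement for a closed-form count.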

Having isolated an underlying probability model for this problem, we can embellish that model with the characteristics specific to this domain. First, assume that all mediators may be initially searched with equal probability, and that selection from a set of equally ranked mediators is done uniformly. The probability that mediator M is searched, which depends on the total number searched and the total number of mediators, is simply $s/|M|$. Given that M will be searched, the nested summations count the total number of sets of remaining mediators that both could be searched and would result in M receiving the query. The ratio of this total to the number of unrestricted mediator combinations that are possible from the search, $\binom{|M|-1}{s-1}$, provides the final desired probability. The summations work by iterating over the various ways in which the mediator search set might be composed. On each loop, a value is selected for the number i of higher ranked mediators and the number j of equally ranked mediators (other than M) that will exist in the set, the remainder being made up of lower ranked mediators. Since i < q, there will be at least one spot for a mediator of rank $M_r$. There are $\binom{R_r-1}{j}$ ways to choose the j equally valued mediators that compete with M for the available query slots, and the final ratio, $\min(1, (q-i)/(j+1))$, gives the fraction of the resulting selections that contain M.
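As a check on this reading of Equation (3.6), the following sketch evaluates the double summation directly. The names `binom` and `query_probability` are hypothetical, and the inputs (seven mediators in total, a search set of 5 and a query set of 2) are the values shown in Figure 3.3.

```python
from math import comb


def binom(n, k):
    """Binomial coefficient that is 0 outside 0 <= k <= n."""
    return comb(n, k) if 0 <= k <= n else 0


def query_probability(total, s, q, rank, ties):
    """Evaluate Equation (3.6) for a mediator with rank M_r and tie count R_r."""
    lower = total - rank - ties + 1  # number of mediators ranked below M
    count = 0.0
    for i in range(q):                 # higher ranked mediators in the search set
        for j in range(min(s, ties)):  # other equally ranked mediators in it
            count += (binom(lower, s - i - j - 1)
                      * binom(rank - 1, i)
                      * binom(ties - 1, j)
                      * min(1.0, (q - i) / (j + 1)))
    # s / total is the chance that M is searched at all; the binomial in the
    # denominator counts every way to fill the other s - 1 search positions.
    return (s / total) * count / binom(total - 1, s - 1)


# (rank, ties) pairs for the four relevant mediators in Figure 3.3.
for rank, ties in [(1, 1), (2, 2), (2, 2), (4, 1)]:
    print(f"{query_probability(7, 5, 2, rank, ties):.5f}")
# Prints 0.71429, 0.57143, 0.57143 and 0.14286.
```

These values agree with the query_probability fields in Figure 3.3, which supports the reconstruction of the summation bounds and binomial terms used here.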

Node values shown in Figure 3.3:

content_organization: utility = 4.60192, other_mediators = 3, response_recall = 0.52245, response_time = 62.25714
environment: topic_query_rate = 0.002, topic_size = 700, query_set_size = 2, search_set_size = 5

mediator: topic_size = 240, rank = 1, query_probability = 0.71429, response_time = 62.25714
mediator: topic_size = 160, rank = 2, query_probability = 0.57143, response_time = 61.51143
mediator: topic_size = 160, rank = 2, query_probability = 0.57143, response_time = 61.51143
mediator: topic_size = 80, rank = 4, query_probability = 0.14286, response_time = 60.59429

agent: com_load = 1.00571, wrk_load = 0.69429
database: data_size = 100, topic_size = 80
database: data_size = 100, topic_size = 80
agent: com_load = 0.04571, wrk_load = 0.51143
agent: com_load = 0.57857, wrk_load = 0.44714
database: data_size = 100, topic_size = 80
agent: com_load = 0.03857, wrk_load = 0.44714
agent: com_load = 0.57857, wrk_load = 0.44714
database: data_size = 100, topic_size = 80
agent: com_load = 0.03857, wrk_load = 0.44714
agent: com_load = 0.13714, wrk_load = 0.18571
database: data_size = 100, topic_size = 80
agent: com_load = 0.01714, wrk_load = 0.25429

Figure 3.3. An information retrieval instance with variously ranked mediators.

The model in Appendix D uses these equations to determine the final topic query rate for a particular mediator, specifically in the mediator node’s rank, rank ties, query probability and query rate fields.

An example organization showing the effects of this formulation is shown in Figure 3.3. In this instance, there are four mediators, one with three sources, two with two sources each, and one with a single source. All databases in this model have an equal amount of topic data, so a ranking of {1,2,2,4} can be determined among the mediators respectively, as shown in the model. In addition, there are three other mediators in the organization that contain an insignificant amount of topic data and are not graphically shown. These “other” mediators are significant because they can potentially distract the search process, resulting in a decrease in expected utility.

The environment node shows that the search set size in this instance is set to 5, indicating that the collection signatures of five other mediators will be searched. The query set size, the number of mediators from the search set that will actually be queried, is set to 2. Therefore, as the number of “other” mediators grows, the chance that one of the relevant mediators will be found and subsequently queried decreases. The value of |M| in Equation (3.6) is the sum of the relevant mediators and these other mediators. The culmination of these data occurs in the calculation of P(M|Q), shown in the query probability for each mediator. These are used to compute the organization’s response recall, and ultimately affect the utility of the organizational structure.
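The recall formula itself is not given in this section. One reading that reproduces the response_recall value in Figure 3.3 is to treat recall as the expected share of the environment's topic data reached by a query, weighting each mediator's topic size by its query probability. The sketch below is an assumption checked only against the figure's numbers, not a statement of the model's actual definition.

```python
# Values copied from the mediator and environment nodes of Figure 3.3.
topic_sizes = [240, 160, 160, 80]
query_probabilities = [0.71429, 0.57143, 0.57143, 0.14286]
environment_topic_size = 700

# Assumed definition: probability-weighted topic data over total topic data.
expected_recall = sum(
    p * t for p, t in zip(query_probabilities, topic_sizes)
) / environment_topic_size

print(round(expected_recall, 5))  # 0.52245
```

Under this reading, the three "other" mediators reduce recall only indirectly, by lowering the query probabilities of the mediators that hold most of the topic data.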

Figure 3.4. A comparison of the predicted and empirical response recall values across a range of search and query size parameters.

To test this formulation, a set of simulation trials was performed, and the observed response recall was compared to the predicted value for each scenario. The environment consisted of six mediators and nine databases, and each trial consisted of 100 queries from a simulated user to a random mediator in the organization. The first mediator had four of the databases below it, the second had three and the third had two. The remaining three mediators had no databases, and therefore could not provide value to queries, although their presence made it more difficult to find the actual sources of data because they increased the size of the population to be searched.

The perceived response size and actual response size for each mediator were proportional to the number of databases it had access to. In the trials, both the number of mediators that were searched, search set size, and the number of mediators that were queried, query set size, ranged from 1 to 6, producing 36 possible experiments. In practice, only 21 of these were valid, because query set size must be less than or equal to search set size. A graph comparing the values predicted by the ODML model and the empirical results is shown in Figure 3.4. As expected, when the search size is small, the recall suffers, because it is less likely that a good information source will be found. The query set size has a similar but lesser effect. For clarity, the relative error between the predicted and observed values is given in Table 3.1. This shows that the predictions were quite accurate in most cases, with a maximum relative error of 5.9% in one case and an average of 0.9% error over all cases.

| Search Size \ Query Size | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | -0.059 | | | | | |
| 2 | -0.011 | 0.013 | | | | |
| 3 | 0.008 | -0.024 | -0.003 | | | |
| 4 | -0.005 | 0.005 | 0.005 | -0.002 | | |
| 5 | -0.004 | -0.006 | -0.011 | 0.005 | -0.012 | |
| 6 | -0.004 | -0.003 | -0.002 | -0.001 | -0.003 | -0.003 |

Table 3.1. The relative error (i.e., (observed − predicted)/observed) between the predicted and empirical response recall values from Figure 3.4.
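The summary figures quoted for Table 3.1 can be rechecked from the table entries. The sketch below assumes that the "average" is the mean of the absolute relative errors, which is the reading that matches the reported 0.9%. It also counts the valid parameter pairs, taking a pair to be valid when the query set size does not exceed the search set size, as the triangular shape of the table indicates.

```python
# Relative errors from Table 3.1, one row per search set size (1 to 6).
relative_errors = [
    [-0.059],
    [-0.011, 0.013],
    [0.008, -0.024, -0.003],
    [-0.005, 0.005, 0.005, -0.002],
    [-0.004, -0.006, -0.011, 0.005, -0.012],
    [-0.004, -0.003, -0.002, -0.001, -0.003, -0.003],
]

# A (search, query) pair is usable only if no more mediators are queried
# than were searched.
valid_pairs = [(s, q) for s in range(1, 7) for q in range(1, 7) if q <= s]

magnitudes = [abs(e) for row in relative_errors for e in row]
max_error = max(magnitudes)
mean_error = sum(magnitudes) / len(magnitudes)

print(len(valid_pairs))        # 21
print(f"{max_error:.1%}")      # 5.9%
print(f"{mean_error:.1%}")     # 0.9%
```

The largest error belongs to the smallest configuration, a search set and query set of size 1, where a single query outcome has the greatest influence on the observed recall.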

The relationships described here are a good example of how changes to the organization can indirectly affect the characteristics of many, potentially distant parts of the structure. In this case, the perceived, relative quality of a mediator, which is based on the sources under its control, affects the ranking of all other mediators in the organization. These rankings affect query load, which affects the load imposed on the agents, which can affect both the constraints on those agents and the response time of a mediator’s hierarchy as a whole. Thus it is possible for a single source added to some segment of the organization to dramatically affect nodes with which it does not obviously interact. These effects can be subtle yet important, motivating the need for a model such as ODML that is capable of representing them. Section 4.2.1 also shows how this type of indirect interrelationship can make it particularly difficult to determine the validity of an organizational instance prior to its complete construction. I will return to this problem in Chapter 4, which discusses how the organizational design problem can be framed as a search for the most appropriate valid instance.
