The first study in this chapter indicates that the more of the query we can directly expose to the pruning logic of the optimizations, the more chances the algorithms have to elide work. Following this lead, we now examine the correlation between the fraction of the query exposed to pruning and the resulting effectiveness of the optimization algorithms.
3.3.1 Experimental Setup
The software and hardware configurations are the same as before, but we now focus solely on establishing a correlation between an explanatory variable and a response variable.
Note that although we would like to increase the size of the query under the root, the depth of the tree is not as important a factor as the number of computations hidden under the root. Consider the two query constructions in Figure 3.7.
Although the upper query is deeper than the lower one, both queries provide the same opportunities for pruning in their original forms. The scoring nodes used in pruning are indicated by the triangular nodes. If we assume that all of the non-leaf nodes in the query meet our criteria for flattening, then the flattening process will reduce both queries to use only the leaf nodes, indicated by the square nodes in the figure.
Figure 3.7. Comparison of a deep query tree (a) versus a wide tree (b) with the same number of scoring leaves. To today's dynamic optimization algorithms, the two offer the same opportunities for eliding work.
In order to generate an appropriate query set for this experiment, we take the following steps:
1. We intersect the vocabulary of the collection in question with the common American dictionary words on the Ubuntu 12.04 Linux distribution⁴.
2. We shuffle the remaining set of words using the Random class provided by Java 1.6, with a seed value of 100.
3. We then iteratively generate random numbers between 2 and 50, using the Random.nextInt function, until the sum of these numbers is greater than the size of the vocabulary. We then partition the sequence of words according to these numbers, using them as lengths for the resulting subsequences of words (i.e., if the first three numbers are 3, 17, and 20, we take the first 3 words to make the first query, the next 17 words for the second query, the next 20 words for the third query, and so on). The last query has the length of either the number of remaining words or the last number generated, whichever is smaller.

⁴The file used is /usr/share/dict/american-english.
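The shuffling and partitioning steps above can be sketched in Java. This is an illustrative reconstruction, not the original implementation: the class and method names are ours, and we assume for simplicity that the same seeded Random instance drives both the shuffle and the length generation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of steps 2-3: shuffle the dictionary-filtered vocabulary with a
// fixed seed, then cut it into queries of random lengths in [2, 50].
public class QueryGenerator {

    public static List<List<String>> generate(List<String> vocab, long seed) {
        List<String> words = new ArrayList<>(vocab);
        Random rng = new Random(seed);
        Collections.shuffle(words, rng);           // step 2: seeded shuffle
        List<List<String>> queries = new ArrayList<>();
        int start = 0;
        while (start < words.size()) {             // step 3: partition
            int len = 2 + rng.nextInt(49);         // uniform in [2, 50]
            int end = Math.min(start + len, words.size());
            queries.add(new ArrayList<>(words.subList(start, end)));
            start = end;
        }
        return queries;
    }
}
```

Note that the final query is naturally truncated to the remaining words, matching the rule for the last query described in step 3.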
Using the steps above, 2280 queries were generated. For each query we then generate up to 50 feedback terms using Relevance Model-driven PRF. Let Q = q_1 ... q_n be a query, and E = e_1 ... e_m be a set of feedback terms. Keeping in mind that we want to explore the effect of exposing more nodes for pruning via flattening, we generate a particular query as follows. Let b be the branching factor, and let l = ⌊(|Q| + |E|)/2⌋. We vary b between 2 and l; for a given setting of b, that is the number of nodes under an interior node in the query tree. Therefore, if |Q| + |E| = 20 and b = 3, then the tree will have a root node with 6 children, each of which will have 3 leaves under it (the final two terms are dropped to make the branching even).
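The shape of the resulting tree follows from simple integer division over the term count. The helper below is a sketch of this arithmetic (the names are ours, for illustration only):

```java
// Given n = |Q| + |E| terms and branching factor b, each interior node
// holds b leaves, and leftover terms that cannot fill a complete group
// of b are dropped to keep the branching even.
public class TreeShape {
    public static int interiorNodes(int n, int b) {
        return n / b;   // number of children under the root
    }
    public static int droppedTerms(int n, int b) {
        return n % b;   // terms not attached to the tree
    }
}
```

For the worked example in the text, 20 terms with b = 3 yield 6 interior nodes with 2 terms dropped.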
As another example, tree (b) in Figure 3.7 has 24 leaves and b = 8; therefore there are 3 internal nodes, each of which has 8 children. We hypothesize that the branching factor b has a negative correlation with execution time: as b increases, execution time should decrease. We use the ratio of the branching factor b to the sum of terms (feedback terms plus original query terms) as the random variable used for testing. In order to control for as many variables as possible, a sample is generated by grouping the data points by method (either Maxscore or Wand), query id, and the number of feedback terms. We then calculate the correlation between the ratios and times in the sample. Each correlation coefficient is then used as a data point in the analysis.
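Within each sample group, the correlation between the ratios and the measured execution times is the standard Pearson coefficient. A minimal sketch follows; this is our own illustration of the statistic, not the original analysis code:

```java
public class SampleCorrelation {
    // Pearson correlation between paired observations, e.g. the ratio
    // b / (|Q| + |E|) against execution time within one sample group
    // (one method, query id, and feedback-term count).
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0;
        for (int i = 0; i < n; i++) { sx += x[i]; sy += y[i]; }
        double mx = sx / n, my = sy / n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }
}
```

A coefficient near -1 in a group would support the hypothesis that execution time falls as the branching factor grows.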
3.3.2 Results
Figure 3.8 shows a plot of the sample correlation coefficients. The majority of the points lie below -0.36 (the third quartile), and the median and mean values are -0.58 and -0.4853, respectively. From this plot we can infer that most queries benefit from increased exposure of the scoring nodes; however, a small percentage of queries actually suffer from more opportunities to prune. The most likely explanation for the positive correlations is that these queries are “hard” queries, in that they rarely or never trigger pruning. In this situation, the overhead incurred when trying to prune simply slows execution down, and adding more checks that never trigger pruning only exacerbates the problem.