Implementing Query Flattening in Julien

6. A BEHAVIORAL VIEW OF QUERY EXECUTION

6.5 Implementing Multiple Optimizations Concurrently

6.5.1 Implementing Query Flattening in Julien

We now discuss the approach taken to implement a generic form of the flattening optimization, introduced in Chapter 3, in Julien. We implement this optimization by defining the Distributive behavior. The purpose of this behavior is to indicate that the implementing operator is an interpolated subquery and that, if the operator occurs in the correct context, the subquery's operator can be removed. An operator that exhibits this behavior exposes two new methods:

1. distribute() → List[Feature]

2. setChildren(List[Feature])

Given a Feature node N and its children c1, ..., cj, the distribute method reweights every ci by multiplying the weight of N into it, then returns the list of reweighted child nodes. The setChildren method replaces the current set of children with a newly provided set of child nodes. These two methods alone are not enough to cause the query graph to flatten under the right circumstances; since this optimization involves changing the query structure, we add logic to recognize and collapse the correct interpolated query substructure before the query is sent to be processed. Such a transformation is typically referred to as a query rewrite. In Julien, constructing a retrieval pipeline that uses one or more rewriters is no more difficult than adding a function call after initially building the query. Consider Listing 6.6, which shows the basic retrieval stack augmented with a query rewriter called Flattener, which performs the desired transformation.
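Before turning to Listing 6.6, the two methods described above can be made concrete with a minimal sketch. Only the names Distributive, distribute, and setChildren come from the text; the Feature trait shown here, the reweighted helper, and the Term and Combine classes are hypothetical stand-ins, since Julien's actual class hierarchy is not reproduced in this section.

```scala
// Hypothetical minimal model; Julien's real Feature API is not shown in the text.
trait Feature {
  def weight: Double
  // Assumed helper: return a copy of this feature carrying the given weight.
  def reweighted(w: Double): Feature
}

final case class Term(name: String, weight: Double = 1.0) extends Feature {
  def reweighted(w: Double): Feature = copy(weight = w)
}

// The two methods exposed by the Distributive behavior.
trait Distributive extends Feature {
  def children: List[Feature]
  def setChildren(cs: List[Feature]): Unit

  // Multiply this node's weight into each child and return the reweighted children.
  def distribute(): List[Feature] =
    children.map(c => c.reweighted(c.weight * weight))
}

// Hypothetical weighted-sum operator that exhibits the behavior.
final class Combine(val weight: Double, private var kids: List[Feature])
    extends Distributive {
  def children: List[Feature] = kids
  def setChildren(cs: List[Feature]): Unit = { kids = cs }
  def reweighted(w: Double): Feature = new Combine(w, kids)
}
```

In this sketch distribute leaves the node itself untouched and only returns reweighted copies of its children, so the caller decides whether and where to reattach them with setChildren.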

Listing 6.6. Executing a basic retrieval stack in Julien with a simple rewrite.

     1  /* imported classes */
     2
     3  val index: Index = Index.disk("./myIndex")
     4
     5  val queryFeatures: Feature =
     6    generateQuery("2014 winter olympics", index)
     7
     8  val rewritten: Feature = Flattener.rewrite(queryFeatures)
     9
    10  val acc: Accumulator[ScoredDocument] =
    11    DefaultAccumulator[ScoredDocument]()
    12
    13  val results: QueryResults = QueryProcessor(rewritten, acc)

The only change comes on line 8, where the Flattener query rewriter is called to remove interpolated subqueries from the query wherever it can. Note that we explicitly show the insertion of the rewrite step here, but we can easily automate this process more formally by adding a container-type Rewriters module that could have rewriters inserted and sequentially run over the query operators before execution.
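One possible shape for such a container is sketched below. The text names a Rewriters module but does not give its interface, so the Rewriter trait, the add method, and the bare Feature trait here are all assumptions made for illustration.

```scala
// Hypothetical stand-ins: the source names a "Rewriters" module but gives no API.
trait Feature
trait Rewriter { def rewrite(query: Feature): Feature }

final class Rewriters(private val stages: Vector[Rewriter] = Vector.empty) {
  // Return a new container with one more rewriter appended.
  def add(r: Rewriter): Rewriters = new Rewriters(stages :+ r)

  // Run each rewriter in insertion order, feeding each output to the next.
  def rewrite(query: Feature): Feature =
    stages.foldLeft(query)((q, r) => r.rewrite(q))
}
```

Assuming Flattener implemented this Rewriter trait, line 8 of Listing 6.6 could become `new Rewriters().add(Flattener).rewrite(queryFeatures)`. Because every stage maps a Feature to a Feature, the fold accepts any selection of rewriters in any order.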

While this is similar to the traversals system in Galago, there is an important difference. In Galago, the traversals operate over a tree of Node objects that may have no bearing on the actual constructed query; for example, the initial tree produced by parsing the query cannot be materialized into an executable query. A certain set of traversals must operate on the node tree in order to prepare it for execution. In contrast, in Julien the input to, and the output of, every rewriter module is an executable graph of query operators. Therefore, while a certain ordering of rewriters may produce more efficient executions of a particular query, any selection (and any order) of rewriters will produce a valid query operator graph. While this is a significant reduction in complexity, it does have consequences, which we discuss in Section 6.6.

Now that we have determined an appropriate insertion point into the control flow in query processing, all we need is to determine when the rewriter should actively restructure the query. In this case, we must determine the correct context in which this operator should occur (i.e., what is the local query structure) in order to correctly remove the operator.

Assume we are looking at a node N, its parent P, and N's set of children C = {c1, ..., cj}. We are interested in recognizing when we can safely move every ci to be a child of P, thus deleting N from the representation completely. We consider the set of classes that perform the same feature operation (e.g., a combine operator and its subclasses all sum over their children during evaluation) to be an equivalence class.

We can then use the equivalence class relation to ensure that two operators, which may have different complements of behaviors, are derived from the same base class, which actually performs the function of the operator. Based on this definition, the decision criteria we use in the rewriter are simple in this case:

1. Both N and P must be Distributive.

2. N and P must be in an equivalence class.

This logic may seem overly restrictive; however, it greatly simplifies the decision-making process during rewriting. Suppose either N or P is not Distributive; then if we attempted to perform the flattening operation by moving C under P, either we could not reweight the nodes in C (in the case where N is not Distributive), or we could not set C under P (in the case where P is not Distributive). The second restriction, that both N and P be in the same equivalence class, ensures that the same operation will be used over the modified set of children of P.
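The two criteria can be sketched as a small bottom-up rewriter. This is an illustration under simplifying assumptions rather than Julien's Flattener: the Op and Term classes are hypothetical, the Distributive behavior is reduced to a Boolean flag, and the equivalence class is approximated by comparing an operation name.

```scala
// Hypothetical minimal model of the query graph; not Julien's actual classes.
sealed trait Feature { def weight: Double }
final case class Term(name: String, weight: Double = 1.0) extends Feature

// An operator node. `op` names the operation performed and stands in for the
// equivalence class; `distributive` stands in for the Distributive behavior.
final case class Op(op: String, weight: Double, children: List[Feature],
                    distributive: Boolean = true) extends Feature

object Flattener {
  // Multiply an extra weight into a feature.
  private def scale(f: Feature, w: Double): Feature = f match {
    case t: Term => t.copy(weight = t.weight * w)
    case o: Op   => o.copy(weight = o.weight * w)
  }

  // Bottom-up rewrite: flatten the children first, then splice any child N
  // into its parent P when both criteria hold.
  def rewrite(f: Feature): Feature = f match {
    case p: Op =>
      val kids = p.children.map(rewrite).flatMap {
        // Criterion 1: both Distributive. Criterion 2: same equivalence class.
        case n: Op if p.distributive && n.distributive && n.op == p.op =>
          n.children.map(scale(_, n.weight)) // distribute N's weight, drop N
        case other => List(other)
      }
      p.copy(children = kids)
    case leaf => leaf
  }
}
```

For a combine node with weight 1.0 whose children are a nested combine (weight 0.5, over terms a and b) and a term c, the nested node is removed and a and b reappear under the root with their weights halved. If the root instead carried a different `op`, the guard would fail and the graph would be returned unchanged.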

To illustrate the need for the second requirement, Figure 6.5 shows a flattening operation in which the upper and lower nodes are Distributive (as indicated by the double green outlines), but the operations they implement are different. Although the weight can be distributed over the leaves, the sum operations are deleted in the flattening operation, consequently changing the semantics of the retrieval model.

It is possible to loosen the second restriction to allow subtype equivalences between P and N as long as the core mathematics in the subtype relationship is unchanged. We consider that exploration outside the scope of this work, and instead focus on the effects of single-type query flattening on execution efficiency.

[Figure: two summation (Σ) nodes over the leaves t1, ..., tm and tm+1, ..., tm+n]

Figure 6.5. Incorrectly flattening the query. The semantics of the query are changed because the lower summation operations are deleted.
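A small worked example makes the change in semantics concrete. The text does not state which operation the upper node performs, so purely for illustration we assume it is a product over its children and ignore weights. The original query then computes the left-hand side below, while the flattened query computes the right-hand side, and the two differ in general:

```latex
\[
  \Bigl(\sum_{i=1}^{m} t_i\Bigr)\cdot\Bigl(\sum_{j=m+1}^{m+n} t_j\Bigr)
  \;\neq\;
  \prod_{k=1}^{m+n} t_k
\]
% Example with m = n = 2 and leaf scores t_1, ..., t_4 = 1, 2, 3, 4:
\[
  (1 + 2)\cdot(3 + 4) = 21
  \qquad\text{but}\qquad
  1 \cdot 2 \cdot 3 \cdot 4 = 24 .
\]
```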

As a final implementation note for this section, using the Distributive behavior, we can easily define the logic required to determine if a query is eligible for execution using the Maxscore or WAND processors: if the root node of the graph is Distributive, we can safely access the children features of the root and directly use them as the scorers in either Maxscore or WAND.
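This eligibility check can be sketched in a few lines. The Eligibility object and its scorers method are hypothetical names, and the Feature and Distributive traits are reduced to the minimum needed for the check; only the rule itself (the root must be Distributive) comes from the text.

```scala
// Hypothetical minimal types; only the root check mirrors the text.
trait Feature
trait Distributive extends Feature { def children: List[Feature] }

object Eligibility {
  // Some(scorers) when the root is Distributive, None otherwise.
  def scorers(root: Feature): Option[List[Feature]] = root match {
    case d: Distributive => Some(d.children)
    case _               => None
  }
}
```

A caller could fall back to a default processor when the result is None and hand the returned children to a Maxscore or WAND processor otherwise.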
