
2.4 Query Languages in Information Retrieval

2.4.2 Open Source Search Engines

The evolution of open source search engines reflects the gravitation of IR research towards structured queries as well.

One of the earliest open source systems in modern retrieval is the SMART retrieval system (Salton, 1971). SMART used the Vector Space Model (Salton, Wong, & Yang, 1975), and served as the proving ground for a significant portion of the community’s understanding of ranked retrieval. While groundbreaking in many ways, the input interface to SMART is relatively simplistic, and queries can only be entered as simple strings. This kind of interface was standard for information retrieval systems through the 1980s.

The next major advancement came from the INQUERY retrieval system (Callan, Croft, & Harding, 1992). INQUERY was built to implement the Inference Network framework (Turtle & Croft, 1990), a subset of the graphical models domain of reasoning (Manning, Raghavan, & Schütze, 2008; Koller & Friedman, 2009). INQUERY supplied an expressive query language that allowed for query operators to be used on base query terms. Additionally, some operators could be applied over other operators, allowing for query construction using hierarchical element structure. As mentioned earlier, expressing queries with structure was not new, but this represents the first case of structural queries being used in a ranked retrieval setting. The effects of joining these two strategies can be seen in many modern day commercial and open source systems.

² Note that NOT may only apply to a conjunction (i.e. AND (NOT x)). The disjunctive form (i.e. OR (NOT x)) is not informative for boolean query processing.
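To make the idea of operators applied over other operators concrete, the following is a minimal sketch of evaluating a hierarchical query as a tree of belief-combining operators. It is not INQUERY's implementation: the class names `Term` and `Op`, the function `belief`, the default belief value, and the example per-term beliefs are all hypothetical. The combination rules (product for AND, complement of the product of complements for OR, complement for NOT, mean for SUM) are the commonly cited inference-network forms, assumed here for illustration.

```python
from dataclasses import dataclass
from math import prod
from typing import Dict, List, Union


@dataclass
class Term:
    """Leaf node: a base query term."""
    text: str


@dataclass
class Op:
    """Internal node: an operator over terms or other operators."""
    name: str                 # one of "and", "or", "not", "sum"
    children: List["Node"]


Node = Union[Term, Op]


def belief(node: Node, term_beliefs: Dict[str, float], default: float = 0.4) -> float:
    """Recursively combine per-term beliefs for one document.

    `term_beliefs` maps a term to its belief in [0, 1] for this document;
    `default` is an assumed placeholder for terms with no evidence.
    """
    if isinstance(node, Term):
        return term_beliefs.get(node.text, default)
    child_beliefs = [belief(child, term_beliefs, default) for child in node.children]
    if node.name == "and":
        return prod(child_beliefs)
    if node.name == "or":
        return 1.0 - prod(1.0 - b for b in child_beliefs)
    if node.name == "not":
        return 1.0 - child_beliefs[0]
    if node.name == "sum":
        return sum(child_beliefs) / len(child_beliefs)
    raise ValueError(f"unknown operator: {node.name}")


# A two-level query: retrieval AND (query OR language).
query = Op("and", [Term("retrieval"), Op("or", [Term("query"), Term("language")])])
beliefs = {"retrieval": 0.8, "query": 0.5, "language": 0.5}
print(belief(query, beliefs))
```

Because every node returns a belief in the same range, any operator can take another operator as input, which is what gives the query language its hierarchical structure while still producing a single score for ranking.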

The Indri retrieval system (Strohman, Metzler, Turtle, & Croft, 2005) was built to use the Inference Network framework (and the query language that accompanies it) to implement Language Models (Ponte & Croft, 1998). Indri added several new operators to the base set of INQUERY; while this increased the kinds of belief estimations that could be made in Language Models, it did not significantly change the nature of query expression established by INQUERY.

The Galago retrieval system (Strohman, 2007; Croft, Metzler, & Strohman, 2010) serves as an evolutionary step past systems built to statically support a single retrieval model. Galago is not tied to a specific retrieval model, and instead can support arbitrary functions built on term-, document-, and collection-level statistics supplied from the index. The system ships with several retrieval models implemented, and provides several mechanisms for adding new operators. In a sense, this means Galago does not have a bounded query language, which makes it the most expressive retrieval system known to date.
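The notion of a retrieval system that accepts arbitrary scoring functions over index statistics can be illustrated with a small sketch. This is not Galago's API: the `Stats` record, the `OPERATORS` registry, and the `register` decorator are hypothetical names, and the two scoring functions (Dirichlet-smoothed query likelihood and a simple tf-idf) are standard formulas chosen only as examples.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Stats:
    """Statistics an index could supply for one (term, document) pair."""
    tf: int          # term-level: occurrences of the term in the document
    doc_len: int     # document-level: number of tokens in the document
    coll_freq: int   # collection-level: occurrences of the term overall
    coll_len: int    # collection-level: total tokens in the collection
    doc_freq: int    # collection-level: documents containing the term
    num_docs: int    # collection-level: total number of documents


Scorer = Callable[[Stats], float]
OPERATORS: Dict[str, Scorer] = {}   # hypothetical registry of scoring operators


def register(name: str) -> Callable[[Scorer], Scorer]:
    """Add a scoring function to the registry under `name`."""
    def wrap(fn: Scorer) -> Scorer:
        OPERATORS[name] = fn
        return fn
    return wrap


@register("dirichlet")
def dirichlet(s: Stats, mu: float = 1500.0) -> float:
    """Log query likelihood with Dirichlet smoothing (mu is an assumed default)."""
    return math.log((s.tf + mu * s.coll_freq / s.coll_len) / (s.doc_len + mu))


@register("tfidf")
def tfidf(s: Stats) -> float:
    """Raw term frequency times log inverse document frequency."""
    return s.tf * math.log(s.num_docs / s.doc_freq)


stats = Stats(tf=3, doc_len=100, coll_freq=50, coll_len=10000, doc_freq=10, num_docs=1000)
for name, scorer in OPERATORS.items():
    print(name, scorer(stats))
```

In a design of this kind, adding a retrieval model amounts to registering one more function, so the set of usable operators is open-ended rather than fixed by the system.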

Several other open source systems have been built to provide structured query elements similar to those of the original INQUERY system. Zettair is the most straightforward system, allowing for use of unigrams and phrase searches in ranked and Boolean queries (Zobel, Williams, Scholer, Yiannis, & Hein, 2004). A brief view of the Terrier (University of Glasgow, 2011) retrieval system’s query language shows the use of similar constructs: synonyms, ordered and unordered windows, fields, and Boolean search. The Wumpus system makes use of generalized concordance lists (GCLs) (Clarke, Cormack, & Burkowski, 1995), which allow for interesting interactions between extents of text in documents, but do not introduce any new widely adopted operations. Most of the operations enabled by GCLs are directly expressible in the INQUERY query language. The Ivory system³ implements the SMRF framework (Search with Markov Random Fields) (Lin, Metzler, Elsayed, & Wang, 2010).
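The ordered and unordered window constructs named above can be sketched over per-term position lists. The semantics below are one common reading and are stated as assumptions: an ordered window of width N requires the terms to appear in query order with each adjacent pair at most N positions apart, and an unordered window of width N requires one occurrence of every term inside some span of N consecutive positions. The function names are hypothetical and do not correspond to any particular system's API.

```python
from typing import List


def ordered_window(positions: List[List[int]], width: int) -> bool:
    """True if the terms occur in query order, each adjacent pair
    separated by at most `width` positions (assumed semantics)."""
    if not positions:
        return False
    # Positions of the current term that end a valid chain so far.
    reachable = set(positions[0])
    for next_positions in positions[1:]:
        reachable = {
            q for q in next_positions
            if any(0 < q - p <= width for p in reachable)
        }
        if not reachable:
            return False
    return bool(reachable)


def unordered_window(positions: List[List[int]], width: int) -> bool:
    """True if every term occurs at least once inside some span of
    `width` consecutive positions, in any order (assumed semantics)."""
    if not positions:
        return False
    # Flatten to (position, term index) pairs sorted by position.
    events = sorted((p, i) for i, plist in enumerate(positions) for p in plist)
    for start, (lo, _) in enumerate(events):
        seen = set()
        for p, i in events[start:]:
            if p - lo >= width:
                break
            seen.add(i)
        if len(seen) == len(positions):
            return True
    return False


# Hypothetical positions of "open", "source", "search" in one document.
doc = [[0, 10], [1, 20], [5]]
print(ordered_window(doc[:2], 1))   # "open source" as an exact phrase
print(unordered_window(doc, 6))     # all three terms within six positions
```

An exact phrase is the special case of an ordered window of width 1, which is why systems offering windows do not need a separate phrase primitive.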

The SMRF framework was first implemented informally in the Indri search engine (Metzler & Croft, 2005), using the query language built into Indri. Therefore we consider Ivory’s query language no more expressive than (and possibly a restriction of) Indri’s.

One of the newest open source search engines is ATIRE, built at the University of Otago⁴. In an odd reversal of the trend, ATIRE was built without support for term dependencies (i.e., phrases or window operators), as the implementing group has not been convinced that supporting term dependencies is worth the increased complexity in retrieval inference (Trotman, 2012). This point of view shows that even 20 years after the first INQUERY implementation, building a search engine to support structure in ranked retrieval is still not a trivial matter (although it is worth noting that commercial search engines have committed to supporting several types of structure, including phrases).

³ http://lintool.github.io/Ivory/

⁴ http://atire.org/index.php?title=Main_Page

In addition to the academic systems mentioned above, several other open source implementations exist, in many cases as alternatives for “single-site” search installations. These systems tend to be geared for commercial-scale tasks, but are available for use by anyone willing to set the system up. The most well-known of these is the Apache Lucene Core system⁵. Lucene was started in 1999 by Doug Cutting as an exploration of implementing a retrieval system in Java. Over time, the project was adopted by the Apache Software Foundation, and over several iterations, Lucene grew to focus on meeting the needs of specific clients. As a result, the query language of Lucene covers the functionality offered by the commercial search systems, as well as several extensions, including edit distance matching, ranged Boolean queries on fields, and unigram wildcard matches (Apache Software Foundation, 2012).

If we compare the query language elements available in each of the systems reviewed, two tiers of “complexity” emerge. The first is support for just specifying terms, as in SMART, Zettair, or ATIRE. These systems can provide good support for simple keyword retrieval models; however, expressing a more sophisticated model may be difficult without major implementation changes. The second tier is defined largely by INQUERY: systems in this tier support structured query constructs, allowing them to express a much wider range of queries than the simpler systems.

Although Galago technically sits above this tier due to the flexibility of its query language, what a tier above the second would look like is unclear, as no significantly new retrieval models have been implemented in the system (possibly due to a lack of retrieval models that clearly “out-express” the one put forth by INQUERY). A formal exploration of new types of queries is beyond the scope of this thesis; we discuss possibilities for research in Section 7.2. Consequently, we place Galago in the second tier, along with the other systems that support structured queries.

⁵ http://lucene.apache.org
