

Solr 1.4 Enterprise Search Server

Enhance your search with faceted navigation, result highlighting, fuzzy queries, ranked scoring, and more

David Smiley, Eric Pugh

BIRMINGHAM - MUMBAI


All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2009

Production Reference: 1120809

Published by Packt Publishing Ltd.

32 Lincoln Road, Olton

Birmingham, B27 6PA, UK

ISBN 978-1-847195-88-3

www.packtpub.com

Cover Image by Harmeet Singh (singharmeet@yahoo.com)


Authors

David Smiley, Eric Pugh

Reviewers

James Brady, Jerome Eteve


About the Authors

Born to code, David Smiley is a senior software developer and loves programming. He has 10 years of experience in the defense industry at MITRE, using Java and various web technologies. David is a strong believer in the open source development model and has made small contributions to various projects over the years.

David began using Lucene way back in 2000 during its infancy and was immediately excited by it and its future potential. He later went on to use the Lucene-based "Compass" library to construct a very basic search server, similar in spirit to Solr. Since then, David has used Solr in a major search project and was able to contribute modifications back to the Solr community. Although preferring open source solutions, David has also been trained on the commercial Endeca search platform and is currently using that product as well as Solr for different projects.


Most, if not all, authors seem to dedicate their book to someone. As simply a reader of books, I have thought of this seeming prerequisite as customary tradition. That was my feeling before I embarked on writing about Solr, a project that has sapped my previously "free" time on nights and weekends for a year. I chose this sacrifice and would not change it, but my wife, family, and friends did not choose it. I am married to my lovely wife Sylvie, who has sacrificed easily as much as I have to complete this book. She has suffered through this time with an absentee husband while bearing our first child, Camille. She was born about a week before the completion of my first draft and has been the apple of my eye ever since. I officially dedicate this book to my wife Sylvie and my daughter Camille, both of whom I lovingly adore. I also pledge to read book dedications with newfound firsthand experience at what the dedication represents.

I would also like to thank others who helped bring this book to fruition. Namely, if it were not for Doug Cutting creating Lucene with an open source license, there would be no Solr. Furthermore, CNet's decision in 2006 to open source Solr itself, which had been an in-house project, deserves praise. Many corporations do not understand that open source isn't just "free code" that others wrote; it is an opportunity to let your code flourish on the outside instead of withering inside. Finally, I thank the team at Packt, who were particularly patient with me as a first-time author writing at a pace that left a lot to be desired.

Last but not least, this book would not have been completed in a reasonable time were it not for the assistance of my contributing author, Eric Pugh. His perspectives and experiences have complemented mine so well that I am absolutely certain the quality of this book is much better than what I could have done alone.

Thank you all.


Fascinated by the 'craft' of software development, Eric Pugh has been heavily involved in the open source world as a developer, committer, and user for the past five years. He is an emeritus member of the Apache Software Foundation and lately has been mulling over how we move from the read/write Web to the read/write/share Web.

In biotech, financial services, and defense IT, he has helped European and American companies develop coherent strategies for embracing open source software. As a speaker, he has advocated the advantages of Agile practices in software development.

Eric became involved with Solr when he submitted the patch SOLR-284 for Parsing Rich Document types such as PDF and MS Office formats, which became the single most popular patch as measured by votes! The patch was subsequently cleaned up and enhanced by three other individuals, demonstrating the power of the open source model to build great code collaboratively. SOLR-284 was eventually refactored into Solr Cell as part of Solr version 1.4.

He blogs at http://www.opensourceconnections.com/blog/.


Throughout my life I have been helped by so many people, but all too rarely do I get to explicitly thank them. This book is arguably one of the high points of my career, and as I wrote it, I thought about all the people who have provided encouragement, mentoring, and the occasional push to succeed. First off, I would like to thank Erik Hatcher, author, entrepreneur, and great family man, for introducing me to the world of open source software. My first hesitant patch to Ant was made under his tutelage, and later my interest in Solr was fanned by his advocacy. Thanks to Harry Sleeper for taking a chance on a first-time conference speaker; he moved me from thinking of myself as a developer improving myself to thinking of myself as a consultant improving the world (of software!). His team at MITRE are some of the most passionate developers I have met, and it was through them I met my co-author David. I owe a huge debt of gratitude to David Smiley. He has encouraged me, coached me, and put up with my lack of respect for book deadlines, making this book project a very positive experience! I look forward to the next one. With my new son Morgan at home, I could only have done this project with a generous support of time from my company, OpenSource Connections. I am incredibly proud of what o19s is accomplishing!

Lastly, to all the folks in the Solr/Lucene community who took the time to review early drafts and provide feedback: Solr is at the tipping point of becoming the "it" search engine because of your passion and commitment.

I am who I am because of my wife, Kate. Schweetie, real life for me began when we met. Thank you.


About the Reviewers

James Brady is an entrepreneur and software developer living in San Francisco, CA. Originally from England, James discovered his passion for computer science and programming while at Cambridge University. Upon graduation, James worked as a software engineer at IBM's Hursley Park laboratory, a role which taught him many things, most importantly, his desire to work in a small company.

In January 2008, James founded WebMynd Corp., which received angel funding from the Y Combinator fund, and he relocated to San Francisco. WebMynd is one of the largest installations of Solr, indexing up to two million HTML documents per day, and making heavy use of Solr's multicore features to enable a partially active index.

Jerome Eteve holds a BSc in physics, maths and computing and an MSc in IT and bioinformatics from the University of Lille (France). After starting his career in the field of bioinformatics, where he worked as a biological data management and analysis consultant, he's now a senior web developer with interests ranging from database level issues to user experience online. He's passionate about open source technologies, search engines, and web application architecture. He has been working since 2006 for Careerjet Ltd, a worldwide job search engine.

Table of Contents

One combined index or multiple indices

Step 1: Determine which searches are going to be powered by Solr

Denormalizing: "one-to-one" associated data

Denormalizing: "one-to-many" associated data

Step 4: (Optional) Omit the inclusion of fields

Solr's generic XML structured data representation

Limitations of prohibited clauses in sub-expressions

Take a walk on the wild side! Use JRuby to extract JMX information

Building a Solr powered artists autocomplete widget with

Populating MyFaves relational database from Solr

Optimizing a single Solr server (Scale High)

Moving to multiple Solr servers (Scale Wide)

Combining replication and sharding (Scale Deep)


Text search has been around for perhaps longer than we all can remember. Just about all systems, from client installed software to web sites to the web itself, have search. Yet there is a big difference between the best search experiences and the mediocre, unmemorable ones. If you want the application you're building to stand out above the rest, then it's got to have great search features. If you leave this to the capabilities of a database, then it's near impossible that you're going to get a great search experience, because it's not going to have features that users come to expect in a great search. With Solr, the leading open source search server, you'll tap into a host of features from highlighting search results to spell-checking to faceting.

As you read Solr Enterprise Search Server you'll be guided through all of the aspects of Solr, from the initial download to eventual deployment and performance optimization. Nearly all the options of Solr are listed and described here, thus making this book a resource to turn to as you implement your Solr based solution. The book contains code examples in several programming languages that explore various integration options, such as implementing query auto-complete in a web browser and integrating a web crawler. You'll find these working examples in the online supplement to the book along with a large, real-world, openly available data set from MusicBrainz.org. Furthermore, you will also find instructions on accessing a Solr image readily deployed from within Amazon's Elastic Compute Cloud.

Solr Enterprise Search Server targets the Solr 1.4 version. However, as this book went to print prior to Solr 1.4's release, two features were not incorporated into the book: search result clustering and trie-range numeric fields.


What this book covers

Chapter 1, Quick Starting Solr introduces Solr to the reader as a middle ground between database technology and document/web crawlers. The reader is guided through the Solr distribution including running the sample configuration with sample data.

Chapter 2, The Schema and Text Analysis is all about Solr's schema. The schema design is an important first order of business along with the related text analysis configuration.

Chapter 3, Indexing Data details several methods to import data; most of them can be used to bring the MusicBrainz data set into the index. A popular Solr extension called the DataImportHandler is demonstrated too.

Chapter 4, Basic Searching is a thorough reference to Solr's query syntax from the basics to range queries. Factors influencing Solr's scoring algorithm are explained here, as well as diagnostic output essential to understanding how the query worked and how a score is computed.

Chapter 5, Enhanced Searching moves on to more querying topics. Various score boosting methods are explained, from those based on record-level data to those that match particular fields or those that contain certain words. Next, faceting is a major subject area of this chapter. Finally, term auto-complete, which is implemented by the faceting mechanism, is demonstrated.

Chapter 6, Search Components covers a variety of searching extras in the form of Solr "components", namely, spell-check suggestions, highlighting search results, computing statistics of numeric fields, editorial alterations to specific user queries, and finding other records "more like this".

Chapter 7, Deployment moves from running Solr from a developer-centric perspective to deploying and running it as a production enterprise service that is secure, has robust logging, and can be managed by System Administrators.

Chapter 8, Integrating Solr surveys a plethora of integration options for Solr, from supported client libraries in Java, JavaScript, and Ruby, to being able to consume Solr results in XML, JSON, and even PHP syntaxes. We'll look at some best practices and approaches for integrating Solr into your web application.

Chapter 9, Scaling Solr looks at how to scale Solr up and out to avoid meltdown and meet performance expectations. This information varies from small changes of configuration files to architectural options.


Who this book is for

This book is for developers who would like to use Solr to implement a search capability for their applications. You need only to have basic programming skills to use Solr; extending or modifying Solr itself requires Java programming. Knowledge of Lucene, the foundation of Solr, is certainly a bonus.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text are shown as follows: "These are essentially defaults for searches that are processed by Solr request handlers defined in solrconfig.xml."

A block of code is set as follows:

<defaultSearchField>text</defaultSearchField>

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

<str>mccm.pdf</str>

</arr>

Any command-line input or output is written as follows:

>> curl http://localhost:8983/solr/karaoke/update/ -H "Content-Type: text/xml" --data-binary '<commit waitFlush="false"/>'

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Take for example the Top Voters section".

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.


to develop titles that you really get the most out of

To send us general feedback, simply send an email to feedback@packtpub.com, and mention the book title via the subject of your message.

If there is a book that you need and would like to see us publish, please send us a note in the SUGGEST A TITLE form on www.packtpub.com or email suggest@packtpub.com.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book on, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code for the book

Visit http://www.packtpub.com/files/code/5883_Code.zip to directly download the example code.

The downloadable files contain instructions on how to use them.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you would report this to us. By doing so, you can save other readers from frustration, and help us to improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the let us know link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata added to any list of existing errata. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.


Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or web site name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at questions@packtpub.com if you are having a problem with any aspect of the book, and we will do our best to address it.


Quick Starting Solr

Welcome to Solr! You've made an excellent choice in picking a technology to power your searching needs. In this chapter, we're going to cover the following topics:

An overview of what Solr and Lucene are all about

What makes Solr different from other database technologies

How to get Solr, what's included, and what is where

Running Solr and importing sample data

A quick tour of the interface and key configuration files

An introduction to Solr

Solr is an open source enterprise search server. It is a mature product powering search for public sites like CNet, Zappos, and Netflix, as well as intranet sites. It is written in Java, and that language is used to further extend/modify Solr. However, because Solr is a server that communicates using standards such as HTTP and XML, knowledge of Java is useful but not strictly a requirement. In addition to the standard ability to return a list of search results for some query, it has numerous other features such as result highlighting, faceted navigation (for example, the ones found on most e-commerce sites), query spell correction, auto-suggest queries, and "more like this" for finding similar documents.

Common Solr Usage


Lucene, the underlying engine

Before describing Solr, it is best to start with Apache Lucene, the core technology underlying it. Lucene is an open source, high-performance text search engine library. Lucene was developed and open sourced by Doug Cutting in 2000 and has evolved and matured since then with a strong online community. Being just a code library, Lucene is not a server and certainly isn't a web crawler either. This is an important fact. There aren't even any configuration files. In order to use Lucene directly, one writes code to store and query an index stored on a disk. The major features found in Lucene are as follows:

A text-based inverted index persistent storage for efficient retrieval of documents by indexed terms

A rich set of text analyzers to transform a string of text into a series of terms (words), which are the fundamental units indexed and searched

A query syntax with a parser and a variety of query types from a simple term lookup to exotic fuzzy matches

A good scoring algorithm based on sound Information Retrieval (IR) principles to produce the more likely candidates first, with flexible means to affect the scoring

A highlighter feature to show words found in context

A query spellchecker based on indexed content

For even more information on the query spellchecker, check out the Lucene In Action book (LINA for short) by Erik Hatcher and Otis Gospodnetić.
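To make the first two features in the list above more concrete, here is a minimal sketch of an inverted index with a toy analyzer. It is written in Python purely for illustration and is not Lucene's API or implementation; the function names (analyze, build_index, search) and the sample documents are assumptions invented for this example.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = analyze(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical sample documents, invented for this illustration.
docs = {
    1: "Solr is an enterprise search server",
    2: "Lucene is a text search library",
    3: "A database is not a search engine",
}
index = build_index(docs)
print(search(index, "search library"))  # prints [2]
```

Looking up a term is a dictionary access rather than a scan over every document, which is the reason an inverted index retrieves documents efficiently by indexed terms. A real analyzer chain would typically also handle concerns such as stemming and stop words, and scoring is left out of this sketch entirely.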

Solr, the Server-ization of Lucene

With the definition of Lucene behind us, Solr can be described succinctly as the server-ization of Lucene. However, it is definitely not a thin wrapper around the Lucene libraries. Most of Solr's features, such as faceting, are distinct from Lucene, but not far into the implementation the line between what is Solr and what is Lucene becomes blurred. Without further ado, here is the major feature-set in Solr:

HTTP request processing for indexing and querying documents

Several caches for faster query responses

A web-based administrative interface including:

Runtime performance statistics including cache hit/miss rates


A query form to search the index

A schema browser with histograms of popular terms along with some statistics

Detailed breakdown of scoring mathematics and text analysis phases

Configuration files for the schema and the server itself (in XML)

Solr adds to Lucene's text analysis library and makes it configurable through XML

Introduces the notion of a field type (this is important yet surprisingly not in Lucene). Types are present for dates and special sorting concerns

The disjunction-max query handler is more usable by end user queries and applications than Lucene's underlying raw queries

Faceting of query results

A spell check plugin used for making alternative query suggestions (that is, "did you mean _")

A more like this plugin to list documents that are similar to a chosen document

A distributed Solr server model with supporting scripts to support larger scale deployments

These features will be covered in more detail in later chapters.
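Because querying happens over plain HTTP, a client in any language only has to assemble a URL. The sketch below, in Python, builds a request URL for Solr's /select handler with faceting and highlighting switched on; it only constructs the string and does not contact a server. The helper name build_select_url, the base address http://localhost:8983/solr, and the field name genre are assumptions made for illustration, and whether a given parameter has any effect depends on the schema and request handler configuration of the Solr instance being queried.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, facet_field=None, highlight=False, rows=10):
    """Build a URL for Solr's /select handler without sending any request."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_field:
        # Ask Solr to count the matching documents per value of this field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    if highlight:
        # Ask Solr to return highlighted snippets for the matched terms.
        params.append(("hl", "true"))
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Assumed local Solr address and a hypothetical field name "genre".
url = build_select_url("http://localhost:8983/solr", "pink floyd",
                       facet_field="genre", highlight=True)
print(url)
```

The resulting URL can be pasted into a browser or passed to curl, which is one reason Java knowledge is not strictly required to use Solr.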

Comparison to database technology

Knowledge of relational databases (often abbreviated RDBMS or just database for short) is an increasingly common skill that developers possess. A database and a [Lucene] search index aren't dramatically different conceptually. So let's start off by assuming that you know database basics, and I'll describe how a search index

Title	Solr 1.4 Enterprise Search Server
Author	David Smiley, Eric Pugh
Publisher	Packt Publishing
Type	Book
Year of publication	2009
City	Birmingham
