
Dave Cassel

MarkLogic Cookbook

Documents, Triples, and Values:

Powering Search

Beijing  Boston  Farnham  Sebastopol  Tokyo


MarkLogic Cookbook

by Dave Cassel

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt

Production Editor: Kristen Brown

Copyeditor: Sonia Saruba

Interior Designer: David Futato

Cover Designer: Karen Montgomery

Illustrator: Rebecca Demarest

August 2017: First Edition

Revision History for the First Edition


Table of Contents

Introduction

1 Document Searches
    Search by Root Element
    Find Documents That Are Missing an Element

2 Scoring Search Results
    Sort Results to Promote Recent Documents
    Weigh Matches Based on Document Parts

3 Understanding Your Data and How It Gets Used
    Logging Search Requests
    Count Documents in Directories

4 Searching with the Optic API
    Paging Over Results
    Group By
    Extract Content from Retrieved Documents
    Select Documents Based on Criteria in Joined Documents


MarkLogic is a database capable of storing many types of data, but it also includes a search engine built into the core, complete with an integrated suite of indexes working across multiple data models. This combination allows for a simpler architecture (one software system to deploy, configure, and maintain rather than two), simpler application-level code (application code goes to one resource for query and search, rather than two), and better security (because the search engine has the same security configuration as the database and is updated transactionally whenever data changes).

The recipes in this book, the second of a three-part series, provide guidance on how to solve common search-related problems. Some of the recipes work with older versions of MarkLogic, while others take advantage of newer features in MarkLogic 9.

MarkLogic supports both XQuery and JavaScript as internal languages. Most of the recipes in this book are written in JavaScript, but have corresponding XQuery versions at http://developer.marklogic.com/recipes. JavaScript is very well suited for JSON content, while XQuery is great for XML; both are natively managed inside of MarkLogic.

Recipes are a useful way to distill simple solutions to common problems: copy and paste these into MarkLogic’s Query Console or your source code, and you’ve solved the problem. In choosing recipes for this book, I looked for a couple of factors. First, I wanted problems that occur with some frequency. Some problems in this book are more common than others, but all occur often enough in real-world situations that one of my colleagues wrote down a solution. Second, I selected four recipes that illustrate how to use the new Optic API, to help developers get used to that feature. Finally, some recipes require explanations that provide insight into how to approach programming with MarkLogic.

Developers will get the most value from these recipes and the accompanying discussions after they’ve worked with MarkLogic for at least a few months and built an application or two. If you’re just getting started, I suggest spending some time on MarkLogic University classes first, and then coming back to this material.

If you would like to suggest or request a recipe, please write to

on the content and made sure I actually got this done. Thank you to all!


CHAPTER 1 Document Searches

Finding documents is a core feature for searching in MarkLogic. Searches often begin with looking for simple words or phrases. Facets in the user interface, in the form of lists, graphs, or maps, allow users to drill into results. But MarkLogic’s Universal Index also captures the structure of documents.

The recipes in this chapter take advantage of the Universal Index to find documents with a specific root element and to look for documents that are missing some type of structure.

Search by Root Element

Problem

You want to look for documents that have a particular root XML element or JSON property and combine that with other search criteria.

Solution

Applies to MarkLogic versions 7 and higher

(: Return a query that finds documents with
 : the specified root element :)
declare function local:query-root($qname as xs:QName)
{
  let $ns := fn:namespace-uri-from-QName($qname)
  let $prefix := if ($ns eq "") then "" else "pre:"
  return


You can then call it like this:

declare namespace ml = "http://marklogic.com";

It’s easy to find all the documents that have a particular root element or property: use XPath (/ml:base). However, that limits the other search criteria you can use. For instance, you can’t combine a cts:collection-query with XPath. What we need is a way to express /ml:base as a cts:query.

The local:query-root function in the solution returns a cts:term-query that finds the target element as a root. We’re using a bit of trickery to get there (including the fact that cts:term-query is an undocumented function). Let’s dig in a bit deeper to see what’s happening.

We can use xdmp:plan to ask MarkLogic how it will evaluate an XPath expression like this:

declare namespace ml = "http://marklogic.com";

xdmp:plan(/ml:base)

The result looks like this (note that if you run this, the identifiers will be different):


<qry:info-trace>Step 2 is searchable: ml:base</qry:info-trace>

<qry:info-trace>Path is fully searchable.</qry:info-trace>

<qry:info-trace>Gathering constraints.</qry:info-trace>

<qry:info-trace>Executing search.</qry:info-trace>

or properties. Exactly what is recorded depends on the settings you have configured in your database. In each case, the word or structure is mapped to a key.

Take another look at the <final-plan> element: this is the query that MarkLogic will run. We can see that it’s using a term query, and the annotation tells us what it means. A bit of XPath pulls out that index key, which we then use to build a cts:query that we can combine with other queries.


declare namespace qry = "http://marklogic.com/cts/query";
declare namespace ml = "http://marklogic.com";

xdmp:plan(/ml:base)/qry:final-plan//qry:term-query/qry:key

So why are we using xdmp:value? We can run xdmp:plan with an explicit XPath expression, but if we want to work with a dynamic path (provided at runtime), then we can’t build a string and pass it to xdmp:plan. However, we can build a string that includes the reference to xdmp:plan and then pass the whole thing to xdmp:value, which will evaluate it. xdmp:value also accepts bindings, which allow us to use namespaces in the string we pass into xdmp:plan.

I used xdmp:with-namespaces so that the function can be self-contained. Without that, the code would require the qry namespace declaration at the top of the module where the local:query-root function lives.

One more interesting bit: notice $prefix as part of the string passed to xdmp:value. With a QName, there might be a prefix (if constructed with xs:QName) or there might not be (if constructed with fn:QName or if the QName doesn’t use a namespace). To handle all these cases, the recipe assigns whatever namespace is present to the prefix “pre.” However, if the namespace URI is the empty string, then we skip the prefix in the XPath that we send to xdmp:plan.

That last complexity is there because the parameter to the function takes an xs:QName. The function could be written to take a string (like /ml:base), or a namespace and a localname. Requiring an xs:QName lets the caller build the QName using any of the available methods (xs:QName, fn:QName; note that this approach doesn’t create any prefix), but also limits what goes into xdmp:value. Keeping tight control over this data typing is important to prevent code injection.
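The prefix rule and the injection concern can be illustrated outside MarkLogic. The following is a minimal sketch in plain JavaScript, not the recipe's code: buildRootPath is a hypothetical helper, and the name check is an assumed, deliberately conservative ASCII subset of what XML allows in a local name. It only shows how the string handed to an evaluator might be kept under tight control.

```javascript
// Hypothetical helper (not part of the recipe): mirrors the $prefix rule
// described above, using plain strings instead of an xs:QName.
function buildRootPath(namespaceUri, localName) {
  // Assumption: accept only a conservative ASCII subset of XML names, so that
  // nothing resembling XPath or code can reach the evaluated string.
  if (!/^[A-Za-z_][A-Za-z0-9._-]*$/.test(localName)) {
    throw new TypeError('not a simple element name: ' + localName);
  }
  // Same rule as the recipe: skip the prefix when the namespace URI is empty.
  const prefix = namespaceUri === '' ? '' : 'pre:';
  return '/' + prefix + localName;
}

console.log(buildRootPath('http://marklogic.com', 'base'));
console.log(buildRootPath('', 'base'));
```

Accepting a typed value such as an xs:QName, as the recipe does, achieves the same restriction without a hand-written pattern, which is arguably the more robust choice.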

See Also

• Documentation: “Understanding Namespaces in XQuery” (XQuery and XSLT Reference Guide)


Find Documents That Are Missing an Element

Problem

You want to find all XML documents that are missing a particular element. This can be used to find documents that have not yet gone through some transformation.

cts:element-query is a useful way to constrain a search to part of a document. The function restricts the nested query to matching within the specified XML element. Without the cts:not-query, this same approach can be used to find documents that do have a particular element, or to find terms that occur within a specific element.

The query passed to cts:element-query is cts:true-query for MarkLogic 8 and later, and cts:and-query(()) for MarkLogic 7 and earlier. cts:true-query does what it sounds like: it matches everything. Passed to cts:element-query, this provides a simple way to test for the existence of an element. If you’re using a version of MarkLogic that predates cts:true-query, the way to simulate it is to use cts:and-query and pass in the empty sequence to it. An and-query matches if all queries passed into it are true; if none are passed in, then it matches, thus making cts:and-query(()) work the same as cts:true-query.
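The reasoning about the empty and-query can be checked with a toy model. The sketch below is plain JavaScript and does not call MarkLogic: the functions trueQuery, andQuery, notQuery, and elementQuery are illustrative stand-ins that treat a query as a predicate over an in-memory object, and the sample documents are invented for the example.

```javascript
// Toy model only: each "query" is a predicate over a plain object.
const trueQuery = () => () => true;
// An and-query matches when every nested query matches; with no nested
// queries, Array.prototype.every returns true (vacuous truth).
const andQuery = (queries) => (doc) => queries.every((q) => q(doc));
const notQuery = (query) => (doc) => !query(doc);
// Matches when the object has the named top-level property and the nested
// query matches inside it. (Only top-level properties are modeled here.)
const elementQuery = (name, query) => (doc) =>
  Object.prototype.hasOwnProperty.call(doc, name) && query(doc[name]);

// Invented sample data: one document has been transformed, one has not.
const docs = [
  { uri: '/a.json', title: 'A', transformed: { by: 'job-1' } },
  { uri: '/b.json', title: 'B' },
];

const missing = docs.filter((d) => notQuery(elementQuery('transformed', trueQuery()))(d));
const missingOldStyle = docs.filter((d) => notQuery(elementQuery('transformed', andQuery([])))(d));

console.log(missing.map((d) => d.uri));
console.log(missingOldStyle.map((d) => d.uri));
```

Both filters select the same documents, which is the equivalence the discussion relies on.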


CHAPTER 2 Scoring Search Results

MarkLogic is a database that contains a powerful search engine. There are advantages to this, such as the fact that data does not need to be replicated to a search engine to provide that functionality, search results are up to date as soon as a transaction completes, and the search is subject to the same security as the database content.

While running a search, MarkLogic assigns a score that accounts for the frequency of your target terms within the database, the frequency of the terms within each document, and the length of the document. For a detailed explanation of how scores are calculated, see “Understanding How Scores and Relevance are Calculated” in the Search Developer’s Guide.

The recipes in this chapter show some tricks to affect the way search results are scored.

Sort Results to Promote Recent Documents

Problem

Show more recent documents higher in a result set than older documents. For instance, when searching blog posts, more recent content is more likely to be current and relevant than older content.


With server-side code:

var jsearch = require('/MarkLogic/jsearch.sjs');


preferring recent content, or in finding documents with a geospatial component near a particular point.

In the example above, our content documents have an element called pubdate. If we set up a dateTime index on this element, then we can do range queries. We might use those to limit our results to just content within the last year, but in this case, the goal is just to affect the scoring. As such, the JSearch example performs a <= comparison with the current date and time; we’d expect this to match all documents (note that documents without a score will fail to match and will drop out of the result set). The current date and time provides an anchor for the comparison; the distance between a document’s pubdate value and the anchor value is fed into the reciprocal score function. This means that the more recent documents will get a boost in score. You may want to adjust the weight parameter to the element range query to tune how much impact recency has.

To use this approach with the REST API, create a range constraint and specify the score-function=reciprocal range option. You’ll need to provide an anchor point with the constraint, for instance

your middle tier, combining it with the user inputs

The score-function option can be reversed by specifying the linear function. This rewards values that are further away from the anchor value. In the case of pubdate, score-function=linear would favor older documents. This could be useful for a content manager looking for content that needs to be updated.
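To make the contrast between the two score functions concrete, here is a toy model in plain JavaScript. It is not MarkLogic's internal formula, which this text does not give: reciprocalBoost and linearBoost are hypothetical functions, and measuring distance in days is an arbitrary assumption chosen for readability. The sketch only shows the direction of the effect, with the weight acting as a simple multiplier.

```javascript
// Toy model only: NOT MarkLogic's scoring formula.
const DAY_MS = 24 * 60 * 60 * 1000;

// Distance between the anchor and a document's date, in days (assumed unit).
function distanceInDays(anchor, value) {
  return Math.abs(anchor.getTime() - value.getTime()) / DAY_MS;
}

// Reciprocal: the closer the value is to the anchor, the larger the boost.
function reciprocalBoost(anchor, value, weight = 1.0) {
  return weight / (1 + distanceInDays(anchor, value));
}

// Linear: the further the value is from the anchor, the larger the boost.
function linearBoost(anchor, value, weight = 1.0) {
  return weight * distanceInDays(anchor, value);
}

// Invented dates for illustration.
const anchor = new Date('2017-08-01T00:00:00Z');
const recent = new Date('2017-07-31T00:00:00Z');
const old = new Date('2016-08-01T00:00:00Z');

console.log(reciprocalBoost(anchor, recent) > reciprocalBoost(anchor, old));
console.log(linearBoost(anchor, old) > linearBoost(anchor, recent));
```

With the reciprocal shape the recent document comes out ahead; with the linear shape the older one does, matching the content-manager use case described above.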

See Also

• XQuery version: “Sort results to promote recent documents”

• Documentation: “Range Query Scoring Examples” (Search Developer’s Guide)

Weigh Matches Based on Document Parts

Problem

When doing a text search, some matches are more valuable than others. For instance, if you’re searching for a book, a match in an ISBN field is a sure thing, a match on the title or author is very useful, a match in the abstract is good, and a match in the rest of the text is a normal hit.

Solution

Part of the challenge of rewarding matches from different parts of a document is determining how to weight them. To do that, start with an easily adjustable query, like this one:

let $text := ("databases")


Here’s an updated query to use the field:

let $text := ("databases")

to see first? Probably the one with the title match, since the title will likely have key terms in it. The summary is a bit bigger, but describes the general purpose of the content. The rest of the content may have lots of terms that are much more broadly related. This is the intuition that drives awarding higher scores to matches in different parts of a document.

MarkLogic’s cts: queries take a weight parameter. The default value is 1.0, but you can set it in a range from 64 to -16. The higher the value, the more points a match earns. Since we’re using cts:element-query, we need to turn on the word-positions and element-word-positions indexes.

The biggest challenge with this scoring is figuring out how to weight the various parts of the document. How much more relevant is a match in the title than a match in the summary? The answer will be application-specific and requires experimentation. Setting up an or-query makes it easy to run a set of experiments.
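One way to think about such experiments is with a small in-memory model before touching real queries. The sketch below is plain JavaScript and only a rough stand-in: scoreDocument and rank are hypothetical helpers, the documents and the weights (8, 4, 1) are invented, and the scoring simply adds the weight of every part that contains the term, which ignores the term and document frequencies that MarkLogic's real scores take into account.

```javascript
// Toy model only: add up the weight of each document part containing the term.
function scoreDocument(doc, term, weights) {
  const wanted = term.toLowerCase();
  let score = 0;
  for (const [part, weight] of Object.entries(weights)) {
    const words = String(doc[part] || '').toLowerCase().split(/\W+/);
    if (words.includes(wanted)) {
      score += weight;
    }
  }
  return score;
}

// Rank matching documents by descending score, like an or-query over the parts.
function rank(docs, term, weights) {
  return docs
    .map((doc) => ({ uri: doc.uri, score: scoreDocument(doc, term, weights) }))
    .filter((result) => result.score > 0)
    .sort((a, b) => b.score - a.score);
}

// Invented sample documents and arbitrary starting weights.
const docs = [
  { uri: '/body-hit.json', title: 'Misc notes', summary: 'Assorted.', body: 'Some databases are mentioned.' },
  { uri: '/title-hit.json', title: 'Databases in practice', summary: 'An overview.', body: 'Text.' },
];
const weights = { title: 8, summary: 4, body: 1 };

console.log(rank(docs, 'databases', weights));
```

Changing the numbers in weights and re-running is the kind of quick experiment the or-query form makes possible; the real tuning still has to be done against actual content.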

Once you have settled on the weights, you can simplify the query code by creating a field. The field will include specification of the paths and their relative weights.


CHAPTER 3 Understanding Your Data and How It Gets Used

MarkLogic provides a platform for storing large amounts of heterogeneous data. Understanding what a database holds and how your users interact with it is key to improving the content over time. The first recipe in this chapter shows how to log the searches that your users are running. Based on this information, you can discover gaps in your content or see what provides the best draw to your application. The second recipe analyzes how content is divided among directories, which are likely used to contain logical or physical segments of your data.

Logging Search Requests

Problem

Record searches run by users, in order to build a recommendation system, understand user needs, or determine what type of content to add. The goal is to record more information than the access logs would provide, and perhaps to associate it with user profiles.

Solution

There are a variety of ways to implement your search feature. If you are using XQuery or JavaScript main modules to provide this
