Dave Cassel
MarkLogic Cookbook
Documents, Triples, and Values:
Powering Search
Beijing Boston Farnham Sebastopol Tokyo
MarkLogic Cookbook
by Dave Cassel
Copyright © 2017 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Shannon Cutt
Production Editor: Kristen Brown
Copyeditor: Sonia Saruba
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
August 2017: First Edition
Revision History for the First Edition
Table of Contents
Introduction
1. Document Searches
   Search by Root Element
   Find Documents That Are Missing an Element
2. Scoring Search Results
   Sort Results to Promote Recent Documents
   Weigh Matches Based on Document Parts
3. Understanding Your Data and How It Gets Used
   Logging Search Requests
   Count Documents in Directories
4. Searching with the Optic API
   Paging Over Results
   Group By
   Extract Content from Retrieved Documents
   Select Documents Based on Criteria in Joined Documents
Introduction
MarkLogic is a database capable of storing many types of data, but it also includes a search engine built into the core, complete with an integrated suite of indexes working across multiple data models. This combination allows for a simpler architecture (one software system to deploy, configure, and maintain rather than two), simpler application-level code (application code goes to one resource for query and search, rather than two), and better security (because the search engine has the same security configuration as the database and is updated transactionally whenever data changes).
The recipes in this book, the second of a three-part series, provide guidance on how to solve common search-related problems. Some of the recipes work with older versions of MarkLogic, while others take advantage of newer features in MarkLogic 9.
MarkLogic supports both XQuery and JavaScript as internal languages. Most of the recipes in this book are written in JavaScript, but have corresponding XQuery versions at http://developer.marklogic.com/recipes. JavaScript is very well suited for JSON content, while XQuery is great for XML; both are natively managed inside of MarkLogic.
Recipes are a useful way to distill simple solutions to common problems: copy and paste these into MarkLogic’s Query Console or your source code, and you’ve solved the problem. In choosing recipes for this book, I looked for a couple of factors. First, I wanted problems that occur with some frequency. Some problems in this book are more common than others, but all occur often enough in real-world situations that one of my colleagues wrote down a solution. Second, I selected four recipes that illustrate how to use the new Optic API, to help developers get used to that feature. Finally, some recipes require explanations that provide insight into how to approach programming with MarkLogic.
Developers will get the most value from these recipes and the accompanying discussions after they’ve worked with MarkLogic for at least a few months and built an application or two. If you’re just getting started, I suggest spending some time on MarkLogic University classes first, then coming back to this material.
If you would like to suggest or request a recipe, please write to
on the content and made sure I actually got this done. Thank you to all!
CHAPTER 1 Document Searches
Finding documents is a core feature for searching in MarkLogic. Searches often begin with looking for simple words or phrases. Facets in the user interface, in the form of lists, graphs, or maps, allow users to drill into results. But MarkLogic’s Universal Index also captures the structure of documents.
The recipes in this chapter take advantage of the Universal Index to find documents with a specific root element and to look for documents that are missing some type of structure.
Search by Root Element
Problem
You want to look for documents that have a particular root XML element or JSON property and combine that with other search criteria.
Solution
Applies to MarkLogic versions 7 and higher
(: Return a query that finds documents with
 : the specified root element :)
declare function local:query-root($qname as xs:QName)
{
  let $ns := fn:namespace-uri-from-QName($qname)
  let $prefix := if ($ns eq "") then "" else "pre:"
  return
You can then call it like this:
declare namespace ml = "http://marklogic.com";
It’s easy to find all the documents that have a particular root element or property: use XPath (/ml:base). However, that limits the other search criteria you can use. For instance, you can’t combine a cts:collection-query with XPath. What we need is a way to express /ml:base as a cts:query.
The local:query-root function in the solution returns a cts:term-query that finds the target element as a root. We’re using a bit of trickery to get there (including the fact that cts:term-query is an undocumented function). Let’s dig in a bit deeper to see what’s happening.
We can use xdmp:plan to ask MarkLogic how it will evaluate an XPath expression like this:
declare namespace ml = "http://marklogic.com";
xdmp:plan(/ml:base)
The result looks like this (note that if you run this, the identifiers will be different):
<qry:info-trace>Step 2 is searchable: ml:base</qry:info-trace>
<qry:info-trace>Path is fully searchable.</qry:info-trace>
<qry:info-trace>Gathering constraints.</qry:info-trace>
<qry:info-trace>Executing search.</qry:info-trace>
or properties. Exactly what is recorded depends on the settings you have configured in your database. In each case, the word or structure is mapped to a key.
Take another look at the <final-plan> element; this is the query that MarkLogic will run. We can see that it’s using a term query, and the annotation tells us what it means. A bit of XPath pulls out that index key, which we then use to build a cts:query that we can combine with other queries.
declare namespace qry = "http://marklogic.com/cts/query";
declare namespace ml = "http://marklogic.com";
xdmp:plan(/ml:base)/qry:final-plan//qry:term-query/qry:key
So why are we using xdmp:value? We can run xdmp:plan with an explicit XPath expression, but if we want to work with a dynamic path (provided at runtime), then we can’t build a string and pass it to xdmp:plan. However, we can build a string that includes the reference to xdmp:plan and then pass the whole thing to xdmp:value, which will evaluate it. xdmp:value also accepts bindings, which allow us to use namespaces in the string we pass into xdmp:plan.
I used xdmp:with-namespaces so that the function can be self-contained. Without that, the code would require the qry namespace declaration at the top of the module where the local:query-root function lives.
One more interesting bit: notice $prefix as part of the string passed to xdmp:value. With a QName, there might be a prefix (if constructed with xs:QName) or there might not be (if constructed with fn:QName or if the QName doesn’t use a namespace). To handle all these cases, the recipe assigns whatever namespace is present to the prefix “pre.” However, if the namespace URI is the empty string, then we skip the prefix in the XPath that we send to xdmp:plan.
That last complexity is there because the parameter to the function takes an xs:QName. The function could be written to take a string (like /ml:base), or a namespace and a localname. Requiring an xs:QName lets the caller build the QName using any of the available methods (xs:QName, fn:QName; note that this approach doesn’t create any prefix), but also limits what goes into xdmp:value. Keeping tight control over this data typing is important to prevent code injection.
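The prefix handling and the injection concern described above can be sketched in plain JavaScript. This is an illustration only, not the recipe’s code: buildRootPath and SAFE_LOCAL_NAME are hypothetical names, nothing here calls a MarkLogic API, and the name check is a deliberately conservative ASCII-only rule rather than the full XML name grammar.

```javascript
// Hypothetical helper (not part of the recipe): builds the XPath string
// that would later be handed to an evaluator. Plain JavaScript only.

// Assumption: a conservative, ASCII-only rule for local names. It is
// stricter than the XML NCName rule on purpose.
const SAFE_LOCAL_NAME = /^[A-Za-z_][A-Za-z0-9_.-]*$/;

function buildRootPath(namespaceUri, localName) {
  // Reject anything that could smuggle extra XPath or function calls
  // into the string that is about to be evaluated.
  if (!SAFE_LOCAL_NAME.test(localName)) {
    throw new Error('Unsafe local name: ' + localName);
  }
  // Mirror the recipe: use the fixed prefix "pre" only when a
  // namespace URI is present; otherwise emit no prefix at all.
  const prefix = namespaceUri === '' ? '' : 'pre:';
  return '/' + prefix + localName;
}

console.log(buildRootPath('http://marklogic.com', 'base')); // /pre:base
console.log(buildRootPath('', 'base'));                     // /base

try {
  buildRootPath('', 'base)/xdmp:eval("...")');
} catch (e) {
  console.log('rejected');                                  // rejected
}
```

Accepting a typed value such as an xs:QName achieves the same goal more directly, because the type system rules out malformed names before any string is built.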
See Also
• Documentation: “Understanding Namespaces in XQuery” (XQuery and XSLT Reference Guide)
Find Documents That Are Missing an Element
Problem
You want to find all XML documents that are missing a particular element. This can be used to find documents that have not yet gone through some transformation.
cts:element-query is a useful way to constrain a search to part of a document. The function restricts the nested query to matching within the specified XML element. Without the cts:not-query, this same approach can be used to find documents that do have a particular element, or to find terms that occur within a specific element.
The query passed to cts:element-query is cts:true-query for MarkLogic 8 and later, and cts:and-query(()) for MarkLogic 7 and earlier. cts:true-query does what it sounds like: it matches everything. Passed to cts:element-query, this provides a simple way to test for the existence of an element. If you’re using a version of MarkLogic that predates cts:true-query, the way to simulate it is to use cts:and-query and pass in the empty sequence to it. An and-query matches if all queries passed into it are true; if none are passed in, then it matches, thus making cts:and-query(()) work the same as cts:true-query.
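The reasoning about the empty and-query can be checked with a toy model in plain JavaScript. This is an analogy, not MarkLogic code: queries are modeled as predicate functions, the names trueQuery, andQuery, notQuery, and elementQuery are hypothetical stand-ins, and this elementQuery only inspects top-level properties of a plain object, which is a simplification.

```javascript
// Toy model: a "query" is a function that takes a value and returns
// true or false. None of these names are MarkLogic APIs.
const trueQuery = () => () => true;

// An and-query matches when every nested query matches. With an empty
// list there is nothing that can fail, so it always matches.
const andQuery = (queries) => (value) => queries.every((q) => q(value));

const notQuery = (query) => (value) => !query(value);

// Simplified element query: the named property must exist on the
// object, and the nested query must match its value.
const elementQuery = (name, query) => (doc) =>
  Object.prototype.hasOwnProperty.call(doc, name) && query(doc[name]);

const docs = [
  { uri: '/a.xml', transformed: { by: 'batch-1' } },
  { uri: '/b.xml' },
];

// The empty and-query behaves like the true-query.
console.log(docs.filter(andQuery([])).length);  // 2
console.log(docs.filter(trueQuery()).length);   // 2

// Documents that are missing the "transformed" element.
const missing = docs.filter(notQuery(elementQuery('transformed', andQuery([]))));
console.log(missing.map((d) => d.uri));         // [ '/b.xml' ]
```

The same vacuous-truth rule explains why passing the empty sequence to an and-query is a workable substitute where a true-query is unavailable.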
CHAPTER 2 Scoring Search Results
MarkLogic is a database that contains a powerful search engine. There are advantages to this, such as the fact that data does not need to be replicated to a search engine to provide that functionality, search results are up to date as soon as a transaction completes, and the search is subject to the same security as the database content.
While running a search, MarkLogic assigns a score that accounts for the frequency of your target terms within the database, the frequency of the terms within each document, and the length of the document. For a detailed explanation of how scores are calculated, see “Understanding How Scores and Relevance are Calculated” in the Search Developer’s Guide.
The recipes in this chapter show some tricks to affect the way search results are scored.
Sort Results to Promote Recent Documents
Problem
Show more recent documents higher in a result set than older documents. For instance, when searching blog posts, more recent content is more likely to be current and relevant than older content.
Solution
Applies to MarkLogic versions 8 and higher
With server-side code:
var jsearch = require('/MarkLogic/jsearch.sjs');
preferring recent content, or in finding documents with a geospatial component near a particular point.
In the example above, our content documents have an element called pubdate. If we set up a dateTime index on this element, then we can do range queries. We might use those to limit our results to just content within the last year, but in this case, the goal is just to affect the scoring. As such, the JSearch example performs a <= comparison with the current date and time; we’d expect this to match all documents (note that documents without a pubdate value will fail to match and will drop out of the result set). The current date and time provides an anchor for the comparison; the distance between a document’s pubdate value and the anchor value is fed into the reciprocal score function. This means that the more recent documents will get a boost in score. You may want to adjust the weight parameter to the element range query to tune how much impact recency has.
To use this approach with the REST API, create a range constraint and specify the score-function=reciprocal range option. You’ll need to provide an anchor point with the constraint, for instance
your middle tier, combining it with the user inputs.
The score-function option can be reversed by specifying the linear function. This rewards values that are further away from the anchor value. In the case of pubdate, score-function=linear would favor older documents. This could be useful for a content manager looking for content that needs to be updated.
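The difference between the two score functions can be pictured with a small sketch in plain JavaScript. The formulas below are assumptions chosen only to show the direction of each effect (closer is better versus farther is better); they are not MarkLogic’s actual calculations, which are described in the Search Developer’s Guide. The names distanceInDays, reciprocalShape, and linearShape are hypothetical.

```javascript
// Illustrative shapes only; not MarkLogic's real scoring formulas.
const DAY_MS = 24 * 60 * 60 * 1000;

// Distance between a document's date value and the anchor, in days.
function distanceInDays(anchor, value) {
  return Math.abs(anchor.getTime() - value.getTime()) / DAY_MS;
}

// Reciprocal shape: largest at the anchor, shrinking as distance grows.
function reciprocalShape(days) {
  return 1 / (1 + days);
}

// Linear shape: grows as the value moves away from the anchor.
function linearShape(days) {
  return days;
}

const anchor = new Date('2017-08-01T00:00:00Z'); // stand-in for "now"
const recent = new Date('2017-07-25T00:00:00Z');
const old = new Date('2015-08-01T00:00:00Z');

const recentDays = distanceInDays(anchor, recent);
const oldDays = distanceInDays(anchor, old);

console.log(recentDays); // 7
console.log(reciprocalShape(recentDays) > reciprocalShape(oldDays)); // true
console.log(linearShape(oldDays) > linearShape(recentDays));         // true
```

In this model the reciprocal shape ranks the recent document first and the linear shape ranks the old one first, which matches the behavior described for pubdate.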
See Also
• XQuery version: “Sort results to promote recent documents”
• Documentation: “Range Query Scoring Examples” (Search Developer’s Guide)
Weigh Matches Based on Document Parts
Problem
When doing a text search, some matches are more valuable than others. For instance, if you’re searching for a book, a match in an ISBN field is a sure thing, a match on the title or author is very useful, a match in the abstract is good, and a match in the rest of the text is a normal hit.
Solution
Applies to MarkLogic versions 7 and higher
Part of the challenge of rewarding matches from different parts of a document is determining how to weight them. To do that, start with an easily adjustable query, like this one:
let $text := ("databases")
Here’s an updated query to use the field:
let $text := ("databases")
to see first? Probably the one with the title match, since the title will likely have key terms in it. The summary is a bit bigger, but describes the general purpose of the content. The rest of the content may have lots of terms that are much more broadly related. This is the intuition that drives awarding higher scores to matches in different parts of a document.
MarkLogic’s cts: queries take a weight parameter. The default value is 1.0, but you can set it in a range from 64 to -16. The higher the value, the more points a match earns. Since we’re using cts:element-query, we need to turn on the word-positions and element-word-positions indexes.
The biggest challenge with this scoring is figuring out how to weight the various parts of the document. How much more relevant is a match in the title than a match in the summary? The answer will be application-specific and requires experimentation. Setting up an or-query makes it easy to run a set of experiments.
Once you have settled on the weights, you can simplify the query code by creating a field. The field will include specification of the paths and their relative weights.
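The weighting experiments described above can be prototyped with a toy model before touching the database. The sketch below is plain JavaScript under loud assumptions: a document’s score is simply the sum of the weights of the parts that contain the term, which ignores term frequency and document length; the part names (title, summary, content), the weight values, and the functions score and rank are all hypothetical.

```javascript
// Toy scoring model, not MarkLogic's relevance calculation: add up the
// weight of each document part that contains the search term.
const weights = { title: 16, summary: 4, content: 1 }; // hypothetical values

function score(doc, term, partWeights) {
  const needle = term.toLowerCase();
  return Object.keys(partWeights).reduce((total, part) => {
    const text = (doc[part] || '').toLowerCase();
    return text.includes(needle) ? total + partWeights[part] : total;
  }, 0);
}

// Score every document, drop non-matches, and sort best first.
function rank(docs, term, partWeights) {
  return docs
    .map((doc) => ({ uri: doc.uri, score: score(doc, term, partWeights) }))
    .filter((result) => result.score > 0)
    .sort((a, b) => b.score - a.score);
}

const docs = [
  { uri: '/content-hit.xml', title: 'Cooking Basics', summary: 'Simple meals', content: 'Recipe databases are handy.' },
  { uri: '/title-hit.xml', title: 'Databases for Everyone', summary: 'An overview', content: 'Storage and retrieval.' },
  { uri: '/summary-hit.xml', title: 'Field Notes', summary: 'A look at databases', content: 'Assorted observations.' },
  { uri: '/no-hit.xml', title: 'Gardening', summary: 'Plants', content: 'Soil and water.' },
];

const results = rank(docs, 'databases', weights);
console.log(results.map((r) => r.uri));
// [ '/title-hit.xml', '/summary-hit.xml', '/content-hit.xml' ]
```

Keeping the weights in one object makes each experiment a one-line change, which is the same convenience the adjustable or-query offers.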
CHAPTER 3 Understanding Your Data and How It Gets Used
MarkLogic provides a platform for storing large amounts of heterogeneous data. Understanding what a database holds and how your users interact with it is key to improving the content over time. The first recipe in this chapter shows how to log the searches that your users are running. Based on this information, you can discover gaps in your content or see what provides the best draw to your application. The second recipe analyzes how content is divided among directories, which are likely used to contain logical or physical segments of your data.
Logging Search Requests
Problem
Record searches run by users, in order to build a recommendation system, understand user needs, or determine what type of content to add. The goal is to record more information than the access logs would provide, and perhaps to associate it with user profiles.
Solution
Applies to MarkLogic versions 7 and higher
There are a variety of ways to implement your search feature. If you are using XQuery or JavaScript main modules to provide this