Search-Driven Business Analytics
Designing a New Search Engine for Data
Andy Oram
Search-Driven Business Analytics
by Andy Oram
Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Shannon Cutt
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
August 2015: First Edition
Revision History for the First Edition
2015-09-02: First Release
2015-10-20: Second Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Search-Driven Business Analytics, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents
Search-Driven Business Analytics
A New Generation of Vendors Offering Interactive Visualizations
Data Access Methods Are Being Transformed by Search
Getting Insights from Diverse Data
Interpreting User Input
Translating Queries into Answers
Validating Answers
Creating the Simplicity of a Search-Like Query
Creating Instant Visualizations
Sharing Answers and Visualizations
Bringing Search-Driven Analytics to the Masses
Search-Driven Business Analytics
We are all accustomed to instant results from the major web search engines. However, when we pull up a business intelligence (BI) product at work, the situation is quite different. In comparison to the Internet services that we use every day, these products seem stiff and unresponsive. Business leaders are served pre-built reports and dashboards put together by their BI teams, and they wait days or weeks to get reports on new inquiries about customers, products, or markets. Thus, when a business manager moves from Facebook, Amazon.com, or Google to her BI tool, it feels like time travel back to a different century.
This report examines what it takes to make business intelligence as simple and responsive as today’s consumer search engines, where the user gets answers and visualizations as quickly as questions come to mind.
We’ll look at:
• The convergence of BI and search
• What a search-driven user experience looks like
• The intelligence required for analytical search
• Data sources and their associated data modeling requirements
• Turning on-the-fly calculations into visualizations
• Applying enterprise scale and security to search
The techniques described here are general and draw on well-established practices in the field. The main reference platform for this report is the ThoughtSpot Analytical Search Appliance. The author will also incorporate information gleaned from discussions with technical staff from Microsoft’s Power BI service and from Adatao, a firm that offers collaborative and predictive analytics.
A New Generation of Vendors Offering Interactive Visualizations
ThoughtSpot’s Analytical Search engine allows users to ask ad-hoc questions of their data through a search interface. The engine computes results on the fly based on the search query, and offers visualizations of interest to the user. It features an interactive interface that allows you to search through billions of rows and compute results on the fly from any data source.
Figure 1. Data display in ThoughtSpot
Microsoft’s Power BI service lets you quickly create dashboards, share reports, and directly connect to (and incorporate) all the data available within the organization, through partners, or publicly posted to the Internet. Power BI Desktop enables you to transform data and create reports and visualizations. Figure 2 shows a typical dashboard created in the Desktop.
Figure 2. Dashboard produced by Microsoft Power BI
Adatao takes a problem-solving approach to all data, big and small, where the user starts with a hypothesis and pulls answers out of data sources to validate or invalidate the hypothesis. Figure 3 shows typical output from Adatao, known as a narrative, which enables data discovery and presentation in the form of attractive visualizations.
Figure 3. Narrative produced by Adatao
Data Access Methods Are Being Transformed by Search
So how have these new-generation technologies transformed data interaction for the business user? An enlightening analogy can be drawn between the way managers use BI today and how information access on the Internet has evolved.
Typically, a manager at a data-rich company has access to certain canned business reports. The managers have generated a list of business questions, such as “a chart showing the product revenue from each store, to compare same-store sales year-by-year,” and a programmer has dutifully coded up an analytics application to provide those answers. If the business managers want a different report containing metrics and relationships not provided ahead of time, a recoding effort is involved. This severely limits the data analysis systems, leaving them unresponsive to intuitive questioning by the business managers. The systems and humans are operating at very different paces in this world of old-generation BI software.
To draw an analogy to the evolution of the Internet, this is similar to the sites that curated content for users more than a decade ago. Users would subscribe to forums to find out what was new. Hot products like Encarta (introduced by Microsoft in the early 1990s when the Web was quite young) provided predetermined sets of information in an encyclopedia format. Getting access to these resources was much easier than pacing through the card catalog of one’s local library, but they opened access only to a limited set of information chosen by the site. Existing BI reports resemble these offerings in their inelasticity and their lack of the real-time interactivity that business users need.
The advent of the AltaVista search engine, and subsequently Google, transformed information access. The search engines didn’t add a jot to the information already available. But they radically broadened the sites to which we had access, and put us only a few seconds and a few clicks away from the wealth of information and opinions on the Web. Immediate options are now taken for granted as we search an online bookseller for books, a travel site for hotels and airline tickets, etc. Within minutes we sample a mind-boggling range of opinions from around the world, whether the subject is the best data store for fast-moving input or the latest sports news.
What does it take to bring the same kind of instant feedback and broad searchability to business intelligence? Some requirements include:
Real-time interactivity
When you start typing “flowers” into a modern search engine such as Google or Bing, it anticipates what you want and suggests popular completions, such as “flowers online” and “flowers for algernon” (a popular book and movie title). Typing “restaurants” will probably offer you local results. Similarly, a BI solution should instantly fashion charts or other answers while you are typing, predicting what you want based on its knowledge of previous queries and the data sets themselves. It should get better over time as it learns more about what each user wants and offer more relevant suggestions.
A single, accurate answer
Unlike web search engines that can return multiple results in relevance-ranked order, the BI interface should return just what the user asked for, leaving out extraneous results. Ideally, when the user wants a simple answer such as “revenue for California last year,” the interface should return a single figure instead of a table of values the user has to interpret, or a list of links to past reports or dashboards for the user to sift through to find the answer.
Diverse data sets
The BI solution should be able to use structured data throughout the organization, from many different databases and even more informal sources such as spreadsheets. All these sources should be combined smoothly, and the solution should recognize relationships among the columns of databases so that it can combine this data in visualizations and other results.
... many columns of many tables and still return results in real time.
... in using their corporate credentials.
Administrators should be able to set up security for individual users or for groups, controlling access at the level of a saved dashboard or chart, a column (such as a column in an HR table that has compensation data), or a row (customer information for the West Coast might be hidden from a sales rep on the East Coast, for example).
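The row-level and column-level controls described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how ThoughtSpot or any other product enforces security: the `Policy` class, the `apply_policy` function, and the sample customer data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical per-user access policy (not any vendor's API)."""
    hidden_columns: set = field(default_factory=set)
    # Maps a column name to the set of values this user may see.
    row_filters: dict = field(default_factory=dict)


def apply_policy(rows, policy):
    """Return only the rows and columns that the policy permits."""
    visible = []
    for row in rows:
        # Row-level check: each filtered column must hold an allowed value.
        if any(row.get(col) not in allowed
               for col, allowed in policy.row_filters.items()):
            continue
        # Column-level check: drop hidden columns from the surviving row.
        visible.append({k: v for k, v in row.items()
                        if k not in policy.hidden_columns})
    return visible


# Hypothetical sample data.
customers = [
    {"name": "Acme", "region": "West", "credit_limit": 50000},
    {"name": "Bolt", "region": "East", "credit_limit": 20000},
]
east_rep = Policy(hidden_columns={"credit_limit"},
                  row_filters={"region": {"East"}})
print(apply_policy(customers, east_rep))
```

In this sketch the East Coast rep sees only East Coast customers and never sees the `credit_limit` column. A production system would more plausibly push these rules into the query plan rather than filter results afterward.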
How does a BI solution like this change the way we do business? How does the reduction in response time for a query, from days to seconds, lead to a higher top line and lower costs?
Instead of waiting to see past performance of sales, the general manager of a business unit can see real-time sales performance and make inventory allocation decisions based on real-time demand. Business processes are being disrupted as transformations that once had to be pre-calculated become possible on demand.
The impact becomes even greater as interfaces are able to anticipate what a user wants and bring into sharp focus ideas that are just emerging. This anticipation can be based on previous queries. For instance, if someone searches for information on California, the interface would check its cached queries and notice similar searches for information on New York, then suggest a related result. Everyone has a unique approach to asking questions, so personalizing the suggestions makes the experience a lot more relevant and user-friendly. The interface can also look at the data itself: for instance, in each column the interface anticipates that the user is likely to request the values that are most commonly found there.
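One simple way to picture this kind of anticipation is a prefix completer that ranks candidates by how often they have appeared, both in past queries and in a column's values. The sketch below is an illustration only, not the ranking method of any product discussed here. The `suggest` function and the sample data are hypothetical.

```python
from collections import Counter


def suggest(prefix, past_queries, column_values, limit=3):
    """Rank completions for `prefix` by how often each candidate was seen."""
    counts = Counter(q.lower() for q in past_queries)
    counts.update(v.lower() for v in column_values)
    prefix = prefix.lower()
    matches = [(text, n) for text, n in counts.items()
               if text.startswith(prefix)]
    # Most frequent first; ties are broken alphabetically for stable output.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [text for text, _ in matches[:limit]]


# Hypothetical query history and column contents.
past_queries = ["revenue california", "revenue new york", "revenue california"]
state_column = ["California", "California", "New York", "Nevada"]

print(suggest("rev", past_queries, state_column))
print(suggest("n", past_queries, state_column))
```

Personalization, as described above, could be approximated by keeping a separate query history per user. A real engine would also need an index structure faster than this linear scan.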
Getting Insights from Diverse Data
Enterprises’ data sources come in several flavors:
• Data warehouses often store tens or hundreds of terabytes of historical data in relational tables accessed through SQL.
• Applications, both on-premise and in the cloud, produce results that can be input into BI. Recent years have seen a notable increase in cloud enterprise applications offered by vendors such as Salesforce and NetSuite.
• The ubiquitous spreadsheets, scattered across desktops and laptops throughout the enterprise, that individuals use to analyze subsets of data.
• With the increasing spread of Hadoop, Spark, and other “big data” technologies within the enterprise, data sources with relatively loose document formats are becoming an important category as well.
The more sources of data a search engine can handle, the more useful it becomes, not only because more of the organization’s data is searchable, but because the different sources can work together and add extra meaning. However, one of the most time-consuming problems faced by BI analysts is the integration of multiple data sources, especially non-relational data. A search-driven interface can help with this by offering a visual and easy way for analysts to discover bad or stale data and exclude it from the scope of data that’s visible to business users.
Therefore, integrating sources and indexing their content for quick retrieval is the key initial task for interactive BI and analytics. The ThoughtSpot Analytical Search Appliance uses a variety of interfaces to integrate data from various sources:
• Data is loaded from data marts or data warehouses through the enterprise’s chosen ETL tools, and through a JDBC/ODBC interface that can be used to connect data sources directly to ThoughtSpot. Data can also be loaded directly into ThoughtSpot through bulk data load scripts. These are highly efficient, loading the data at multi-terabyte-per-hour speeds in a scale-out fashion across all the nodes.
• For cloud data sources, in addition to the above options, ThoughtSpot has partnered with vendors to use their individual products, such as Informatica’s Cloud Connector, to load data.
• Spreadsheets can be uploaded by individual users through an interface in the product that guides the user through the process. As part of that workflow, the user can also specify whether she wishes to link a column from this spreadsheet to any other column present in the system, so that she can analyze local data present on her computer against company-wide data from the data warehouse.
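Linking a spreadsheet column to a column already in the system amounts to a join on a shared key. The following sketch shows the idea with Python's standard library only. It is not ThoughtSpot's upload workflow: the `link` function, the `store_id` key, and all sample values are hypothetical.

```python
import csv
import io

# Hypothetical spreadsheet export: local sales targets keyed by store_id.
spreadsheet = io.StringIO("store_id,target\n101,5000\n102,7000\n")

# Hypothetical company-wide rows, standing in for a warehouse table.
warehouse_sales = [
    {"store_id": "101", "revenue": 5400},
    {"store_id": "102", "revenue": 6100},
    {"store_id": "103", "revenue": 8000},
]


def link(local_rows, shared_rows, key):
    """Inner-join two lists of rows on a shared column name."""
    by_key = {row[key]: row for row in local_rows}
    return [{**by_key[row[key]], **row}
            for row in shared_rows if row[key] in by_key]


targets = list(csv.DictReader(spreadsheet))
for row in link(targets, warehouse_sales, "store_id"):
    # Compare company-wide revenue against the local target.
    print(row["store_id"], row["revenue"] - int(row["target"]))
```

Store 103 drops out of the result because the spreadsheet has no target for it, which is the usual behavior of an inner join. Whether a real product keeps or drops unmatched rows is a design choice this sketch does not settle.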
ThoughtSpot understands the underlying schema and the relationships within your data when you load it, so as soon as the data is loaded, it is ready to be searched without any additional modeling work. The system also works across any time granularity (weekly, quarterly, or yearly) without requiring the BI team to build new aggregate tables, OLAP cubes, or materialized views. This helps business users to start using the system as soon as the IT/BI team has loaded data into it. And as the user types queries that connect multiple tables together, the multiple join path choices are all handled under the hood, so the user does not have to know any SQL terminology to connect diverse data sets together and complete her query. ThoughtSpot is able to provide sub-second response times for searches over billions of rows of data because of its purpose-built, in-memory relational cache. This cache understands search semantics and security rules, as well as query plans, and is able to scale out across hundreds of nodes.
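To make the point about time granularity concrete, the sketch below computes weekly, quarterly, or yearly totals directly from raw rows at query time, with no pre-built aggregate table. It is a toy illustration under stated assumptions and says nothing about how ThoughtSpot's in-memory cache works internally. The `bucket` and `total_by` functions and the sample sales rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def bucket(day, granularity):
    """Map a date to a bucket label for the requested granularity."""
    if granularity == "yearly":
        return str(day.year)
    if granularity == "quarterly":
        return f"{day.year}-Q{(day.month - 1) // 3 + 1}"
    if granularity == "weekly":
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    raise ValueError(f"unsupported granularity: {granularity}")


def total_by(rows, granularity):
    """Sum amounts per bucket, computed on the fly from raw rows."""
    totals = defaultdict(float)
    for day, amount in rows:
        totals[bucket(day, granularity)] += amount
    return dict(totals)


# Hypothetical raw sales rows: (date, amount).
sales = [
    (date(2015, 1, 5), 100.0),
    (date(2015, 2, 10), 50.0),
    (date(2015, 7, 1), 75.0),
]
print(total_by(sales, "quarterly"))
print(total_by(sales, "yearly"))
```

The same raw rows answer every granularity, so adding a new one means adding a branch to `bucket` rather than building and maintaining another aggregate table.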
Once the data is loaded, ThoughtSpot creates an index to maximize the speed of queries. For data volumes in terabytes, the index needs to be efficiently sharded and distributed across multiple nodes without compromising on search latency. The creation of the index itself must be distributed so that there is minimal delay between when new data shows up in the system and when it is ready to be searched.
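The idea of a sharded index can be sketched as an inverted index whose entries are spread across shards by hashing each token. This toy version keeps every shard in one process and partitions by term, which is only one of several possible schemes (a system could instead partition by row). It is an assumption-laden illustration, not ThoughtSpot's index design, and the `ShardedIndex` class is hypothetical.

```python
import zlib
from collections import defaultdict


class ShardedIndex:
    """Toy inverted index split across shards by a hash of each token."""

    def __init__(self, num_shards):
        # In a real deployment each shard would live on a separate node.
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def _shard_for(self, token):
        # crc32 is stable across runs, unlike the built-in hash() for str.
        return self.shards[zlib.crc32(token.encode("utf-8")) % len(self.shards)]

    def add(self, row_id, text):
        """Index every whitespace-separated token of `text` under `row_id`."""
        for token in text.lower().split():
            self._shard_for(token)[token].add(row_id)

    def search(self, query):
        """Return the sorted ids of rows containing every query token."""
        result = None
        for token in query.lower().split():
            ids = self._shard_for(token).get(token, set())
            result = ids if result is None else result & ids
        return sorted(result or [])


index = ShardedIndex(num_shards=4)
index.add(1, "California revenue")
index.add(2, "New York revenue")
print(index.search("revenue"))
print(index.search("california revenue"))
```

Because each token maps to exactly one shard, a lookup touches only the shards that hold the query's tokens, and new rows can be indexed shard by shard without rebuilding the whole index.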
Microsoft’s Power BI features integration with external tools, both from Microsoft and from partners such as Salesforce and Zendesk. The Power BI interface helps the user find these resources (databases, spreadsheets, Hadoop data stores, even social media sites) and connect to them. A relational database provides its own schema, whereas Power BI creates the schema for a spreadsheet, normally using the first row as column names. Figure 4 shows an entity-