Delivering Embedded Analytics in Modern Applications
A Product Manager’s Guide to Integrating Contextual Analytics
Federico Castanedo and Andy Oram
Delivering Embedded Analytics in Modern Applications
by Federico Castanedo and Andy Oram
Copyright © 2017 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: Nicholas Adams
Copyeditor: Rachel Monaghan
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
May 2017: First Edition
Revision History for the First Edition
2017-04-25: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Delivering Embedded Analytics in Modern Applications, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents
1. Delivering Embedded Analytics in Modern Applications
Overview of Trends Driving Embedded Analytics
The Impact of Trends on Embedded Analytics
Modern Applications of Big Data
Considerations for Embedding Visual Analytics into Modern Data Environments
Deep Dive: Visualizations
Conclusion
A Self-Assessment Rating Chart
in business applications. With embedded analytics, organizations leverage vendors’ domain expertise to provide analytics guidance for the application users.
The trade press has recently focused on helping businesses become “data-driven organizations.” As the result of a study, the Harvard Business Review bluntly announced, “The more companies characterized themselves as data-driven, the better they performed on objective measures of financial and operational results” and “The evidence is clear: data-driven decisions tend to be better decisions.” The article follows up with specific examples. InfoWorld stresses speed, automation of data collection, and independent thinking among staff. The barriers to becoming data-driven are the focus of a recent Forbes article; ways forward include a compelling justification for the data and a central organization capable of handling it, adding up to a “data-driven culture.” McKinsey (which helped conduct the HBR study) also stresses organizational transformation, which involves learning analytical skills and integrating data in decision making.
The trends and research all point to a single conclusion: organizations, customers, and employees seek a data-driven environment, and they value applications that make it easier to make sense of all the data available for decision making. This report examines the architecture and characteristics that allow software vendors and developers to meet this need and increase the value of their applications by embedding analytics.
What are the basic requirements of embedded analytics on a modern data platform? Essentially, to provide speed-of-thought interaction with powerful visualizations that help knowledge workers take action on data. Although data comes from multiple sources (some of it historical, some of it streaming in real time) and is stored with various technologies for cost efficiency, the visual platform must manage all of these complexities and deliver visual analytics in a wide variety of formats and devices. The following list summarizes the requirements of an embedded analytics tool on a modern data platform:
• Integrate with the modern data architecture by accepting data from a wide variety of input sources, where each may have different data types.
• Process large amounts of data quickly, and respond to interactive requests within seconds.
• Embed easily into web pages or other media browsers, including applications created by third-party developers.
• Scale up automatically and have the ability to process streaming data.
• Adhere to security restrictions, so users see only the data to which they are supposed to have access.
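The last requirement in the list, restricting each user to the data they are entitled to see, is commonly handled by filtering rows before anything reaches the visualization layer. The following is only a minimal sketch of that idea, not any vendor's implementation; the `Row` fields, the `ALLOWED_REGIONS` table, and the user names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    region: str
    revenue: float


# Hypothetical entitlement table: the regions each user may see.
ALLOWED_REGIONS = {
    "alice": {"east", "west"},
    "bob": {"east"},
}


def visible_rows(user: str, rows: list[Row]) -> list[Row]:
    """Return only the rows whose region the user is entitled to see.

    Users missing from the entitlement table see nothing (deny by default).
    """
    allowed = ALLOWED_REGIONS.get(user, set())
    return [row for row in rows if row.region in allowed]


# Hypothetical sample data.
ROWS = [Row("east", 120.0), Row("west", 80.0), Row("north", 45.0)]

for user in ("alice", "bob", "carol"):
    print(user, visible_rows(user, ROWS))
```

Applying the filter in the data layer, rather than hiding chart elements in the browser, means that an embedded dashboard never receives rows the user should not see.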
To begin, let’s examine in more detail the trends driving embedded analytics.
Overview of Trends Driving Embedded Analytics
For application vendors, embedding modern analytics is an opportunity to provide added value, increasing customer stickiness and revenue. For organizations, it is an opportunity to leap forward in their data-driven initiatives. These initiatives are critical for competitive advantage and rely heavily on the following trends:
Speed of change (velocity)
In business, finance, and technology, trends that used to occur over weeks or months now take place within minutes and must be responded to as fast as possible. Until recently, companies could get by with checking data every few months and changing their strategies a couple of times a year. But now, consumers and clients can get news within minutes of it happening, and change their preferences based on what’s posted to the internet.
Rollouts of new products are traditionally spaced across seasons (spring, summer, etc.). But nowadays, a trend can catch fire in a matter of days, thanks to the speed at which information spreads through social media. Bad news travels even faster: if there is a problem, such as when the Galaxy Note 7 phone started to catch fire, markets shift instantly.
As the pace of change increases, having the right data and analyzing it quickly to make decisions is key to the organization’s ability to switch directions much more rapidly.
Knowledge workers
Although the term knowledge work first appeared in 1959, at that time the people whose decisions were crucial to organizational success were stuck at their desks, perhaps in isolated offices with nothing but a telephone and a pile of journals or stacks of mainframe printouts connecting them to information. It took the arrival of the internet era to provide knowledge workers with continuous streams of data about the outside world, a resource particularly exploited by young people in the workforce. These people know just as much about what is happening among clients, suppliers, and competitors as what is happening within their own organization. And they are speaking up to demand access at speed-of-thought response times: seconds, not minutes or hours.
Availability of data
Sources of data that were unimaginable to earlier generations are now commonly used. Today companies can enhance their internal data with social, weather, credit history, and census data, along with a wide range of other readily available data sets. For example, these data sets may include the dutifully logged behavior of web visitors, real-time updates on inventory and sales in retail stores, and the terabytes of data streaming in from sensors in the field.
While data may be more available, availability is only the first step in making data useful and impactful to the organization. Data becomes relevant only when it is utilized, that is, when people act or make decisions based on it. The challenge of making data useful to drive actions is even harder with so much data available. One reason for this is that it is necessary to combine multiple sources of information (such as customer interactions, real-time transactions, social communication, and location data) to obtain insights that were not available before. Another reason is that understanding the data enough to explore it requires domain expertise, and users may easily get lost when dealing with large amounts of data. Free exploration can be a gift or a waste of time. Finally, since some data may be sensitive, it is also important to restrict access to only authorized users.
Lowering costs
Decreasing costs of technology and its components (especially storage and hardware) allow organizations to do more with less. Organizations have typically archived data to tape storage, which is fine for emergency recovery from system failures, but does not allow instant access or fine-tuned queries. The need for massive storage due to the large amounts of data collected, and the falling cost of disk storage, means organizations are now keeping billions of records within reach, contributing to the data explosion.
As memory also shrinks in size, gets cheaper, and is distributed across clusters of commodity hardware, calculations that used to suffer from slow disk access can be carried out in primary memory. Massive data sets are now stored in memory, allowing lightning-fast random access. Analytical tools can also run in-memory on clusters of low-cost servers to process real-time streams in a timely manner.
Cloud/SaaS platforms allow organizations to further lower costs while also benefiting from flexibility and the agility to scale. For instance, if a product line takes off or a sudden event brings widespread attention to an organization, it needs to quickly respond to the change by ramping up the resources for a service and scaling accordingly. Cloud platforms are selected for their cost effectiveness as well as the option to offload the administrative hassles of reliability, security, and constant growth to a third party.
More and more startups turn to cloud services exclusively as a strategy to access large amounts of compute and storage resources without incurring high upfront hardware costs. Even some established companies like Netflix have taken operations to the cloud to increase their computing power and flexibility.
Customer 360 insights
Customer 360 has become a popular term to describe an ideal situation where an organization gains competitive advantages by combining all the information available about the customer, delivering a holistic view that is accessible to all parts of the organization: sales, finance, services, marketing, and so on. Nowadays, customers are tracked through multiple aspects of their daily lives. Much of their data is surrendered voluntarily through channels such as customer loyalty cards and social media “likes.” The need to analyze customer interactions holistically is known as customer journey analytics, and is based on collecting information about every step of the customer’s experience from first contact through purchase, and determining from analytics how to make the process more likely to lead to a sale. Customer journey analytics is very important to online channels where advertisers need to decide their investment strategies based on where the sales are coming from. Organizations that use this data at key points of interaction with customers (for example, next best recommendations such as “What movie do we recommend next?”) have a clear competitive advantage fueling their growth.
Technological advances
In the past 10 to 15 years we have seen a wide range of new tools (many of them open source) attempting to meet the needs of data-driven organizations. The rapid change in the technology to collect and process data at massive scale has become a trend on its own.
New types of databases have revolutionized the field, which continues to see significant change. These tools include databases (such as Kudu, HBase, Cassandra, MongoDB, and other products loosely known as NoSQL), data processing frameworks (such as Hadoop and Spark), query engines (such as Apache Impala, Apache Hive, and Presto), stream processing tools (such as Storm, Kafka, Flume, Apache NiFi, Amazon Kinesis, and Apex), and text indexing (such as Elasticsearch, Solr, and Cloudera Search).
In addition, access to data may be facilitated by cloud providers, which centralize and feed data to analytical tools. Cloud providers have optimized much of the underlying infrastructure required for storing and querying Big Data, by offering scalable and cost-effective datastores such as Amazon’s Redshift, Google’s BigQuery and Cloud Spanner, and Microsoft’s HDInsight. Both the quick pace at which data requirements change and the pressure to lower costs drive organizations to experiment with these new technologies and adopt them at a higher rate than previously seen.
The Impact of Trends on Embedded Analytics
The preceding trends are driving embedded analytics. For instance, to meet knowledge workers’ demand for data at the speed of thought, organizations have adopted visual analytics platforms. Business users have grown accustomed to dashboards, reports, and free exploration of data sets. But as data grows in complexity and size, business users require more guidance in order to turn Big Data into insights with ease.
Embedded analytics is the use of reporting and analytic capabilities within business applications. It is a technology designed to make data analysis and business intelligence (BI) more accessible to users. Embedded analytics is easily accessible from inside the application’s workflow without forcing users to switch to a separate standalone BI tool.
Because not everyone in the organization can become astute with data, the use of embedded analytics provides guidelines to help business users understand data more quickly. With the complexity of data (as discussed earlier), software application vendors are in a unique position to help organizations, today more than ever, by adding the guidelines that help business users explore data more efficiently. They do so by infusing their domain expertise and packaging the analytics most relevant for the particular business processes managed by the application.
For example, advanced visualization frameworks such as Zoomdata can embed visual analytics into any application connected to modern technologies. With such advanced frameworks, knowledge workers can turn data into actions in any environment, from web browsers to touch-oriented mobile devices, by interacting with intuitive visualizations and dashboards.
In the next section we will show use cases of modern applications using Big Data.
Modern Applications of Big Data
Modern decision making rests on a phenomenon popularly known as Big Data, defined by Wikipedia as data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating, and information privacy. Scientists, business executives, practitioners of medicine, advertisers, and governments alike regularly encounter difficulties with large data sets in areas including internet search, finance, and urban and business informatics.
Data sets grow rapidly, in part because they are increasingly gathered by cheap and numerous information-sensing mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks.
Big Data “size” is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. According to research firm Gartner, “Big Data represents the information assets characterized by such a high volume, velocity, and variety to require specific technology and analytical methods for its transformation into value.” It requires a set of techniques and technologies with new forms of integration to reveal insights from data sets that are diverse, complex, and of a massive scale.
Following on from Gartner’s definition, Big Data applications are usually defined by the three Vs: the amount of data they use (volume), the velocity of data, and the variety of input data being used.
Having a huge amount of data is inevitable when you have a lot of different possible combinations for a specific task, for instance, when you have a lot of independent users/customers/interactions and you want to make decisions about them. In this scenario, if you take any random pair of users/customers they will most likely be very different, but having enough users and enough data about them allows the system to extract useful similarities.
Having a lot of data is only valuable when you can easily ask good questions and retrieve insight from the data. Most of the time, visualizing data (presenting it in a pictorial or graphical format) is the quickest way of interacting with it in order to answer these questions. In fact, visualizing data is the missing link between collecting data and understanding it. With visual access, users can digest huge amounts of data easily. As an example of how important visualization is for drawing the correct conclusions about data, consider Anscombe’s famous data set, depicted in Figure 1-1.
Figure 1-1. The four data sets defined by Francis Anscombe. Source: Francis J. Anscombe, “Graphs in Statistical Analysis,” American Statistician 27 (1973): 17–21.
Anscombe’s data set comprises four different sets that have nearly identical mean, variance, correlation, and linear regression, yet appear very different when graphed. This example demonstrates the added value of visualizing data rather than just analyzing it. The problem is even more challenging when you need to analyze massive amounts of data that arrive in a continuous stream. Proper visualization reduces the risk of making bad decisions.
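To see why summary statistics alone can mislead, the near-identical numbers behind Anscombe's four sets can be reproduced directly. The sketch below uses the values published in Anscombe's 1973 paper and only the Python standard library; the helper name `summarize` is ours, introduced for illustration.

```python
from statistics import mean

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (mean_x, mean_y, pearson_r, slope, intercept) for one data set."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = sxy / (sxx * syy) ** 0.5       # Pearson correlation coefficient
    slope = sxy / sxx                  # least-squares regression slope
    intercept = my - slope * mx        # least-squares regression intercept
    return mx, my, r, slope, intercept


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, (mx, my, r, slope, intercept) in SUMMARIES.items():
    print(f"{name:>3}: mean_x={mx:.2f} mean_y={my:.2f} r={r:.3f} "
          f"fit: y = {intercept:.2f} + {slope:.3f}x")
```

All four sets report a mean of x of 9, a mean of y of about 7.50, a correlation of about 0.82, and a fitted line close to y = 3 + 0.5x, so only a plot reveals that one set is curved, one has a single outlier, and one has all but one point at the same x value.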
Acquiring interesting data itself is very difficult; that’s why companies commonly buy other companies just for their customer databases and information. Microsoft took the practice to a new level of data intensity when it paid a 50% premium on LinkedIn’s stock to purchase the company. The asset driving this bounty, observers noted, was LinkedIn’s vast database of clients with detailed professional information. Not only were these clients a useful resource in themselves, they were also raw input to Microsoft’s machine learning algorithms.
Industries and Use Cases Ripe for Modern Data-Driven Solutions
Let’s take a look at the pressures you are likely to face in your organization, by examining a couple of companies with especially huge requirements for Big Data. Then, we will explore several use cases for modern data-driven solutions.
Retail: The TJX Corporation
There is no doubt that data is important to the TJX corporation, which runs TJ Maxx and many other stores. Here we’ll speculate a bit about data’s relevance. TJX is perhaps the fastest-growing retailer in the US. You probably don’t have the size and market pressures that TJX faces now, but you might aspire to that status.
TJ Maxx and related stores are positioned to be low-priced in a crowded market, and they depend on quick turnaround and efficiency to thrive. They provide a sterling example of responsiveness to consumer needs in the modern era.
Let’s try to estimate the volume and velocity of data handled by TJX.
The corporation reached 3,395 stores in 2015. Because it added 176 stores in 2014, we’ll use the figure of 3,219 stores to estimate its data requirements in 2014.
An insightful story in Fortune suggests that TJX shipped 2 billion units in 2014. That’s a lot of volume. To indicate velocity, consider the statement “Former employees say that the stuff moves so rapidly that merchandise is often sold before TJX has paid its vendors for it.” TJX’s ability to make quick decisions is evident in another quote from the article: “Insiders say the ability to contract on the fly with manufacturers lets TJX offer customers at least some merchandise in a hot fashion trend (say, crop tops or slide sandals) when it can’t get enough brand-name supply.”
Starting with that crude estimate of 2 billion units per year, we arrive at an average of about 5.5 million units shipped each day, or 1,702 units per store per day. Let’s assume, to keep things simple, that a store is open from 8 AM to 8 PM, for a 12-hour day. With that assumption, TJX ships 7,610 units per minute, or about 2 units per minute per store.
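The back-of-the-envelope estimate above can be reproduced in a few lines. This sketch simply restates the figures from the text (2 billion units per year, 3,219 stores, a 12-hour day); the 365-day year and the variable names are our own assumptions.

```python
UNITS_PER_YEAR = 2_000_000_000   # Fortune's figure for 2014, as quoted above
STORES = 3_219                   # store count used in the text for 2014
DAYS_PER_YEAR = 365              # assumption: stores ship every day of the year
MINUTES_OPEN_PER_DAY = 12 * 60   # assumption from the text: open 8 AM to 8 PM

units_per_day = UNITS_PER_YEAR / DAYS_PER_YEAR
units_per_store_per_day = units_per_day / STORES
units_per_minute = units_per_day / MINUTES_OPEN_PER_DAY
units_per_store_per_minute = units_per_store_per_day / MINUTES_OPEN_PER_DAY

print(f"Units per day:              {units_per_day:,.0f}")
print(f"Units per store per day:    {units_per_store_per_day:,.0f}")
print(f"Units per minute:           {units_per_minute:,.0f}")
print(f"Units per store per minute: {units_per_store_per_minute:.2f}")
```

The per-store rate works out to a little over 2 units per minute, which the text rounds down to 2.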
What about the third V, variety? TJX succeeds “by selling blouses…pots and pans…and bedding, sunglasses, sriracha seasoning, yoga mats, and the occasional $1,250 Stella McCartney dress.” The range of sales between a $20 belt (a usual TJ Maxx purchase) and a $1,250 dress could challenge any decision maker.
By doing analytics on products, store reviews, or call center logs, what could a TJX executive do with this data? Suppose she wants to judge the impact of selling that Stella McCartney dress. How does she compare it to other dresses sold by her stores at the same time? She has to filter sales information to find all dresses sold, limiting data to dresses with a certain price comparable to the Stella McCartney dress. She may want to get a report for each individual store, or group the stores by zip code to get a sense of the dress buyers’ demographics. Thus, queries by managers will home in on types of merchandise, price ranges, and time intervals. A manager may want a chart showing sales of particular items or classes of items at holiday season each year. She may also want to compare sales at stores that offered the Stella McCartney dress at a reduced price versus those that did not. During a sale, the manager may want to view real-time data about particular types of merchandise.
Thus, both historical data and real-time data should be available for processing, with support for filtering multiple dimensions (time, price, etc.) and for comparing the data along multiple dimensions.
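A query of the kind described above (dresses in a comparable price band, during a given period, grouped by store zip code) can be sketched as a filter over several dimensions followed by a grouping step. Everything in this example is hypothetical: the `Sale` record, its field names, and the sample data are invented for illustration and do not describe TJX's systems.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sale:
    store_zip: str
    category: str
    price: float
    sold_on: date


def sales_by_zip(sales, category, min_price, max_price, start, end):
    """Total sales value per zip code for one category, price band, and date range."""
    totals = defaultdict(float)
    for sale in sales:
        if (sale.category == category
                and min_price <= sale.price <= max_price
                and start <= sale.sold_on <= end):
            totals[sale.store_zip] += sale.price
    return dict(totals)


# Hypothetical sample data.
SALES = [
    Sale("02139", "dress", 1250.0, date(2014, 12, 1)),
    Sale("02139", "dress", 980.0, date(2014, 12, 3)),
    Sale("94103", "dress", 1100.0, date(2014, 12, 2)),
    Sale("94103", "belt", 20.0, date(2014, 12, 2)),     # wrong category
    Sale("94103", "dress", 60.0, date(2014, 12, 5)),    # outside the price band
    Sale("02139", "dress", 1200.0, date(2014, 6, 15)),  # outside the date range
]

HOLIDAY_DRESSES = sales_by_zip(
    SALES, "dress", 900.0, 1500.0, date(2014, 11, 15), date(2014, 12, 31)
)
print(HOLIDAY_DRESSES)
```

In a real deployment the same filter-and-group logic would typically be pushed down to the query engine rather than run in application code, so that only the aggregated result travels to the visualization.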
Direct service: Kaiser Permanente
Let’s turn now to a very different organization: a major health provider interested in reducing costs while maintaining a high quality of care. Kaiser Permanente is one of the largest not-for-profit health plans in the US, serving more than 11.3 million members. It has 38 hospitals and 626 outpatient facilities.
Although data on patient interactions is hard to find, some relevant data from 2007 were reported. In that year, Kaiser members in Hawaii made an average of 3.7 office visits per year, 1.68 telephone contacts, and 0.23 secure text messages.
If the average Kaiser patient has these 5.61 interactions per year, Kaiser as a whole experiences more than 63 million interactions with patients per year, or 173,679 per day. Kaiser is trying to shift interactions from office visits to less expensive phone and text messaging contacts, but overall, the volume of all these things is likely to increase as the organization pursues the health care field’s overarching goal of more patient engagement.
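The interaction estimate above follows from multiplying the membership by the per-member contact rates. The sketch below restates that arithmetic; like the text, it assumes the 2007 Hawaii rates apply to all 11.3 million members, and the variable names and 365-day year are our own.

```python
MEMBERS = 11_300_000             # membership figure quoted in the text

# Average contacts per member per year (Hawaii, 2007), as quoted in the text.
OFFICE_VISITS = 3.7
PHONE_CONTACTS = 1.68
SECURE_MESSAGES = 0.23

interactions_per_member = OFFICE_VISITS + PHONE_CONTACTS + SECURE_MESSAGES
interactions_per_year = MEMBERS * interactions_per_member
interactions_per_day = interactions_per_year / 365

print(f"Per member per year: {interactions_per_member:.2f}")
print(f"Per year:            {interactions_per_year:,.0f}")
print(f"Per day:             {interactions_per_day:,.0f}")
```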
What would a Kaiser manager look for in the data? Patients suffer from multiple medical conditions that sometimes interact (for instance, obesity can contribute to knee problems), so Kaiser’s volume and velocity of data is matched by variety. In addition to relatively structured data on billing and tests, crucial information is stored in free-text notes. Surveying all this data, a manager might want to know the medical conditions associated with the most visits, whether an increase in interactions (such as text messages) is correlated with medical improvements, and whether an increase in one type of interaction leads to an increase or a decrease in other types. Real-time data may be crucial to management as well, such as when a suspected epidemic strikes and managers want to know what geographic regions are affected.
Pharmaceutical research and development
Pharmaceutical research and development (R&D) organizations make heavy use of analytics to find new drugs and modify treatments. Input to the process may consist of billions of rows of data. With huge data sets and multiple sources, data crunching can take hours with a traditional BI environment, but in order to work effectively visual analytics needs to respond in seconds. Achieving this reduction in time lag will allow significant improvements in the services of any pharmaceutical R&D organization. Fast response time could extend the use of analytics from the research lab to the front line. For instance, armed with responsive and interactive visual analytics, the organization’s salespeople could visualize for each doctor the history and current state of treatment outcomes, drilling down to nation, state, and their own patients. The benefit goes beyond sales efficiency, as it impacts health care overall, leading to more effective treatments.
Insurance: Markerstudy
Markerstudy, a leading UK insurance company, is using a Big Data platform across key areas of operation to get 360-degree customer insights, achieving a 120% increase in policy counts in 18 months. Matching the right price to an insurance quote is critical; one must weigh the risks of offering premiums too high to compete against those of offering a price too low to be profitable. Efficient quotes rely on a huge number of factors, such as driving records and credit history. Online insurance providers may be offering millions of