Ten Signs of a Mature Data Science Capability
Ten Signs of Data Science Maturity
by Peter Guerra and Kirk Borne
Copyright © 2016 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Tim McGovern
Production Editor: Melanie Yarbrough
Copyeditor: Melanie Yarbrough
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
February 2016: First Edition
Revision History for the First Edition
2016-03-07: First Release
Cover photo: Olafur Eliasson’s glass front by tristanf
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Ten Signs of Data Science Maturity and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-95252-8
[LSI]
Ten Signs of a Mature Data Science Capability
If you want to build a ship,
don’t drum up people to collect wood,
and don’t assign them tasks and work,
but rather teach them to long for the endless
immensity of the sea
Antoine de Saint-Exupéry
Over years of working with US government, commercial, and international organizations, we have had the privilege of helping our clients design and build a data science capability to support and drive their missions. These missions have included improving health, defending the nation, improving energy distribution, serving citizens and veterans better, improving pharmaceutical discovery, and more.
Often, our engagements have turned into exercises in transforming how the organization operates: “building a capability” means building a culture to support and make the most of data science. In many cases, this culture change has delivered significant insights into big challenges the world faces, such as poverty, disease outbreaks, and ocean health. We have encountered a wide variety of successful organizational structures, skill levels, technologies, and algorithmic patterns.
Based on those experiences, we share here our perspective on how to assess whether the data science capability that you are developing within your own organization is achieving maturity. In no particular order, here are our top ten characteristics of a mature data science capability.
A mature data science organization…
1 …democratizes all data and data access.
Let’s make one thing clear from the start: Silos suck! Most organizations that are early on the data science learning curve spend most of their time assembling data, not analyzing it. Mature data science organizations realize that in order to be successful they must enable their members to access and use all available data: not some of the data, not a subset, not a sample, but all data. A lawyer wouldn’t go to court with only some of the evidence to support their case; they would go with all appropriate evidence. Similarly, mature data science organizations use all of their data to understand their business domain, needs, and performance. Successful organizations take the time to understand all the data they collect, to understand its uses and content, and to allow easy access.
Some recent articles have suggested that big data and data science are mutually exclusive: focusing on increasing data-gathering (“big data”) comes at the expense of quality analysis (“data science”). We disagree. They are mutually conducive to discovery, data-driven decision-making, and big return on analytics innovation. Big data isn’t about the volume of data nearly as much as it is about “all data”: stitching diverse data sources together in new and interesting ways that facilitate data science exploration and exploitation of all data sources for powerful predictive and prescriptive analysis. You can’t have mature data science without democratizing access to all data. That means standardizing metadata, access protocols, and discovery mechanisms. You aren’t mature until you have done that for all data.
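As a rough illustration of what standardized metadata and a discovery mechanism might look like, the Python sketch below builds a tiny in-memory catalog. Everything in it is hypothetical and chosen for illustration only: the `DatasetRecord` fields, the `Catalog` class, the dataset names, and the storage locations. A real organization would rely on its own catalog tooling and governance rules.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetRecord:
    """One standardized metadata entry (hypothetical field set)."""
    name: str
    owner: str
    location: str            # where the data lives, e.g. a URI
    description: str
    tags: frozenset = field(default_factory=frozenset)


class Catalog:
    """A minimal in-memory discovery index over dataset metadata."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        # Every dataset is described with the same fields, so it can be found.
        self._records[record.name] = record

    def search(self, keyword):
        """Return names of datasets whose description or tags mention keyword."""
        kw = keyword.lower()
        return sorted(
            r.name
            for r in self._records.values()
            if kw in r.description.lower() or kw in {t.lower() for t in r.tags}
        )


catalog = Catalog()
catalog.register(DatasetRecord(
    name="claims_2015",
    owner="finance",
    location="s3://example-bucket/claims/2015/",   # made-up location
    description="Insurance claims filed in 2015",
    tags=frozenset({"claims", "finance"}),
))
catalog.register(DatasetRecord(
    name="clinic_visits",
    owner="operations",
    location="s3://example-bucket/visits/",        # made-up location
    description="Outpatient clinic visits, linked to claims",
    tags=frozenset({"health"}),
))

print(catalog.search("claims"))
```

The design point is only that every dataset is described with the same fields, so anyone in the organization can search across all of it rather than asking a single gatekeeper.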
Here is where cultural incentives are so important. We’ve seen too many organizations that still use data as a lever of power: we hear that we can’t get data because a single person is the data steward and access has to be controlled. Governance is essential, but it can’t be a pretext for one person or group maintaining power by controlling access to data. Let go, and let data discovery and innovation begin!
2 …uses Agile for everything and leverages DataOps (i.e., DevOps for Data Product Development).
Some traditional organizations are stuck in older ways of managing processes and development. If your IT and development departments ask for requirements and expect to deliver a year or more out, then your organization may be one of them. These organizations are resistant to change; consequently, requests for new tools and methods go before review boards and endless architecture/design committees to justify the expenditure. Often, a large effort will be funded simply to study whether the proposed solution will work. Other times, a committee will decide which analytic problems are the most pressing. Paralysis of analysis must be broken in order to achieve data science maturity and success. Bureaucracy doesn’t work well in science, and it doesn’t work in data science either. Science celebrates exploratory, agile, fast-fail experimental design (see “7 …celebrates a fast-fail collaborative culture.”).
Just as Agile development has championed user stories and short iterations over long drawn-out requirements and delayed delivery, Agile data science requires both close collaboration within the business and the freedom to experiment. Agile is not a software development methodology; it is a mindset. It permeates all levels of the mature organization. When was the last time your CEO or senior manager held a retrospective or Scrum meeting? Understanding how to promote a flexible culture, organization, and technology that work together can be challenging, but it is immensely rewarding because of the collaboration and creativity it cultivates.
An agile DevOps methodology for data product development is critical. We call this DataOps. DataOps works on the same principles as DevOps: tight collaboration between product developers and the operational end users; clear and concise requirements gathering and analysis rounds; shorter iteration cycles on product releases (including successes and fast-fail opportunities); faster time to market; better definition of your MVP (Minimum Viable Product) for quick wins with lower product failure rates; and generally creating a dynamic, engaging team atmosphere across the organization. In addition to these general Agile characteristics, DataOps accelerates current data analytics capabilities, naturally exploits new fast data architectures (such as schema-on-read data lakes), and enables previously impossible analytics. With a sharpened focus on each MVP and the corresponding Scrum sprints, DataOps minimizes team downtime from both lengthy review cycles and the costs of cognitive switching between different projects.
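Schema-on-read, mentioned above as an example of a fast data architecture, means that raw records are stored as they arrive and a structure is applied only when someone asks a question of them. The following is a minimal Python sketch of that idea, assuming raw events arrive as JSON lines. The field names, the sample records, and the `read_with_schema` helper are all hypothetical.

```python
import json

# Raw events are stored untouched; different producers used different fields.
RAW_LINES = [
    '{"user": "a1", "amount": "19.99", "ts": "2016-01-05"}',
    '{"user": "b2", "amount": 5, "channel": "web"}',
    '{"user": "c3"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields, coerce types, fill defaults.

    schema maps field name -> (converter, default).
    """
    for line in lines:
        raw = json.loads(line)
        row = {}
        for name, (convert, default) in schema.items():
            row[name] = convert(raw[name]) if name in raw else default
        yield row


# Two analyses read the same raw data through different schemas.
revenue_schema = {"user": (str, None), "amount": (float, 0.0)}
channel_schema = {"user": (str, None), "channel": (str, "unknown")}

revenue_rows = list(read_with_schema(RAW_LINES, revenue_schema))
channel_rows = list(read_with_schema(RAW_LINES, channel_schema))

total = sum(r["amount"] for r in revenue_rows)
print(round(total, 2))
print([r["channel"] for r in channel_rows])
```

Because no schema is fixed at write time, a new question only needs a new schema definition, which fits the short iteration cycles described above.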
Mature data science capability reaches its full potential in an agile DataOps environment.
3 …leverages the crowd and works collaboratively with businesses (i.e., data champions, hackathons, etc.).
Data science groups that live in a bubble are missing out on the best community out there. Activities that promote data science for social good, including open or internal competitions (like Kaggle), are a great way to sharpen skills, learn new ones, or just generally collaborate with other parts of the business.
In addition, mature data science teams don’t try to go it alone, but instead work collaboratively with the rest of the organization. One successful tactic is sponsoring internal data science competitions, which are great for team building and integration. The mature data science organization has a collaborative culture in which the data science team works side by side with the business to solve critical problems using data.
Another approach is internal crowdsourcing (within your organization), which is particularly strong for surfacing the best questions for data scientists to tackle. The mature data science capability crowdsources internally several different tasks in the data science process lifecycle, including data selection; data cleaning; data preparation and transformations; ensemble model generation; model evaluation; and hypothesis refinement (see “4 …follows rigorous scientific methodology (i.e., measured, experimental, disciplined, iterative, refining hypotheses as needed).”). Since data cleaning and preparation can easily consume 50–80% of a project’s entire effort, you can accrue significant project time savings and risk reduction by parallelizing (through crowdsourcing) those cleaning and preparation efforts, especially by crowdsourcing to those parts of the organization that are most familiar with particular data products and databases.
Also, algorithms don’t solve all problems. It is still incredibly difficult for an algorithm to understand all possible contexts of an outcome and pick the right one. Humans must still be in the loop, and a deep understanding of the context of the challenge is essential to interpreting data soundly and creating accurate models.
4 …follows rigorous scientific methodology (i.e., measured, experimental, disciplined, iterative, refining hypotheses as needed).
Exploratory and undisciplined are not compatible. Data science must be disciplined. That does not mean constrained, unimaginative, or bureaucratic. Some organizations hire a few data scientists, sit them in cubes, and expect instant results. In other cases, the data scientists work within an IT organization that is focused on operations, not discovery and innovation. Mature data science capability is built on the foundation of the scientific method. First, make observations (i.e., collect data on the objects, events, and processes that affect your business): collect data in order to understand your business by embedding measurement systems or processes (or people) at appropriate places in your business workflow. Think of interesting questions to explore, and then formulate testable hypotheses with your business partners. Once you have a good set of questions and hypotheses, test them: analyze data, develop a data science model, or design a new algorithm to validate each hypothesis, or else refine the hypothesis and iterate. Applying formal scientific rigor in this way ensures that the work creates value. That’s an undeniable sign of mature data science capability.
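To make the “formulate a testable hypothesis, then test it” step concrete, here is a minimal sketch of a permutation test in Python using only the standard library. The permutation test is our choice of illustration, not a method the text prescribes, and the two samples (daily orders before and after a process change) are invented numbers.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    Null hypothesis: the group labels do not matter, so shuffling them
    should produce differences as large as the observed one fairly often.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_permutations


# Invented example: daily orders before and after a process change.
before = [102, 98, 110, 95, 105, 99, 101, 97]
after = [120, 115, 125, 118, 122, 119, 121, 117]

observed_diff, p_value = permutation_test(before, after)
print(f"observed difference in means: {observed_diff:.2f}")
print(f"estimated p-value: {p_value:.4f}")
```

A small estimated p-value suggests the hypothesis “the change made no difference” is hard to sustain; a large one sends you back to refine the hypothesis and iterate.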
A key part of the scientific process is knowing the limits of your sample. Looking for and testing for selection bias is key. Similarly, it is important to understand that “big data” does not spell the end to incomplete samples (unfair sampling) or sample variance (natural diversity).
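The point that more data does not cure a biased sample can be checked with a small simulation. The Python sketch below is a toy built on stated assumptions: the population of satisfaction scores, the rule that only customers scoring 6 or more respond, and every number in it are invented for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)

# Invented population: satisfaction scores from 1 to 10 for 100,000 customers.
population = [rng.randint(1, 10) for _ in range(100_000)]
true_mean = mean(population)

# A small but fair sample: every customer is equally likely to be chosen.
fair_sample = rng.sample(population, 500)

# A large but biased sample: only customers scoring 6 or more respond.
biased_sample = [score for score in population if score >= 6]

print(f"true mean:            {true_mean:.2f}")
print(f"fair sample (n={len(fair_sample)}):   {mean(fair_sample):.2f}")
print(f"biased sample (n={len(biased_sample)}): {mean(biased_sample):.2f}")
```

In this toy setup the biased sample is far larger than the fair one, yet its mean sits much further from the population mean, because volume does nothing to correct who ended up in the sample.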
5 …attracts and retains diverse participants, and grants them freedom to explore.
The key word is diverse. What fun is a bunch of math nerds? (Three statisticians go out hunting together. After a while they spot a solitary rabbit. The first statistician takes aim and overshoots the rabbit by one meter. The second aims and undershoots it by one meter. The third shouts out “We got it!”) Some organizations are looking for data scientists who are great coders, who also understand and apply complex applied mathematics, who know a lot about the specific business domain, and who can communicate with all stakeholders. One or two such people may exist; we call them purple unicorns. Mature organizations recognize that data science is a team sport, with each member contributing valuable unique skills and points of view. Among those skills and competencies are these: Advanced Database/Data Management & Data Structures; Smart Metadata for Indexing, Search, & Retrieval; Data Mining (Machine Learning) and Analytics (KDD = Knowledge Discovery from Data); Statistics and Statistical Programming; Data & Information Visualization; Network Analysis and Graph Mining (everything is a graph!); Semantics (Natural Language Processing, Ontologies); Data-intensive Computing (e.g., Hadoop, Spark, Cloud, etc.); Modeling & Simulation (computational data science); and Domain-Specific Data Analysis Tools.
But don’t think that every person must have at least one of those technical skills at the outset. Some of the best data science organizations grow those skillsets from within, by identifying the core aptitudes among their current staff that lead to data science success (even among staff without technology training). Those core aptitudes include the 10 C’s: curiosity (inquisitive), creativity (innovative), communicative, collaborative, courageous problem-solver, commitment to life-long learning, consultative (can-do, will-do attitude), cool under pressure (persistence, resilience, adaptability, and ambiguity tolerance), computational, and critical thinker (objective analyzer).
Diverse perspectives are beneficial on multiple fronts. They make the questions more interesting, but more importantly they make the answers even more interesting, useful, and informative. Answers are given greater context that can yield greater impact. Mature data science capability understands that you need more than just math or computer science folks on projects. The mature organization integrates business experts, SMEs, “data storytellers”, and creative “data artists” seamlessly, and then grants them the freedom to explore and exploit the full power of their data assets. The output from such diverse teams will be richer than that from any purple unicorn. And remember, it is better to have both a horse and a narwhal than a unicorn!
6 …relentlessly asks the right questions, and constantly searches for the next one.
The fundamental building block of a successful and mature data science capability is the ability to ask the right types of questions of the data. This is rooted in the understanding of how the business runs or how any business challenge manifests itself. The best data science team covers all the aptitude requirements mentioned earlier (see “5 …attracts and retains diverse participants, and grants them freedom to explore.”): its members are curious, creative, communicative, and collaborative; they are courageous problem solvers, life-long learners, and doers; and they are resilient.
Mature data science capability is exemplified in the relentless pursuit of new questions to ask (even questions that could never be answered before) and in asking questions of the questions! Data science maturity frees the organization to ask the hard questions across the entirety of the business, is disciplined in how it asks those questions, and is not afraid of getting the […] question to ask of your data (at the right time, in the right context, for the right use case). This “cognitive” ability to come up with not just the right answers but with the right questions (especially questions that were never asked or considered before) is the highest level of both analytics maturity and data science capability maturity. As the adage says: “The only bad question is the one that you don’t ask.”