
Ten signs of data science maturity


Ten Signs of Data Science Maturity

Peter Guerra and Kirk Borne


Ten Signs of Data Science Maturity

by Peter Guerra and Kirk Borne

Copyright © 2016 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Tim McGovern

Production Editor: Melanie Yarbrough

Copyeditor: Melanie Yarbrough

Interior Designer: David Futato

Cover Designer: Randy Comer

Illustrator: Rebecca Demarest

February 2016: First Edition


Revision History for the First Edition

2016-03-07: First Release

Cover photo: Olafur Eliasson’s glass front by tristanf

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Ten Signs of Data Science Maturity and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-95252-8

[LSI]


Ten Signs of a Mature Data Science Capability

If you want to build a ship,

don’t drum up people to collect wood,

and don’t assign them tasks and work,

but rather teach them to long for the endless

immensity of the sea

Antoine de Saint-Exupéry

Over the years, in working with US government, commercial, and international organizations, we have had the privilege of helping our clients design and build a data science capability to support and drive their missions. These missions have included improving health, defending the nation, improving energy distribution, serving citizens and veterans better, improving pharmaceutical discovery, and more.

Often, our engagements have turned into exercises in transforming how the organization operates: “building a capability” means building a culture to support and make the most of data science. In many cases, this culture change has delivered significant insights into big challenges the world faces (poverty, disease outbreaks, ocean health, and so forth). We have encountered a wide variety of successful organizational structures, skill levels, technologies, and algorithmic patterns.

Based on those experiences, we share here our perspective on how to assess whether the data science capability that you are developing within your own organization is achieving maturity. In no particular order, here are our top ten characteristics of a mature data science capability.


A mature data science organization…


1 …democratizes all data and data access.

Let’s make one thing clear from the start: Silos suck! Most organizations early on in the data-science learning curve spend most of their time assembling data and not analyzing it. Mature data science organizations realize that in order to be successful they must enable their members to access and use all available data: not some of the data, not a subset, not a sample, but all data. A lawyer wouldn’t go to court with only some of the evidence to support their case; they would go with all appropriate evidence. Similarly, mature data science organizations use all of their data to understand their business domain, needs, and performance. Successful organizations take the time to understand all the data they collect, to understand its uses and content, and to allow easy access.

Some recent articles have suggested that big data and data science are mutually exclusive: Focusing on increasing data-gathering (“big data”) comes at the expense of quality analysis (“data science”). We disagree. They are mutually conducive to discovery, data-driven decision-making, and big return on analytics innovation. Big data isn’t about the volume of data nearly as much as it is about “all data”: stitching diverse data sources together in new and interesting ways that facilitate data science exploration and exploitation of all data sources for powerful predictive and prescriptive analysis. You can’t have mature data science without democratizing access to all data. That means standardizing metadata, access protocols, and discovery mechanisms. You aren’t mature until you have done that for all data.
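One way to picture standardized metadata and a discovery mechanism is a shared catalog in which every dataset is described by the same fields and can be found by keyword. The Python sketch below is purely illustrative: the `DatasetRecord` fields, the `Catalog` class, and the example datasets are hypothetical and are not taken from this report or from any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical standardized metadata entry; every dataset uses the same fields."""
    name: str
    owner: str
    description: str
    access_protocol: str  # illustrative values: "sql", "rest", "file"
    tags: frozenset = field(default_factory=frozenset)


class Catalog:
    """Minimal in-memory discovery mechanism over standardized records."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        # One entry per dataset name; re-registering replaces the old entry.
        self._records[record.name] = record

    def search(self, keyword):
        # Case-insensitive match against name, description, and tags.
        kw = keyword.lower()
        return sorted(
            r.name
            for r in self._records.values()
            if kw in r.name.lower()
            or kw in r.description.lower()
            or any(kw in t.lower() for t in r.tags)
        )


# Made-up example datasets from two different departments.
catalog = Catalog()
catalog.register(DatasetRecord(
    "claims_2015", "finance", "Insurance claims by region", "sql",
    frozenset({"claims", "region"})))
catalog.register(DatasetRecord(
    "sensor_feed", "operations", "Grid sensor readings", "rest",
    frozenset({"energy", "streaming"})))

print(catalog.search("region"))
```

The point of the sketch is only that a single, uniform record shape lets anyone in the organization search across every department's data; a real deployment would add access control and persistent storage.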

Here is where cultural incentives are so important. We’ve seen too many organizations that still use data as power levers: we hear that we can’t get data because a single person is the data steward and access has to be controlled. Governance is essential, but it can’t be a pretext for one person or group maintaining power by controlling access to data. Let go, and let data discovery and innovation begin!


2 …uses Agile for everything and leverages DataOps (i.e., DevOps for Data Product Development).

Some traditional organizations are stuck in older ways of managing processes and development. If your IT and development departments are asking for requirements and expect to deliver a year or more out, then you may be experiencing this. These organizations are resistant to change; consequently, requests for new tools and methods go before review boards and endless architecture/design committees to justify the expenditure. Often, a large effort will be funded simply to study whether the proposed solution will work. Other times, a committee will decide which analytic problems are the most pressing. Paralysis of analysis must be broken in order to achieve data science maturity and success. Bureaucracy doesn’t work well in science, and it doesn’t work in data science either. Science celebrates exploratory, agile, fast-fail experimental design (see “7 …celebrates a fast-fail collaborative culture.”).

Just as Agile development has championed user stories and short iterations over long drawn-out requirements and delayed delivery, Agile data science requires both close collaboration within the business and the freedom to experiment. Agile is not a software development methodology; it is a mindset. It permeates all levels of the mature organization. When was the last time your CEO or senior manager held a retrospective or SCRUM meeting? Understanding how to promote a flexible culture, organization, and technology that work together can be challenging, but it is immensely rewarding because of the collaboration and creativity it cultivates.

An agile DevOps methodology for data product development is critical; we call this DataOps. DataOps works on the same principles as DevOps: tight collaboration between product developers and the operational end users; clear and concise requirements gathering and analysis rounds; shorter iteration cycles on product releases (including successes and fast-fail opportunities); faster time to market; better definition of your MVP (Minimum Viable Product) for quick wins with lower product failure rates; and generally creating a dynamic, engaging team atmosphere across the organization. In addition to these general Agile characteristics, DataOps accelerates current data analytics capabilities, naturally exploits new fast data architectures (such as schema-on-read data lakes), and enables previously impossible analytics. With a sharpened focus on each MVP and the corresponding SCRUM sprints, DataOps minimizes team downtime from both lengthy review cycles and the costs of cognitive switching between different projects.

Mature data science capability reaches its full potential in an agile DataOps environment.


3 …leverages the crowd and works collaboratively with businesses (i.e., data champions, hackathons, etc.).

Data science groups that live in a bubble are missing out on the best community out there. Activities that promote data science for social good, including open or internal competitions (like Kaggle), are a great way to sharpen skills, learn new ones, or just generally collaborate with other parts of the business.

In addition, mature data science teams don’t try to go it alone, but instead work collaboratively with the rest of the organization. One successful tactic is sponsoring internal data science competitions, which are great for team building and integration. The mature data science organization has a collaborative culture in which the data science team works side by side with the business to solve critical problems using data.

Another approach is internal crowdsourcing (within your organization), which is particularly strong for surfacing the best questions for data scientists to tackle. The mature data science capability crowdsources internally several different tasks in the data science process lifecycle, including data selection; data cleaning; data preparation and transformations; ensemble model generation; model evaluation; and hypothesis refinement (see “4 …follows rigorous scientific methodology (i.e., measured, experimental, disciplined, iterative, refining hypotheses as needed).”). Since data cleaning and preparation can easily consume 50–80% of a project’s entire effort, you can accrue significant project time savings and risk reduction by parallelizing (through crowdsourcing) those cleaning and preparation efforts, especially by crowdsourcing to those parts of the organization that are most familiar with particular data products and databases.

Also, algorithms don’t solve all problems. It is still incredibly difficult for an algorithm to understand all possible contexts of an outcome and pick the right one. Humans must still be in the loop, and a deep understanding of the context of the challenge is essential to solid interpretation of data and to creating accurate models.


4 …follows rigorous scientific methodology (i.e., measured, experimental, disciplined, iterative, refining hypotheses as needed).

Exploratory and undisciplined are not compatible. Data science must be disciplined. That does not mean constrained, unimaginative, or bureaucratic. Some organizations hire a few data scientists and sit them in cubes and expect instant results. In other cases, the data scientists work within the IT organization that is focused on operations, not discovery and innovation.

Mature data science capability is built on the foundation of the scientific method. First, make observations (i.e., collect data on the objects, events, and processes that affect your business): collect data in order to understand your business by embedding measurement systems or processes (or people) at appropriate places in your business workflow. Think of interesting questions to explore, and then formulate testable hypotheses with your business partners. Once you have a good set of questions and hypotheses, then test them: analyze data, develop a data science model, or design a new algorithm to validate each hypothesis, or else refine the hypothesis and iterate. This methodology will ensure that value is created when formal scientific rigor is applied. That’s an undeniable sign of mature data science capability.
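The hypothesize-test-refine loop can be made concrete with a small example. The sketch below uses a permutation test, which is one of many possible tests and not one this report prescribes, to check whether a hypothetical process change shifted an average. The function name, the sample numbers, and the 0.05 threshold are all illustrative assumptions.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in means.

    Null hypothesis: the group labels are interchangeable, i.e. the
    change had no effect on the measured quantity.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements: order handling time (minutes) before and after a change.
before = [32, 35, 31, 36, 34, 33, 37, 35]
after = [28, 27, 30, 26, 29, 28, 27, 29]

p = permutation_p_value(before, after)
if p < 0.05:
    print(f"p = {p:.4f}: the data are inconsistent with 'no effect'")
else:
    print(f"p = {p:.4f}: no evidence of an effect; refine the hypothesis and iterate")
```

A small p-value here only says the observed difference is unlikely under random relabeling; it does not by itself establish why the difference arose, which is where the business partners' context comes in.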

A key part of the scientific process is knowing the limits of your sample. Looking for and testing for selection bias is key. Similarly, it is important to understand that “big data” does not spell the end of incomplete samples (unfair sampling) or sample variance (natural diversity).
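A simple check for selection bias is to compare the makeup of a sample against known proportions in the population it is supposed to represent. The following sketch computes a chi-square goodness-of-fit statistic in plain Python. The region names, the counts, and the population shares are made-up illustrative values; 5.991 is the standard 5% critical value for two degrees of freedom.

```python
def chi_square_statistic(observed_counts, expected_shares):
    """Chi-square goodness-of-fit statistic for a sample against population shares."""
    total = sum(observed_counts.values())
    stat = 0.0
    for category, share in expected_shares.items():
        expected = total * share
        observed = observed_counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical population shares (they sum to 1) and counts seen in the sample.
population_shares = {"north": 0.5, "south": 0.3, "west": 0.2}
sample_counts = {"north": 70, "south": 20, "west": 10}

CRITICAL_5PCT_DF2 = 5.991  # chi-square critical value, alpha = 0.05, 2 degrees of freedom

stat = chi_square_statistic(sample_counts, population_shares)
if stat > CRITICAL_5PCT_DF2:
    print(f"chi-square = {stat:.2f}: sample makeup differs from the population; check for selection bias")
else:
    print(f"chi-square = {stat:.2f}: no evidence that the sample is unrepresentative")
```

Passing such a check does not prove a sample is unbiased; it only covers the attributes you thought to compare, so it complements rather than replaces an understanding of how the data were collected.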


5 …attracts and retains diverse participants, and grants them freedom to explore.

The key word is diverse. What fun is a bunch of math nerds? (Three statisticians go out hunting together. After a while they spot a solitary rabbit. The first statistician takes aim and overshoots the rabbit by one meter. The second aims and undershoots it by one meter. The third shouts out “We got it!”) Some organizations are looking for data scientists who are great coders, who also understand and apply complex applied mathematics, who know a lot about the specific business domain, and who can communicate with all stakeholders. One or two such people may exist; we call them purple unicorns. Mature organizations recognize that data science is a team sport, with each member contributing valuable unique skills and points of view. Among those skills and competencies are these: Advanced Database/Data Management & Data Structures; Smart Metadata for Indexing, Search, & Retrieval; Data Mining (Machine Learning) and Analytics (KDD = Knowledge Discovery from Data); Statistics and Statistical Programming; Data & Information Visualization; Network Analysis and Graph Mining (everything is a graph!); Semantics (Natural Language Processing, Ontologies); Data-intensive Computing (e.g., Hadoop, Spark, Cloud, etc.); Modeling & Simulation (computational data science); and Domain-Specific Data Analysis Tools.

But don’t think that every person must have at least one of those technical skills at the outset. Some of the best data science organizations grow those skillsets from within, by identifying the core aptitudes among their current staff that lead to data science success (even within nontechnology trained staff). Those core aptitudes include the 10 C’s: curiosity (inquisitive), creativity (innovative), communicative, collaborative, courageous problem-solver, commitment to life-long learning, consultative (can-do, will-do attitude), cool under pressure (persistence, resilience, adaptability, and ambiguity tolerance), computational, and critical thinker (objective analyzer).

Diverse perspectives are beneficial on multiple fronts. They make the questions more interesting, but more importantly they make the answers even more interesting, useful, and informative. Answers are given greater context that can yield greater impact. Mature data science capability understands that you need more than just math or computer science folks on projects. The mature organization integrates business experts, SMEs, “data storytellers”, and creative “data artists” seamlessly, and then grants them the freedom to explore and exploit the full power of their data assets. The output from such diverse teams will be richer than that from any purple unicorn. And remember, it is better to have both a horse and a narwhal than a unicorn!


6 …relentlessly asks the right questions, and constantly searches for the next one.

The fundamental building block of a successful and mature data science capability is the ability to ask the right types of questions of the data. This is rooted in the understanding of how the business runs or how any business challenge manifests itself. The best data science team covers all the aptitude requirements mentioned earlier (see “5 …attracts and retains diverse participants, and grants them freedom to explore.”): curious, creative, communicative, collaborative, courageous problem solvers, life-long learners, doers, and resilient.

Mature data science capability is exemplified in the relentless pursuit of new questions to ask (even questions that could never be answered before) and in asking questions of the questions! Data science maturity frees the organization to ask the hard questions across the entirety of the business, is disciplined in how it asks those questions, and is not afraid of getting the […] question to ask of your data (at the right time, in the right context, for the right use case). This “cognitive” ability to come up with not just the right answers but with the right questions (especially questions that were never asked or considered before) is the highest level of both analytics maturity and data science capability maturity. As the adage says: “The only bad question is the one that you don’t ask.”
