MANAGER’S GUIDE
What Leaders Must Know About Data for Machine Learning
MIT SMR Connections
MIT SMR Connections develops content in collaboration with our sponsors. It operates independently of the MIT Sloan Management Review editorial group.
Copyright © Massachusetts Institute of Technology, 2020. All rights reserved.
What Leaders Must Know About Data to Drive Success With Machine Learning
1 Align machine learning initiatives with business priorities
2 Create and maintain a comprehensive view of all data assets
3 Lay the groundwork for data governance
4 Identify the specific roles required to build a strong data foundation for machine learning
Data Management Strategy Checklist
Sponsor’s Viewpoint: Your Data Strategy Is Key to Machine Learning; a Data Lake Can Help
Machine learning is taking predictive analytics to the next level to drive tangible business value for a wide array of industries. Algorithms allow credit card companies to detect fraud in real time and help retailers direct offers to the customers most likely to respond. In health care, tools powered by machine learning help doctors transcribe notes more easily so they can focus on patient care. Manufacturers can take in data from sensors on plant equipment and recommend maintenance before malfunctions cause production delays.
But machine learning models are only as good as the data they ingest. “If data is not clean, if it’s not accessible, if it isn’t stitched together to form a strong foundation, the machine learning and artificial intelligence capabilities built on top of it will have problems,” warns Ashok Srivastava, senior vice president and chief data officer at financial software provider Intuit. This can lead to difficulties such as inaccurate insights or inherent bias, factors that can hamper intelligent business decision-making.
Fortunately, businesses can avoid these perils by designing a data management strategy that develops new capabilities, initiatives, and roles around machine learning. This guide aims to share lessons from business leaders and industry experts on how, with the right policies and frameworks in place, data can serve as a strategic corporate asset.
1 Align machine learning initiatives with business priorities
The first step in creating an enterprise data management strategy is understanding the business’s goal for machine learning. For example, Intuit’s machine learning initiatives aim to improve customer service by providing personalized recommendations to subscribers of its accounting and tax software programs. An online retailer may plan to use machine learning to create more-effective targeted marketing campaigns, while an automotive manufacturer may be building machine learning systems to predict equipment failures.
Establishing which of a business’s strategic priorities have the best potential to be advanced via machine learning provides clarity around which data sets are most important to collect, store, and prepare for analysis.
“Being focused on knowing what data is truly driving your business and matters most is the first piece to a data strategy,” says Juan Tello, chief data officer at Deloitte Consulting and principal in its Strategy & Analytics practice. “So, for example, if business priorities are to win more customers and provide more-competitive pricing based on the products a company sells, that requires three critical data domains: customer data, pricing data, and product data. Prioritizing the data strategy on those areas as a starting point will maximize business outcomes. Organizations should also reevaluate and adjust as their business priorities change.”
This focus is essential, given the vast volumes of data generated by enterprise applications, connected devices, and customer interactions via the web or social media platforms, to name just a few sources. By narrowing the scope for data management to three or four key sources, businesses can focus on those data sets that will deliver the most value.
2 Create and maintain a comprehensive view of all data assets
For data to be useful, a business must know it exists. Unfortunately, legacy systems, mergers and acquisitions, and poor data onboarding practices can create silos of unidentified and untagged information.
At Intuit, data management experts “meet with the teams that own data systems or data pipelines, and we start to build a catalog of that information. That means understanding what data they have and how it is stored.” The result, says Srivastava, is “a robust list of data assets that we have within the company.”
But data troves are constantly evolving as businesses deploy new systems. GE Healthcare offers a perfect example of how to stay ahead of the curve. The manufacturer of diagnostic imaging equipment, which uses machine learning algorithms to improve traditional imaging technologies like CT scanning and X-ray, continuously works with collaborators and partners to inventory and onboard de-identified data. A dedicated team of data specialists receives, processes, and properly catalogs contractually de-identified data sets and then uploads them for use in AI development. This process leads to greater data transparency and availability.
Business leaders must also be held accountable for maintaining a comprehensive view of data assets. At GE Healthcare, chief data officer Derek Danois says, broad communication and transparency are key to building trust: Business units now collaborate so that the company knows the moment a new data set becomes available.
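As a rough illustration of the catalog described in this section, the sketch below records each data asset with its owning team, storage location, and tags, and then surfaces untagged assets. The class names, fields, and sample assets are hypothetical and are not drawn from Intuit or GE Healthcare; a production catalog would persist its entries rather than hold them in memory.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset, as recorded after meeting with the owning team."""
    name: str
    owner_team: str
    storage: str  # where and how the data is stored
    tags: list[str] = field(default_factory=list)


class DataCatalog:
    """In-memory catalog keyed by asset name (illustrative only)."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Re-registering an asset overwrites the old record, so the
        # catalog stays current as systems change.
        self._entries[entry.name] = entry

    def find_by_tag(self, tag: str) -> list[str]:
        # Asset names carrying the tag, sorted for stable output.
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def untagged(self) -> list[str]:
        # Untagged assets are the ones most likely to end up in a silo.
        return sorted(e.name for e in self._entries.values() if not e.tags)


catalog = DataCatalog()
catalog.register(CatalogEntry("invoices", "billing", "warehouse table", ["customer", "pricing"]))
catalog.register(CatalogEntry("support_tickets", "care", "object storage", ["customer"]))
catalog.register(CatalogEntry("legacy_export", "unknown", "file share"))

print(catalog.find_by_tag("customer"))  # ['invoices', 'support_tickets']
print(catalog.untagged())               # ['legacy_export']
```

Listing untagged assets is one simple way to spot the unidentified information that legacy systems and acquisitions tend to leave behind.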
3 Lay the groundwork for data governance
At the core of every data management strategy is data governance: a set of rules and systems that ensures that data is secure, handled in compliance with applicable regulations, accessible, and usable.
Data security and compliance with privacy laws are table stakes and as such have been the primary drivers of data governance for most enterprises. In addition to guarding against intruders via cybersecurity measures that protect the IT perimeter, businesses must also establish controls that limit how data is accessed, used, and managed by employees. This typically means granting different access levels depending on variables such as role, tenure, and function. Compliance with regulations such as the European Union’s GDPR (General Data Protection Regulation) and similar requirements in other jurisdictions means that companies must also be prepared to explain to consumers how their data is being used to make decisions that affect them.
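To make the idea of differing access levels concrete, here is a minimal sketch of a deny-by-default access check. It keys the policy on role and data domain only; tenure and function, which the text also mentions, are left out for brevity. The role names, domains, and policy table are assumptions made for illustration, not a description of any company's actual controls.

```python
from enum import IntEnum


class Access(IntEnum):
    NONE = 0
    READ = 1
    WRITE = 2


# Hypothetical policy: (role, data domain) -> highest permitted access.
POLICY = {
    ("data_scientist", "customer"): Access.READ,
    ("data_steward", "customer"): Access.WRITE,
    ("data_scientist", "pricing"): Access.WRITE,
}


def is_allowed(role: str, domain: str, requested: Access) -> bool:
    """Deny by default: a missing policy entry grants no access."""
    granted = POLICY.get((role, domain), Access.NONE)
    return requested <= granted


print(is_allowed("data_scientist", "customer", Access.READ))   # True
print(is_allowed("data_scientist", "customer", Access.WRITE))  # False
print(is_allowed("intern", "customer", Access.READ))           # False
```

Keeping the policy in a single table also gives compliance teams one place to review who can do what with each data domain.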
Another key component of data governance is quality: A machine learning model’s output depends on the quality of its training data.
[...] metrics. A medical imaging study might be vetted for standard-of-care parameters (such as slice thickness or scan geometry), field of view (the area of a scanned object), and metadata content requirements. If quality standards are met, GE Healthcare de-identifies or anonymizes the data and establishes a chain of custody that chronicles the data’s control, transfer, and analysis, before it’s uploaded for use in AI development.
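The vet, de-identify, and log sequence described above might be sketched as follows. This is not GE Healthcare's pipeline: the field names, the slice-thickness limit, and the log format are all hypothetical, and real de-identification involves far more than dropping a few fields.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical field names and threshold, chosen only for illustration.
REQUIRED_METADATA = {"modality", "slice_thickness_mm", "field_of_view_mm"}
IDENTIFYING_FIELDS = {"patient_name", "patient_id", "birth_date"}


def meets_quality(study: dict, max_slice_mm: float = 5.0) -> bool:
    """Check metadata completeness and one standard-of-care parameter."""
    if not REQUIRED_METADATA.issubset(study):
        return False
    return study["slice_thickness_mm"] <= max_slice_mm


def de_identify(study: dict) -> dict:
    """Return a copy of the study without the identifying fields."""
    return {k: v for k, v in study.items() if k not in IDENTIFYING_FIELDS}


def onboard(study: dict, custody_log: list, actor: str) -> Optional[dict]:
    """Vet the study, strip identifiers, and record a custody event."""
    if not meets_quality(study):
        return None  # rejected studies never reach AI development
    clean = de_identify(study)
    custody_log.append({
        "event": "de-identified",
        "actor": actor,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return clean


log: list = []
study = {
    "modality": "CT",
    "slice_thickness_mm": 1.25,
    "field_of_view_mm": 350,
    "patient_name": "REDACT ME",
}
result = onboard(study, log, actor="data-team")
print(sorted(result))  # ['field_of_view_mm', 'modality', 'slice_thickness_mm']
```

Appending to the log only after a study passes vetting keeps the custody record limited to data that actually moves forward.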
Maintaining consistently high levels of data quality calls for continuous monitoring of metrics and key performance indicators such as accuracy, timeliness, consistency, and integrity, a process that can become overwhelming, according to Tello. Using AI-powered data quality tools can accelerate the ability to manage and govern data, he says. Enterprise master data management software can also ease the burden by creating a single master reference source for all critical business data, thereby reducing redundancies and the likelihood of errors.
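As a simple illustration of metric monitoring, the sketch below computes two of the indicators named above, integrity and timeliness, over a small batch of records and flags any that fall below a threshold. The metric definitions, the record schema, and the 0.9 threshold are assumptions made for this example; organizations define these measures in their own terms.

```python
from datetime import date, timedelta

REQUIRED_FIELDS = ("customer_id", "price", "updated")  # hypothetical schema


def integrity(records: list) -> float:
    """Share of records in which every required field is present and not None."""
    if not records:
        return 0.0
    ok = sum(all(r.get(f) is not None for f in REQUIRED_FIELDS) for r in records)
    return ok / len(records)


def timeliness(records: list, today: date, max_age_days: int = 30) -> float:
    """Share of records updated within the last max_age_days."""
    if not records:
        return 0.0
    cutoff = today - timedelta(days=max_age_days)
    fresh = sum(
        r.get("updated") is not None and r["updated"] >= cutoff for r in records
    )
    return fresh / len(records)


records = [
    {"customer_id": 1, "price": 9.99, "updated": date(2020, 3, 1)},
    {"customer_id": 2, "price": None, "updated": date(2020, 3, 10)},
    {"customer_id": 3, "price": 4.50, "updated": date(2019, 11, 5)},
    {"customer_id": 4, "price": 7.25, "updated": date(2020, 2, 20)},
]
today = date(2020, 3, 15)

metrics = {"integrity": integrity(records), "timeliness": timeliness(records, today)}
alerts = [name for name, value in metrics.items() if value < 0.9]

print(metrics)  # {'integrity': 0.75, 'timeliness': 0.75}
print(alerts)   # ['integrity', 'timeliness']
```

Running checks like these on a schedule, and alerting on thresholds rather than reviewing every number, is one way to keep continuous monitoring from becoming overwhelming.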
4 Identify the specific roles required to build a strong
data foundation for machine learning
An explosion of new data science job titles has raised questions regarding who is responsible for which tasks within a machine learning practice. A well-thought-out organizational structure can make sense of this landscape by clarifying roles and delineating responsibilities.
Some of the key roles required to execute a data management strategy include the following:
• Chief digital/data officer: Oversees all digital functions, provides support and leadership, and articulates a strategy for data governance that’s consistent across the company.
• Data scientist: Creates tools or processes based on machine learning and applies them to well-defined business problems.
• Decision scientist: Uses expertise in technology, math, and statistics, along with business domain knowledge, to enable informed decision-making.
• Compliance/legal team member: Handles privacy, compliance, data rights, and regulatory aspects impacting a business.
Ancillary positions include data management specialist, business intelligence specialist, and data architect.
But there’s also a place for sales executives, HR managers, and chief marketing officers in machine learning initiatives. “The business owners who are making decisions on a daily basis are some of the most important contributors to our overall data strategy,” says Intuit’s Srivastava.
That’s because business leaders possess domain knowledge: an in-depth understanding of the relevant data within the enterprise, the processes that generate useful data, what data might be useful for a model, and how different variables might impact a model’s output. Without this guidance, businesses risk creating machine learning applications that don’t deliver useful results.
Looking Forward
Machine learning has the potential to improve results in nearly every aspect of business. But to harness it, businesses need a data management strategy that will continuously improve the quality, integrity, access, and security of data.
DATA MANAGEMENT STRATEGY CHECKLIST
Keep the following practices in mind to successfully design and execute a data management strategy in support of machine learning:
✓ Establish rules and processes around how data is sourced, managed, accessed, and used across the business.
✓ Ascertain which data sets are driving the business and how they can be used to help solve problems, generate revenue, and deliver customer benefits.
✓ Inventory known data assets, classify them, and organize them in a data catalog.
✓ Meet with the teams that own and operate data systems to better understand what data they have and how it is stored.
✓ Understand where your data comes from, who has access, and how it can be used.
✓ Establish internal security precautions (such as provisioning user access), as well as external safeguards (such as anonymizing data), to protect sensitive data.
✓ Create access controls that set limitations around how data is accessed and how it might be used.
✓ Design processes and systems to ensure that data created is accurate and useful.
✓ Identify specific roles required to build a strong data foundation, including chief digital officer, data scientist, decision scientist, and compliance team member.
Your Data Strategy Is Key to Machine Learning; a Data Lake Can Help
Machine learning success is highly dependent on having relevant and high-quality data. Without a proper data strategy in place, machine learning initiatives fail to scale. Worse yet, if the machine learning models are informed by bad data, the results they generate may be misleading or even incorrect.
The right data strategy for machine learning should aim to break down silos, enabling your IT teams to easily, quickly, and securely access and collect the data they need. While modern data strategies take many forms, data lakes are becoming an increasingly popular core component of the most efficient models. Data lakes offer more agility and flexibility than traditional data management systems, allowing organizations to manage multiple data types from a wide variety of sources and to store the data, whether structured or unstructured, in a centralized repository.
Once stored, the data can be leveraged by many types of analytics and machine learning services faster and more efficiently than with traditional, siloed approaches. Data lake architectures also enable multiple groups within the organization to benefit from analyzing a consistent pool of data that spans the entire business. For help developing a more holistic data strategy that includes data lakes, interact with the AWS Data Flywheel.
Amazon’s ML Solutions Lab program can also help you build the right data strategy. The Amazon ML Solutions Lab pairs your team with Amazon machine learning experts to prepare data, build and train models, and put models into production. It combines hands-on educational workshops with brainstorming sessions and advisory professional services to help you essentially work backward from business challenges and then go step-by-step through the process of developing solutions based on machine learning. Moreover, one of our machine learning partners can also help you build the right data strategy for your machine learning initiatives. AWS Machine Learning Competency Partners have demonstrated relevant expertise and offer a range of services and technologies to help you create intelligent solutions for your business, from enabling data science workflows to enhancing applications with AI services. Learn more at aws.ai.
About Amazon Web Services
AWS offers the broadest and deepest set of machine learning and AI services. On behalf of our customers, we are focused on solving some of the toughest challenges that hold back machine learning from being in the hands of every developer. Tens of thousands of customers are already using AWS for their machine learning efforts. You can choose from fully managed AI services for computer vision, language, recommendations, forecasting, fraud detection, and search; or Amazon SageMaker to quickly build, train, and deploy machine learning models at scale. SageMaker Studio offers the first fully integrated development environment for machine learning. You can also build custom models with support for all of the popular open-source frameworks. Our capabilities are built on the most comprehensive cloud platform, optimized for machine learning with high-performance computing and no compromises on security and analytics. Learn more at aws.ai.