

Ian Eslick, Tuhin Sinha, Roger Magoulas, and Rob Rustad

Navigating the Health Data Ecosystem



Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

May 2015: First Edition

Revision History for the First Edition

2015-05-05: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Navigating the Health Data Ecosystem, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.


Table of Contents

The “Six C’s”: Understanding the Health Data Terrain in the Era of Precision Medicine

Background

Introduction

Complexity: Enormous Domain, Noisy Data, Not Designed for Machine Consumption

Computing: Standards and Inter-System Exchangeability

Context: Critical Metadata for Accurate Interpretation

Culture: Lean Start-Up Difficulties in Hospital Ecosystems

Contracts: Navigating IRB, HIPAA, and EULA Frameworks

Commerce: How Do Digital Health Start-Ups Get Paid?

Summary

The “Six C’s”: Understanding the Health Data Terrain in the Era of Precision Medicine

Background

A few years ago, O’Reilly became interested in health topics, running the Strata RX conference, writing a report on “How Data Science is Transforming Health Care: Solving the Wanamaker

grew to include people in the health care space, informing our nascent thoughts about data in the age of the Affordable Care Act and the problems and opportunities facing the health care industry. We had the notion that aggregating data from traditional and new device-based sources could change much of what we understand about medicine, thoughts now captured by the concept of “precision medicine.” From that early thinking, we developed the framework for a grant with the Robert Wood Johnson Foundation (RWJF) to explore the technical, organizational, legal, privacy, and other issues around aggregating health-related data for research, and to provide empirical lessons for organizations also interested in pushing for data in health care initiatives. This report begins the process of sharing what we’ve learned.

Introduction

After decades of maturing in more aggressive industries, data-driven technologies are being adopted, developed, funded, and deployed throughout the health care market at an unprecedented scale. February 2015 marked the inaugural working group meeting of the newly announced NIH Precision Medicine Initiative, designed to aggregate a million-person cohort of genotype/phenotype-dense longitudinal health data, where donors provide researchers with the raw epidemiological evidence to develop better decision-making, treatment, and potential cures for diseases like cancer. In the past several years, many established companies and new start-ups have also started to apply collective intelligence and “big data” platforms to health and health care problems. All these efforts encounter a set of unique challenges that experts coming from other disciplines do not always fully appreciate.

In 2014, the Robert Wood Johnson Foundation funded the subject of this report, a research effort called “Operationalizing Health Data,” a deep dive into the health care ecosystem focused on understanding and advancing the integration of personalized health data in both clinical and research organizations. RWJF encouraged the small group of data scientists, innovators, and health researchers working on the grant to find and prototype concrete solutions to problems facing several partner organizations trying to leverage the value of health data. The research is intended to give innovation teams, often coming from non-health-related industries, empirical insight into the messy details of using and making sense of data in the heavily regulated hospital IT environment.

This report describes key learnings identified by the project across six major facets of the health data ecosystem: complexity, computing, context, culture, contracts, and commerce. In future reports, we will focus on specific tactical challenges the project team addressed.

Complexity: Enormous Domain, Noisy Data, Not Designed for Machine Consumption

In marked contrast to much of the data generated in enterprise or consumer markets, health care data is exceedingly complex, and this complexity makes direct application of the techniques we’ve learned in other industries surprisingly challenging. Underlying this is the simple fact that the human organism has no closed-form solution. Despite thousands of years of study, our real-world comprehension of human physiology is largely indirect and sparse. When coupled with the other challenges already inherent in data-intensive applications, the fact that we don’t necessarily know the root causes for measured chemical and biological changes makes health data analysis and analytics particularly demanding.

Nearly all data derived from a biological system is messy, whether captured via device, blood test, medical record, or survey. Working with health data requires understanding the innate challenges of the data as well as managing many other difficulties, such as:

• Measurements are not typically stable; there are many possible sources of variation

• Electronic Medical Record (EMR) discrete data is often entered by hand; even parsing can be challenging

• The same underlying data can be encoded or labeled in multiple ways

• A vast array of legacy systems and protocols must often be navigated

• Personal health data tends to be dominated by longitudinal/time-series data; interpretation of this data is not necessarily well understood by either researchers or clinicians in practice

We can see examples of these challenges in work performed with a partner developing a personal health app that presents a history of laboratory test results to a patient. Laboratory measurements such as Serum Albumin, a measure of the blood concentration of a protein produced by the liver, provide evidence of potential health problems or risk factors. The goal of the app is to enhance the clinical visit and give patients agency by helping the patient reflect on the history of their test results along with questions they might want to ask their clinician. It’s a simple concept to describe, but not so simple to execute.

The value produced by a blood test for a single patient will vary from laboratory to laboratory as well as periodically over time. To provide a point of comparison, laboratories provide reference ranges with their test results. These ranges define what is a normal (in range) or abnormal (out of range) result. Reference ranges typically tell you whether you are in the same basic range as 80-95% of the population, but they do not typically tell you whether a given measure is significant for you personally. In our test population, the lower threshold at which a Serum Albumin measure became “abnormal” was 3.3 or 3.8 g/dL, depending on the laboratory used. Given that the mean value of all samples together was 3.3, these thresholds become central to determining when a patient is at obvious risk.
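To make the laboratory dependence concrete, here is a minimal Python sketch in which the same Serum Albumin value is labeled differently depending on which laboratory's reference range is applied. The lower thresholds of 3.3 and 3.8 g/dL come from the text above; the laboratory names, the upper bound of 5.0 g/dL, and the sample value are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceRange:
    """Reference range for one test at one laboratory, in g/dL."""
    low: float
    high: float


# Lower thresholds are the two values reported in the text.
# Lab names and the 5.0 g/dL upper bound are hypothetical.
LAB_RANGES = {
    "lab_a": ReferenceRange(low=3.3, high=5.0),
    "lab_b": ReferenceRange(low=3.8, high=5.0),
}


def classify(value: float, lab: str) -> str:
    """Label a result as below, inside, or above the given lab's range."""
    rng = LAB_RANGES[lab]
    if value < rng.low:
        return "below"
    if value > rng.high:
        return "above"
    return "in_range"


# The same hypothetical measurement, interpreted by each laboratory.
result_a = classify(3.5, "lab_a")
result_b = classify(3.5, "lab_b")
print(result_a, result_b)
```

A value of 3.5 g/dL is in range at the first laboratory and below range at the second, which is why a raw value cannot be interpreted without knowing where it was measured.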

These causes of variation confound our ability to directly aggregate across multiple patients and laboratories. Moreover, no common convention exists to normalize laboratory results for aggregation, prediction, or optimization. Do we aggregate the discrete interpretation of inside, above, below the reference range? Do we aggregate based on the standard deviation of a measure (and does the data even have a normal distribution)? Do we just ignore the noise, aggregate the values, and rely on the law of large numbers over a large population of patients? These decisions all require a fairly sophisticated understanding of the inference you are trying to draw from the aggregate data set.
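The three aggregation options raised above can be sketched side by side. This is an illustrative Python sketch, not the project's method: the sample values are invented, each sample carries the lower threshold of the laboratory that produced it, and the z-score variant assumes a roughly normal distribution, which the text notes may not hold.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical samples: (value in g/dL, lower threshold of the reporting lab).
samples = [(3.1, 3.3), (3.5, 3.3), (3.5, 3.8), (4.2, 3.8)]


def aggregate_discrete(samples):
    """Option 1: aggregate each lab's own in-range/out-of-range label."""
    return Counter("below" if value < low else "not_below" for value, low in samples)


def aggregate_zscores(samples):
    """Option 2: express each value in standard deviations from the pooled mean."""
    values = [value for value, _ in samples]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


def aggregate_raw(samples):
    """Option 3: ignore lab differences and average the raw values."""
    return mean(value for value, _ in samples)


print(aggregate_discrete(samples))
print(aggregate_zscores(samples))
print(aggregate_raw(samples))
```

Note that the two 3.5 g/dL samples receive different labels under the first option but identical treatment under the other two, so the choice of method changes what the aggregate can tell you.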

Perhaps the most interesting question when dealing with health data is what a specific measure means for an individual. Clinicians do this for us all the time. For example, many people with naturally high “bad” (LDL) cholesterol have compensating high levels of “good” (HDL) cholesterol. They shouldn’t necessarily be on a statin, yet they can be well above the upper limit that research suggests is a tipping point for increased risk of heart failure. The clinician knows to ignore these values for this patient based on all the other factors; it’s not a straightforward computation for the machine.

Clinicians typically refer to population-level results to guide individual decisions. However, we can also use personal health data to capture a “baseline” so we can compare our health today to our health in the past. Baselines help us answer critical questions about whether we are stable, how we respond to therapies, etc. Personal health baseline measurements also enable much more precise reasoning about the significance of a change when what is normal for us is not normal for everyone else.
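One simple way to reason against a personal baseline is to measure how far a new result sits from a patient's own history, in units of that history's spread. The Python sketch below assumes a short, invented Serum Albumin history and uses a plain mean and sample standard deviation; a real system would need to account for the contextual and laboratory variation discussed elsewhere in this report.

```python
from statistics import mean, stdev


def deviation_from_baseline(history, new_value):
    """Return how many sample standard deviations new_value is from the
    patient's own historical mean. Requires at least two prior results."""
    if len(history) < 2:
        raise ValueError("need at least two historical results for a baseline")
    spread = stdev(history)
    if spread == 0:
        raise ValueError("history has no variation; deviation is undefined")
    return (new_value - mean(history)) / spread


# Hypothetical history for one patient, in g/dL.
history = [4.4, 4.5, 4.6, 4.5, 4.5]

# 4.0 g/dL is above both lower thresholds mentioned earlier (3.3 and 3.8),
# so a population reference range would call it normal.
score = deviation_from_baseline(history, 4.0)
print(round(score, 2))
```

Here the new result would pass a population reference range yet sits many standard deviations below this patient's usual level, which is the kind of change a personal baseline can surface.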

Computing: Standards and Inter-System Exchangeability

Accessing and parsing data can also be a significant challenge. Most electronic medical records are not much better than electronic paper, meaning that data are entered into them for purposes of discoverability (to help all providers understand the patient case), documentation (what happened for legal reasons), regulatory compliance, and billing (documenting care for the payer). This point is essential: data is entered into medical records primarily so that other people can find and review it. It is not entered to enable automated or aggregate analysis. So, while EMRs are a giant leap forward, they are not a panacea for machine learning and suffer from significant garbage-in, garbage-out problems.

One of the hospital systems we talked to still receives all of its laboratory data by fax image and hand transcribes the fax content into its EMR. Laboratory data is typically stored in particular database fields in particular ways, but there is variation in how data is encoded across both technicians and laboratories for the same laboratory test. You will continue to find bugs in the data for weeks or months after first starting to exchange data. These issues with EMR data make precision clinical medicine a greater challenge than more established uses of EMR data, such as population management.

Another partner has spent nearly $20 million in grant money over the past five years building a standardized registry for a disease condition. The project simply makes a standard form available in more than 40 centers so the data can be relayed into an open source registry system (Indivo) that is used to perform analysis across more than 15,000 patients.

The bottom line is that there is no standard, interoperable schema for documenting human health in a digital format, the way there can be for cars in a manufacturing system. Until some agreed-upon methodology for doing that exists, teams working on both intra- and inter-hospital data aggregations will struggle to generate apples-to-apples normalized results.

Context: Critical Metadata for Accurate Interpretation

In addition to the testing variations described above, the values produced by a laboratory blood test are subject to tremendous variation due to contextual factors such as the time of day the blood was drawn, what the patient was eating or drinking at the time, the handling of the specimen, the time between draw and analysis, the specific method of analysis, etc. A high value for a given parameter may or may not have clinical relevance, even if you are using a personal baseline. For example, if you forgot to skip your morning coffee before taking a pre-diabetic blood sugar test, you can get a false positive for high blood sugar.

Like blood tests, many medical data sets will have only limited machine-consumable metadata describing what can be essential context for clinical and research analysis of provided data. One leading researcher at the NIH we spoke with argued that the primary reason he is not interested in patient-provided data is the lack of this critical contextual data.

The importance of context in interpreting data in health is one of the key barriers for those who seek to augment or replace clinicians with analytics. An analytics system is only as good as the input, and today, health care systems do not give us very good inputs. When we examined laboratory results, we saw cases of missing reference ranges; miscoded data; and, for some measures, a great deal of noise. For example, we saw records of “failed pulmonary function test” created for billing purposes that had no specific and actionable information about the patient’s health status. The only way to tell which tests failed was to read the free text notes associated with the test; no single regular expression allowed for automatic filtering of these failed tests.

These challenges arise in personal health data as well. A motion sensor tells us when we have activity, but it doesn’t tell us when inactivity is due to the sensor being located in a purse, on a table, or because we’re actually sitting at a desk or on the couch. If you are using this signal to assess the actual activity level of a single person, it might be insufficient for clinical use. However, the opportunity with personal health data is that we can triangulate across several signals to assess the actual context, and it is often possible to engage the user periodically to fill in critical blanks.

Personal health data, properly managed, can make a powerful contribution to the health care system by augmenting the impoverished context clinicians currently get during patient visits. Evidence shows that patients interact more honestly with machines and that devices assess patient behavior more accurately than subjective self-report. New mobile phone health data frameworks from Apple and Google, along with many independent phone/sensor-based applications, can provide a rich source of contextual clarity. The Health Data Explora‐
