SPRINGER BRIEFS IN PHARMACEUTICAL
SCIENCE & DRUG DEVELOPMENT
Pouria Amirian · Trudie Lang · Francois van Loggerenberg
Editors
Big Data in Healthcare
Extracting Knowledge from Point-of-Care Machines
University of Oxford, Oxford, UK
ISSN 1864-8118 ISSN 1864-8126 (electronic)
SpringerBriefs in Pharmaceutical Science & Drug Development
ISBN 978-3-319-62988-9 ISBN 978-3-319-62990-2 (eBook)
DOI 10.1007/978-3-319-62990-2
Library of Congress Control Number: 2017946047
The Editors retain the copyright.
© The Editors and Authors 2017
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Contents

1 Introduction—Improving Healthcare with Big Data
  Francois van Loggerenberg, Tatiana Vorovchenko and Pouria Amirian
2 Data Science and Analytics
  Pouria Amirian, Francois van Loggerenberg and Trudie Lang
3 Big Data and Big Data Technologies
  Pouria Amirian, Francois van Loggerenberg and Trudie Lang
4 Big Data Analytics for Extracting Disease Surveillance Information: An Untapped Opportunity
  Pouria Amirian, Trudie Lang, Francois van Loggerenberg, Arthur Thomas and Rosanna Peeling
5 #Ebola and Twitter: What Insights Can Global Health Draw from Social Media?
  Tatiana Vorovchenko, Proochista Ariana, Francois van Loggerenberg and Pouria Amirian
Index
Pouria Amirian has a Ph.D. in Geospatial Information Science (GIS) and is a Principal Research Scientist in Data Science and Big Data at the Ordnance Survey GB and a Data Science Research Associate with the Global Health Network. He managed and led a joint project (Oxford and Stanford) on "Using Big Data Analysis Tools to Extract Disease Surveillance Information from Point-of-Care Diagnostic Machines". Pouria has done research and development projects and lectured about Big Data, Data Science, Machine Learning, Spatial Databases, GIS and Spatial Analytics since 2008.
Trudie Lang is Professor of Global Health Research, Head of the Global Health Network, Senior Research Scientist in Tropical Medicine at the Nuffield Department of Medicine and Research Fellow at Green Templeton College at the University of Oxford. She has a Ph.D. from the London School of Hygiene and Tropical Medicine and has worked within the industry, the World Health Organisation (WHO), NGOs and academia conducting clinical research studies in low-resource settings. Dr Lang is a clinical trial research methodologist with specific expertise in capacity development and trial operations in low-resource settings. She currently leads the Global Health Network (GHN), which is a focused network of researchers to help clinical researchers with trial design, methods, interpretation of regulations and general operations.
Francois van Loggerenberg is Scientific Lead of the Global Health Network, based out of the Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine. Originally trained as a research psychologist, from 2002 to 2012 Francois was employed at the Nelson R. Mandela School of Medicine in Durban, South Africa, where he worked initially as the study coordinator on a large HIV pathogenesis study at the Centre for the AIDS Programme of Research in South Africa (CAPRISA). In 2005, he was awarded a Doris Duke Foundation Operations Research for AIDS Care and Treatment in Africa grant that funded his Ph.D. work on enhancing adherence to antiretroviral therapy (2011, London School of Hygiene and Tropical Medicine).
Introduction—Improving Healthcare with Big Data

Francois van Loggerenberg, Tatiana Vorovchenko and Pouria Amirian
With the advancement of computing systems and the availability of new types of sensors, there has been a huge increase in the amount, type and variety of data that are collected and stored [1]. By some estimates, in 2013 over 90% of the world's data had been created in the previous two years [2]. In terms of health data, this has been impacted on by the increased use of Electronic Health Records (EHR), personalized medicine, and administrative data. Although it is difficult to comprehensively and simply characterise what constitutes Big Data, in terms of the data itself several key characteristics have been identified, which create particular opportunities and challenges [3, 4]. These characteristics include the large size (volume) of these datasets, the speed with which these data are generated and collected (velocity), and the diversity of the data generated (variety). Some sources add a fourth 'V', veracity, to highlight the fact that the quality of data collected this way needs to be carefully considered [1]. However, we discuss veracity later in this book and show that it is not a characteristic of the data in Big Data and, more importantly, that Big Data is not just about data [5]. As often used, Big Data also refers to datasets that have been collected for a specific purpose but used in new secondary analyses, the linking of datasets collected for different purposes, or datasets that are generated from routine activity and often collected and stored autonomously and automatically. These characteristics create huge and rapidly expanding datasets that are ripe for linking, and for algorithmic analysis to detect and characterise relationships and patterns that would be very difficult to detect in smaller and individual purpose-collected datasets.

F. van Loggerenberg (✉) · T. Vorovchenko · P. Amirian
University of Oxford, Oxford, UK
e-mail: francois.vanloggerenberg@psych.ox.ac.uk

P. Amirian
e-mail: Pouria.Amirian@os.uk
The use of Big Data in biomedical and health sciences has received a lot of attention in recent years. These data present a significant opportunity for the improvement of the diagnosis, treatment and prevention of various diseases, and to interventions to improve health outcomes [1, 6]. However, this is tied to the obvious risks to privacy and trust of this sensitive information and the exposure of the vulnerability of people requiring interventions or treatments. The Big Data revolution has impacted on the biomedical sciences largely due to the technological advances in genome sequencing, improvements and digitalisation of imaging, the development and growth of vast patient data repositories, the rapid growth in biomedical knowledge, as well as the central role patients are taking in the management of their own health data, including collection of personal activity and health data [3].

Some of the key sources of data for biomedicine and health that have contributed to the volume, velocity, variety and veracity of health related data are [3]:
• Medical Records—Increased digitalisation of electronic health records (EHR); these data are collected for patient care and follow-up, but are key data sources for secondary analysis and combination with other large data sets of longitudinal free text, laboratory and other parameters, imaging, medication records, and a vast array of other key data. When combined with data like genomic data, these represent potential sources of making genotype–phenotype associations at the population level.
• Administrative Data—These data are usually generated for billing or insurance claims, and are not generally available as immediately as EHR data. However, they do have the benefit of usually being coded in a standardised way, and verified with errors corrected, and so represent, usually, higher quality, comparable data.
• Web Search Logs, click streams and interaction-based data—The internet has become an increasingly important source of information for people about their health complaints, especially prior to seeking professional help, and the systematic collection and analysis of these data have yielded insights into syndromic surveillance and potential public health interventions based on concerns. These data have been used to identify epidemic outbreaks [7], and have been useful at highlighting potential issues with pharmaceutical side effects, for example.
• Social Media—As social media continues to evolve, its definition is constantly changing to capture all its features and reflect the role it plays in the modern world. Social media has been described as being "the platforms that enable the interactive web by engaging users to participate in, comment on and create content as means of communicating with their social graph, other users and the public" [8]. Social media continues developing and integrating deeply into human lives, and may serve a variety of purposes such as social interaction, information seeking, time passing, entertainment, relaxation, communicatory utility, expression of opinions, convenience utility, information sharing, and surveillance and watching others [9]. For example, LinkedIn allows its users to build professional connections, Facebook is widely used to connect with friends, Twitter allows public broadcasting of short messages, Instagram is used to share favourite pictures, and YouTube allows the sharing of videos. This area of data collection and analysis has grown rapidly over recent years, as populations have greater access to, and generate more and more, social data. This area also entails blogs, Q&A sites (like Quora), and networking sites, and the data have been used to find things like unreported side effects, for monitoring disease-related beliefs, and to identify or track disasters or disease outbreaks. As one of the projects outlined in this book deals with social media, a bit more will be said about this specific data type.
The number of active social media users has been growing rapidly. As of 2015, it is estimated that nearly 2 billion people globally use social networks (Fig. 1.1). Social media platforms have differing levels of popularity and numbers of active users. As of June 2016, Facebook is the most popular platform with 1590 million users (Fig. 1.2).
Fig. 1.1 Number of social network users worldwide from 2010 to 2014, with projections to 2019 (in billions) [10]

Big Data Analytics is also being used for health and human welfare. One example of this is Google Flu Trends. Millions of users around the world search for health information online. Google estimates how much flu is circulating in different countries around the world using the data of particular search queries on its search engine and complex algorithms [7]. These data correlate with the data from traditional flu surveillance systems [12] (Fig. 1.3). The reporting lag of these predictions is around one day, whereas traditional surveillance systems might take weeks to collect and report the data. Although Google are no longer publishing these data publicly in real time, historical datasets remain available, and newer data are available to academic research groups on request.
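To make the comparison with traditional surveillance concrete, the following minimal sketch (not taken from the original study) computes the correlation between a weekly search-query index and officially reported case counts, the same kind of check used to validate Google Flu Trends against surveillance data. All values here are invented for illustration.

```python
import numpy as np

# Hypothetical weekly values: a search-query index for flu-related terms and
# lab-confirmed influenza cases reported by a traditional surveillance system.
search_index = np.array([12, 15, 22, 35, 58, 71, 66, 48, 30, 18, 14, 11], dtype=float)
reported_cases = np.array([90, 110, 160, 260, 430, 520, 490, 360, 220, 140, 100, 85], dtype=float)

# Pearson correlation between the two series.
corr = np.corrcoef(search_index, reported_cases)[0, 1]
print(f"Correlation between search activity and reported cases: {corr:.2f}")

# Surveillance reports often arrive with a delay; shifting the reported series
# shows whether search activity leads the official counts.
lag = 1  # weeks, hypothetical
corr_lagged = np.corrcoef(search_index[:-lag], reported_cases[lag:])[0, 1]
print(f"Correlation with a {lag}-week reporting lag: {corr_lagged:.2f}")
```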
Fig. 1.2 Leading social networks worldwide as of June 2016, ranked by number of active users (in millions) [11]
Fig. 1.3 Correlation between Google Flu Trends and National Institute of Infectious Diseases for Japan (2004–2009)
Twitter is an online social networking and micro-blogging platform that enables users to send and read short 140-character messages called "tweets". Micro-blogging allows users to exchange small elements of content: short sentences, individual images, or video links [13]. Twitter is currently primarily an online service accessible from computers, tablets and mobile phones. Since its launch in 2006, the population of Twitter users has been constantly growing, and as of June 2016 it has 400 million active users (Fig. 1.2), contributing up to 500 million tweets per day [13]. This is very appealing to Big Data analysts as the data show, in real time (and are also useful for analysing historical events or patterns), what the concerns are of people from all around the world, suggesting potential research areas and public health intervention opportunities in health and human development. This will be explored further in Chap. 5. A recent review identified three key areas in which Twitter has been used in health research: health issues and problems (cancer, dementia, acne, cardiac arrest, and tobacco use), health promotion (like diet, cancer screening, vaccination, diabetes, etc.), and professional communication (evaluative feedback to students in clinical settings, and promoting journal articles and other scientific publications) [13].
The ubiquity of smartphones has gone hand-in-hand with the increase in social media posting, especially of geo-located data, for example in tweets. This has also led to an increase in the number and types of personal monitoring activities which have been exploited by health and other personal monitoring applications [14]. This has led to vast amounts of monitoring data about personal behaviours, positioning, logging diet, medication adherence, blood sugar levels, coffee consumption, sleep quality, psychological or mental states, and health and physical activity indicators being made available from self-monitoring, GPS tracking, and technology like accelerometers, which has been referred to as the quantified self [2]. These applications have been used to create health improvement applications, like smoking cessation or weight loss promotion and support, but in the process are also generating vast and varied datasets of these indicators which could be mined to find potentially useful health data. It is possible that these may be used to identify risk factors, which might be linked back to EHR to identify those requiring intervention or support to prevent the development of illness. At a population level, public health interventions could be targeted at specific geographic groups where issues like obesity, for example, may be identified by these means.
Big Data in Low- and Middle-Income Countries
The significant and rapid advances in using Big Data technologies and Cloud Computing in developed countries have not been matched in Low- and Middle-Income Countries (LMICs), where the pace has been slower, despite the potential for these approaches to improve healthcare delivery and population health.
A review of articles looking specifically at the use of Big Data in healthcare in LMICs summarises some of the key potential benefits as well as the challenges that need to be overcome [15]. In these settings, healthcare is most often delivered in vertical programmes (for HIV, TB, Malaria, etc.), all of which have stringent data requirements which have to be addressed, usually by cadres of community healthcare workers. New ways of collecting data (on smartphones, tablets, or portable computers), and real-time data collection by connecting healthcare devices to the Internet, have made it possible to get around some of the more pressing logistical and technical barriers to electronic data capture, storage and integration. Technological advances in LMICs are often able to leap-frog some of the developmental steps observed in developed countries. For example, mobile phone penetration in LMICs, especially Sub-Saharan Africa, is often very good and positively associated with other good development indices, as fixed line installations were often lacking and mobile phone technology was able to be rolled out more efficiently and more easily as there was no existing infrastructure or technology to compete with [16]. This means that there have been rapid and unexpected advances in access to technologies that have sometimes taken longer to be adopted in more developed countries.
The benefit to LMICs of good uses of Big Data analytics would be to ensure good healthcare delivery, identification of risk factors for disease, and rapid identification of individuals who might benefit from early prevention or intervention efforts. This is particularly true given that currently there may be poor service delivery, poor governance, and poor data coordination, meaning that modest improvements in these could reap significant benefits by ensuring that limited resources are used constructively [15]. Currently, health systems in these regions are driven largely by focussing on individual diseases, and the integrative nature of Big Data may help to move to a more integrated, horizontal, approach to the research into and prevention and treatment of diseases and the causes of poor health. Provision of essentials like clean water, food and good sanitation remain pressing problems, but Big Data analytics could be as useful in supporting human development as they could be in improving health, and the infrastructure and skills put in place could be leveraged. Certainly, good health and good development are mutually supportive and highly related.
For the potential benefits to be properly realised, it is important that the current generally poor governance of global health be addressed, to ensure that the properly informed, considered and adequately resourced collection of data receives proper oversight and stewardship [15]. In 2009, the Global Pulse initiative was established by the United Nations (http://www.unglobalpulse.org) in order "to accelerate discovery, development and scaled adoption of Big Data innovation for sustainable development and humanitarian action" [17]. This project has also focussed on some health-based applications. These include projects that had a strong health-based focus, and many more that related to the overlapping concerns of development and welfare, with some key examples here [18]:
• Monitoring of the implementation of mother-to-child prevention of HIV in Uganda, using real-time indicator data from health centres across the country to populate an online dashboard. This data collection and sharing allowed for the identification of bottlenecks in the rollout of the Option B+ treatment (where expectant mothers are offered HIV treatment irrespective of the CD4 T-cell count), and to reveal correlations such as the relationship between stock-outs and drop-outs from the programme.
• Data visualisation with interactive maps to support disease outbreak responses in Uganda, capturing free-text location data for disease reports, and automatic techniques to convert these to geo-referenced positions, in combination with map overlays of existing geographical and other data to create interactive visualisations in an online dashboard.
• Using social media to understand public perceptions of immunisation in Indonesia, using a database of over 88,000 Bahasa Indonesia language tweets from between January 2012 and December 2013. Content analysis and filters were used to determine relevant tweets (a minimal sketch of this kind of keyword filtering is given after this list), and this project revealed how social media was being used to share information relevant to immunisation, analysable in real time. Especially useful was the identification of a core of influencers on Twitter that could be leveraged to provide rapid response communication if needed, and if provided with relevant and accurate messages to disseminate.
• Understanding awareness of immunisation, and the sentiment towards this, by using social media and news content analysis in India, Kenya, Nigeria and Pakistan, using data from Twitter and Facebook along with traditional media. Spikes in content were linked to key events (like the attacks on polio workers and campaigners in Pakistan, for example). Network analysis and demographic data on users were used to identify key influencers in the networks. This work led to a better understanding of the utility of social media monitoring to gain deeper understanding of public sentiment regarding immunisation.
• Using social media to analyse attitudes towards contraception and teenage pregnancy in Uganda, by extracting data from Facebook pages and UNICEF's U-report platform between 2009 and 2014. Facebook data were anonymised and filtered to identify messages relating to contraception and family planning. An interactive dashboard was developed and is publicly accessible (http://familyplanning.unglobalpulse.net/uganda/). This platform provided for the real-time extraction of data on changing sentiments around family planning and contraception, which would impact on any public health programme or intervention addressing these concerns.
• Analysing public perceptions towards sanitation by analysing social media content, using filtered Twitter data analysed on a social media data analytics platform. Overall trends in data volume, influencers, and key hashtags were reported. This study showed how, by monitoring baseline indicators over time, the changing social media discussion around sanitation could be tracked, making it possible to evaluate the reach and effectiveness of educational campaigns, especially public engagement with these campaigns.
Trang 15• Using data to analyse seasonal mobility patterns in Senegal where anonymisedmobile telephone data were used to indicate the position of people, and theirmovement, in order to show differences in mobility patterns over the variousseasons Movements were characterised both daily, as well as over the period of
a month Understanding where people were, and their migration patterns overthe seasons, is potentially extremely useful for health surveillance and outbreakassessments, as well as for resource and response planning
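As referenced in the Indonesian immunisation example above, a first step in such projects is filtering a large tweet collection down to relevant messages. The following is a minimal, hypothetical sketch of keyword-based filtering and monthly counting; the keyword list, field layout and sample records are invented for illustration, and real projects add language-specific processing and more sophisticated relevance models.

```python
from collections import Counter

# Hypothetical tweet records: (ISO date, text). Real data would come from the Twitter API.
tweets = [
    ("2013-01-12", "Jadwal imunisasi campak di puskesmas minggu depan"),
    ("2013-01-20", "Harga cabai naik lagi di pasar"),
    ("2013-02-03", "Anak saya demam setelah vaksin, normal kah?"),
]

# Illustrative keyword list for immunisation-related content (not the project's actual filter).
KEYWORDS = {"imunisasi", "vaksin", "vaksinasi"}

def is_relevant(text: str) -> bool:
    """Flag a tweet as immunisation-related if it contains any keyword."""
    words = text.lower().split()
    return any(word.strip(",.?!") in KEYWORDS for word in words)

relevant = [(date, text) for date, text in tweets if is_relevant(text)]

# Count relevant tweets per month to track the volume of the conversation over time.
per_month = Counter(date[:7] for date, _ in relevant)
print(per_month)  # e.g. Counter({'2013-01': 1, '2013-02': 1})
```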
Another novel proposal is to use new online sources of data, like social media, and combine these with epidemiological and environmental data to create real-time and constantly updating disease distribution maps that are more relevant and nuanced than the traditional, static maps [19]. This is considered key, as the accurate and up-to-date understanding of disease distribution, especially in LMICs, is central to effective, targeted and appropriate interventions to prevent, treat, and manage diseases and vectors, and to understanding the global burden of disease that currently drives much of the investment in and deployment of public health initiatives. Two projects described in this volume (Chaps. 4 and 5) give further examples of how Big Data analytics may be utilized and deployed (Chap. 4) in support of health and human welfare concerns, and more detail can be found in those chapters.
1.3.1 Analytical Challenges
Traditionally, health data for research purposes have been collected in ways that serve the statistical analysis approaches used [2]. This means that data were studied from samples of defined populations, extrapolated to the whole population of interest based on very clearly defined sampling strategies. In addition to this, very clearly defined and operationalised measures were collected, and the data were rigorously and continuously monitored for quality and accuracy. These data were then carefully 'cleaned' prior to any analysis. In part due to all of these protections and procedures, the cost and complexity of running big trials has grown rapidly, especially in LMICs, and there have been calls for more pragmatic approaches [20].

Big Data may ameliorate some of the impact of the high expense and complexity of these trials' data quality procedures, as a clearly defined focus and data quality are traded for quantity and variety. However, the volume, velocity and variety of these data create some potential pitfalls for their analysis [3]:
• The development of algorithms for selecting patients whose EHR or administrative data are to be used is very problematic. Often this requires the analysis of data of several different types, analogous to clinical diagnosis by a health professional, and errors in developing algorithms can lead to erroneous conclusions. Complex, iterative approaches based on using clinical judgements from practitioners have been suggested as potential solutions to this issue.
• Importantly, the data used in these analyses are observational and usually not collected under controlled experimental or randomised conditions, and so there exists a real worry that observations are susceptible to biases and confounding. Usually the identification of potential confounders to control for in analyses is done by researchers and experts, which means that this is not likely to happen in many of the automated algorithms used in Big Data applications. This makes interpretation of results more difficult.
• As the datasets grow both in volume and variety, the analysis techniques used to find meaningful associations and patterns become complicated by the increase in the likelihood of chance findings appearing significant. Unless this is taken into account (for example, by correcting for multiple comparisons, as in the sketch below), the number of false positive associations is uncontrolled, leading to spurious conclusions based on chance associations.
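As a minimal illustration of the multiple-comparisons point above, the sketch below applies a Bonferroni correction to p-values from many hypothetical association tests; the p-values are invented, and real Big Data analyses would typically prefer more powerful procedures such as false discovery rate control.

```python
# Hypothetical p-values from testing many candidate associations in a large dataset.
p_values = [0.001, 0.012, 0.030, 0.049, 0.200, 0.470, 0.650, 0.810]
alpha = 0.05

# Naive thresholding: every p-value below alpha looks "significant".
naive_hits = [p for p in p_values if p < alpha]

# Bonferroni correction: divide the threshold by the number of tests performed.
bonferroni_threshold = alpha / len(p_values)
corrected_hits = [p for p in p_values if p < bonferroni_threshold]

print(f"Uncorrected: {len(naive_hits)} 'significant' results")    # 4
print(f"Bonferroni-corrected: {len(corrected_hits)} results")      # 1
```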
1.3.2 Ethical Challenges
Big Data approaches to biomedical and behavioural data are likely to yield significant insights and advances for global and public health. However, this comes with some key ethical challenges that need to be identified and addressed. A recent comprehensive review of ethical issues in Big Data, which reviewed information from 68 studies, highlights some key challenges, and also suggests some additional challenges that are not yet clearly outlined in the existing literature [6]. Although the technology for producing, processing and sharing data means that large datasets are easily available, and that linking and sharing can be quite ubiquitous, this is not without significant issues. The example is given of Facebook's Beacon software, released in 2007, and developed to automatically link external online purchases with Facebook profiles. Intended to improve the level of personal advertising, what this service inadvertently did was to expose sensitive private characteristics such as, for example, sexual orientation, or information about items that had been purchased as gifts. The service was terminated after being the focus of litigation [21]. Using readily available Big Data can create unanticipated consequences and ethical issues, and associations with high-profile media stories about the risks of data sharing and access may run the risk of impacting on well-thought-out, health-related uses of similar data.
From the literature reviewed, five key areas of consideration around ethical issues were identified, and these will be briefly outlined [6].
1.3.2.1 Informed Consent
Traditionally, informed consent for data to be used in research related to clear and unambiguous consent for the collection of specific data for the use in specific, or at least clearly related, research studies. This is not suitable for Big Data applications where vast amounts of novel and routine data are collected, often with the express purpose of creatively identifying surprising or novel associations in the vast and interconnected datasets. This means that the very concept of informed consent may be difficult to apply to Big Data research. The certainty and singular approach of traditional consent must be adapted to work in Big Data research. A clear tension exists between being able to utilise data for Big Data analysis, and the inability in most cases to get explicit informed consent for every possible future use of these data. This is particularly salient as the data used for these analyses are often collected routinely, in huge amounts, and used in analyses that could not have been envisaged when the data were collected. Although it is well beyond the scope of this chapter to resolve this issue, it is a key consideration for the remainder of this book. There are many ways this may be addressed, either pragmatically (by considering the data sources as having an altruistic interest in their data being used for the public good, or the decision about the use of data being made by identified, impartial third parties) or substantively (for example, by requiring participants to opt out of data sharing). These issues are not simply resolved by 'de-identifying' data, as illustrated in relation to privacy concerns.
1.3.2.2 Privacy
More routine and personal data are being collected automatically and anonymously. This is often done with little awareness, on the part of the people on whom these data are being collected, of the extent and scope of information that is reasonably easily available for scraping and using. This is a key characteristic of the Big Data age [6], in contrast with research data historically, which tended to focus on discrete and obvious measurements. Privacy issues in Big Data are frequently concerned with confidentiality, and the ease with which linking data sets can reveal associations, and potentially identity. This means that simple anonymization of data is very difficult to attain and impossible to assure. Additionally, it is clear that harm can occur not only at the individual but also at a population level from data collected, through stigmatisation or discrimination. To assume that anonymization and use of data at the population level is an acceptable way to avoid requiring consent is problematic. What is key for things like social media analysis is the concern that just because things occur in public does not mean that they should be viewed as freely accessible. Nor should it be assumed that the individuals on whom the data have been collected are able to understand how easily and widely these data can be accessed and used. The fact that data are now being stored for much longer is a related concern. The length of time for which these personal data are being kept increases the potential risk that data privacy may be violated. A real tension exists between overly restrictive regulations and procedures that could prevent useful and helpful research, and too open access and sharing of data that may too easily be used to discriminate or cause other harms.
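To make the linkage concern above concrete, the minimal sketch below joins two hypothetical, individually 'anonymised' tables on quasi-identifiers (postcode, birth year, sex) and recovers a probable identity. All records are invented, and real re-identification studies involve much larger data and more careful matching.

```python
# A hypothetical de-identified health dataset: no names, but quasi-identifiers remain.
health_records = [
    {"postcode": "OX3 7LF", "birth_year": 1978, "sex": "F", "diagnosis": "type 2 diabetes"},
    {"postcode": "OX1 2JD", "birth_year": 1990, "sex": "M", "diagnosis": "asthma"},
]

# A hypothetical public dataset (e.g. an electoral-roll extract) with names attached.
public_records = [
    {"name": "A. Example", "postcode": "OX3 7LF", "birth_year": 1978, "sex": "F"},
    {"name": "B. Sample", "postcode": "OX4 1AA", "birth_year": 1985, "sex": "M"},
]

# Linking on the shared quasi-identifiers can re-attach identities to health data.
for h in health_records:
    for p in public_records:
        if (h["postcode"], h["birth_year"], h["sex"]) == (p["postcode"], p["birth_year"], p["sex"]):
            print(f"{p['name']} is probably the person with {h['diagnosis']}")
```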
1.3.2.3 Ownership
Data ownership can already be a complex issue, but when large and interrelated datasets are shared, and the collection, analysis and publication of results of data amalgamation happen in a shared space, the issue of who controls the data becomes even more difficult. How data are redistributed, and who can make changes or conduct analysis, can be complex issues in Big Data analysis that need to be resolved. It is very difficult for individuals to control what is done with their data, and it is widely accepted that there need to be some controls on, for example, third parties being able to benefit commercially from data that have been collected outside of the agency or knowledge of those providing the data. Open access to your own data has also been widely discussed as important. However, this does have risks, as direct access to raw data could lead to misconceptions, misunderstandings or flawed analyses that create incorrect conclusions if the data are handled and analysed by individuals who lack the knowledge or skills to do this rigorously, or to interpret the results correctly.
1.3.2.4 Epistemology and Objectivity
Although expert knowledge and skills have always meant that understanding data and its outputs is difficult, as the vastness and complexity of datasets increase, the means of analysing these data have also changed. This usually also means that it is necessary to apply machine learning analysis techniques [2]. This inverts the usual approach in science; from specific, hypothesis-driven statistical tests, we have moved into the arena of complex machine-learning algorithms which process vast datasets to create analyses and conclusions that may be well beyond the understanding of those who are processing the data. It is key that these findings are viewed as new hypotheses about empirical relationships rather than clear predictions about behaviour or outcomes [2].
A related issue is that of objectivity. Because the outputs of Big Data come from so many varied and vast datasets, there is a tendency to assume that they are 'objective'. However, as with all analyses, the methods, the questions asked, and the analysis decisions made are all driven by positions and decisions that mean that there is a great deal of subjective influence over both the data, as well as the outputs and analyses. Since much of these data come from routinely collected sources, often for other purposes, and the collection and analyses may become routine and automated, the lack of quality and consistency checks may lead to the dangerous position of not questioning the validity of conclusions made on the basis of varied and non-quality-checked data. For example, a review of EHR has estimated that although analyses may highlight key issues in patient care, from 4.3 to 86% of data are missing, incomplete, or inaccurate [22].
1.3.2.5 Big Data 'Divides'
The collection and analysis of Big Data places new and considerable technical and resource demands on organisations, meaning that the number of these which are equipped and able to deal with these challenges is limited. This is particularly salient when considering whether or not individuals will have the means or rights to access their own data, or have a say about how this is used by a few large data organisations. Those who simply choose to opt out of personal data collection, by definition, become invisible and un-represented in the datasets, which can create another key Big Data divide. In addition to this, communities that are not able to implement EHR will also not benefit from any insights that could be generated through the analysis of the data collected on them.
As with many new and promising technologies or methods, the risk exists of viewing Big Data as overly beneficial and applicable to all areas of science and human behaviour. The risk that the size and variety of the data included in Big Data analyses leads to a sense that these analyses are all 'objective' and value-free, or most likely to discover 'truths', needs to be taken into account. This very brief outline of some of the opportunities and challenges of Big Data, especially the ethical issues, outlines some of the key concerns that need to be addressed or investigated as this area of research develops. It is envisaged that as common standards become established, and the numerous technical, analysis and ethical challenges are addressed, Big Data in health should contribute significantly to a more personalised approach to medicine, and smarter, adaptive, health strategies [1]. This would be the second wave of Big Data.
This short book is organized as follows. Chapter 2 describes the concept of data science and analytics; some good examples of using data science methods are also described briefly. Chapter 3 explains the elements of Big Data and illustrates five components of Big Data. Chapter 4 describes a real-world implementation of a Big Data analytics system; the chapter describes many real-world challenges and solutions in LMICs, and also illustrates the benefits of the approach for patients, healthcare settings, healthcare authorities, as well as companies that manufacture healthcare devices (especially point-of-care devices). Finally, Chap. 5 describes a case of social media data mining during the Ebola outbreak and presents the valuable insights that can be extracted from social media.
References

1. Koutkias, V., Thiessard, F.: Big data—smart health strategies. Findings from the yearbook 2014 special theme. Yearb Med Inform 9, 48–51 (2014)
2. Hansen, M.M., Miron-Shatz, T., Lau, A.Y., et al.: Big data in science and healthcare: a review of recent literature and perspectives. Contribution of the IMIA Social Media Working Group. Yearb Med Inform 9, 21–26 (2014)
3. Peek, N., Holmes, J.H., Sun, J.: Technical challenges for big data in biomedicine and health: data sources, infrastructure, and analytics. Yearb Med Inform 9, 42–47 (2014)
4. Raghupathi, W., Raghupathi, V.: Big data analytics in healthcare: promise and potential. Health Inf Sci Syst 2, 3 (2014)
5. Amirian, P., Lang, T., Van Loggerenberg, F.: Geospatial big data for finding useful insights from machine data. GIS Research UK (2014)
6. Mittelstadt, B.D., Floridi, L.: The ethics of big data: current and foreseeable issues in biomedical contexts. Sci Eng Ethics 22(2), 303–341 (2016)
7. Ginsberg, J., Mohebbi, M.H., Patel, R.S., et al.: Detecting influenza epidemics using search engine query data. Nature 457(7232), 1012–1014 (2009)
8. Cohen, H.: Social media definitions. http://heidicohen.com/social-media-definition/ (2011)
9. Whiting, A., Williams, D.: Why people use social media: a uses and gratifications approach. Qual Market Res Int J 16(4), 362–369 (2013)
10. Statista: Number of worldwide social network users 2010–2018. http://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/ (2016)
11. Statista: Leading social networks worldwide as of April 2016, ranked by number of active users. http://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/ (2016)
12. Google: Google Flu Trends. https://www.google.org/flutrends/about/ (2016)
13. Finfgeld-Connett, D.: Twitter and health science research. West J Nurs Res 37(10), 1269–
17. UN: United Nations Global Pulse. http://www.unglobalpulse.org/. United Nations (2016)
18. UN: United Nations Global Pulse Projects. http://www.unglobalpulse.org/projects. United Nations (2016)
19. Hay, S.I., George, D.B., Moyes, C.L., et al.: Big data opportunities for global infectious disease surveillance. PLoS Med 10(4), e1001413 (2013)
20. Lang, T., Siribaddana, S.: Clinical trials have gone global: is this a good thing? PLoS Med 9(6), e1001228 (2012)
21. Welsh, K., Cruz, L.: The danger of big data: social media as computational social science. First Monday 17(7), 1 (2012)
22. Balas, E.A., Vernon, M., Magrabi, F., et al.: Big data clinical research: validity, ethics, and regulation. Stud Health Technol Inform 216, 448–452 (2015)
Data Science and Analytics
Pouria Amirian, Francois van Loggerenberg and Trudie Lang
Thanks to the advancement of sensing, computation and communication technologies, data are generated and collected at unprecedented scale and speed. Virtually every aspect of many businesses is now open to data collection: operations, manufacturing, supply chain management, customer behavior, marketing, workflow procedures and so on. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data and in data-driven decision making. Data Science is the science and art of using computational methods to identify and discover influential patterns in data. The goal of Data Science is to gain insight from data and often to affect decisions to make them more reliable [1]. Data is necessarily a measure of historic information, so, by definition, Data Science examines historic data. However, the data in Data Science can have been collected a few years or a few milliseconds ago, continuously or in a one-off process. Therefore, a Data Science procedure can be based on real-time or near real-time data collection.
The term Data Science arose in large part due to the advancements in computational methods, especially new or improved methods in machine learning, artificial intelligence and pattern recognition. In addition, due to the increase in computational capacity through cloud computing and distributed computational models, the use of data for extracting useful information, even in large volumes, is more affordable. Nevertheless, the ideas behind Data Science are not new at all but have been represented by different terms throughout the decades, including data mining, data analysis, pattern recognition, statistical learning, knowledge discovery and cybernetics.

P. Amirian (✉) · F. van Loggerenberg · T. Lang
University of Oxford, Oxford, UK
e-mail: Pouria.Amirian@ndm.ox.ac.uk; Pouria.Amirian@os.uk

F. van Loggerenberg
e-mail: francois.vanloggerenberg@psych.ox.ac.uk

T. Lang
e-mail: trudie.lang@ndm.ox.ac.uk
As a recent phenomenon, the rise of Data Science is pragmatic. Virtually every aspect of many organizations is now open to data collection and often even instrumented for data collection. At the same time, information is now widely available on external events such as trends, news, and movements. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data (Data Science) and data-driven decision making [2]. With the availability of relevant data and technologies, decision-making procedures which previously were based on experience, guesswork or on constrained models of reality can now be made based on the data and data products. In other words, as organizations collect more data and begin to summarize and analyze it, there is a natural progression toward using the data to scientifically improve approximations, estimates, forecasts, decisions, and ultimately, efficiency and productivity.
Data Science is the process of discovering interesting and meaningful patterns in data using computational analytics methods. Analytical methods in Data Science are drawn from several related disciplines, some of which have been used to discover patterns and trends in data for more than 100 years, including statistics. Figure 2.1 shows some of the disciplines related to Data Science.

Fig. 2.1 Methods in Data Science are drawn from many disciplines

The fact that most methods are data-driven is the most important characteristic of methods in Data Science. They try to find hidden and hopefully useful patterns which are not based on assumptions made by the data collection procedures or made by the analysts. In other words, methods in Data Science are data-driven, and mostly explore hidden patterns in data rather than confirm hypotheses which are set by data analysts. The data-driven algorithms induce models from the data. In modern methods in Data Science, the induction process can include identification of the variables to be included in the model, the parameters that define the model, the weights or coefficients in the model, or the model complexity.
Despite the large number of specific Data Science methods developed over the years, there are only a handful of fundamentally different types of analytical tasks these methods address. In general, there are a few types of analytical tasks in Data Science, which can be classified as supervised or unsupervised learning.

Supervised learning involves building a model for predicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy. With unsupervised learning, there are inputs but no supervising output; nevertheless, we can learn relationships and structure from such data [3]. The following sections first introduce the concepts of supervised and unsupervised learning in more depth, and then give a brief description of the major analytical tasks in Data Science.

Algorithms or methods in Data Science try to learn from data. Most of the time, data need to be in a certain shape or structure in order to be used in a Data Science method. Mathematically speaking, data usually need to be in the form of a matrix. Rows (records) in the matrix represent data points or observations, and columns represent values for the various attributes of an observation. In many Data Science problems, the number of rows is higher than the number of attributes. However, it is quite common to see a higher number of attributes in problems like gene sequencing and sentiment analysis. In some problems, one attribute is called the target variable, since Data Science methods try to find a function for estimation of the target variable based on the other variables in the data. The target variable can also be called the response, dependent variable, label, output or outcome. In this case the other attributes in the data are called independent variables, predictors, features or inputs [4].
Algorithms for Data Science are often divided into two groups: supervised learning methods and unsupervised learning methods. Suppose a dataset is collected in a controlled trial. Data in this dataset consist of attributes like id, age, sex, BMI, lifestyle, years of education, income, number of children, and response to drug. Consider two similar questions one might ask about the health condition of a sample of patients. The first is: "Do the patients naturally fall into different groups?" Here no specific purpose or target has been specified for the grouping. When there is no such target, the data science problem is referred to as unsupervised learning. Contrast this with a slightly different question: "Can we find groups of patients who have particularly high likelihoods of positive response for a certain drug?" Here there is a specific target defined: will a newly admitted patient (who did not take part in the trial) respond to a certain drug? In this case, segmentation is being done for a specific reason: to take action based on the likelihood of response to the drug. In other words, response to the drug is the target variable in this problem, and a specific Data Science task tries to find the attributes which have an impact on the target variable and, more importantly, their importance in predicting the target value. This is called a supervised learning problem.

In supervised learning problems, the supervisor is the target variable, and the goal is to predict the target variable from the other attributes in the data. The target variable is chosen to represent the answer to a question an analyst or an organization would like to answer. In order to build a supervised learning model, the dataset needs to contain both target variables as well as other attributes. After the model is created based on existing data, the model can be used for predicting a target value for a dataset without target variables. That is why supervised learning is also sometimes called predictive modeling. The primary predictive modeling algorithms are classification for categorical target variables (like yes/no) and regression for continuous target variables (numeric values). Examples of target variables include whether a patient responded to a certain drug (yes/no), the amount of a treatment (120, 250 mg, etc.), whether a tumor size increased in 6 months (yes/no) and the probability of an increase in tumor size (0–100%).
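The following minimal sketch, using scikit-learn and invented trial data, illustrates the supervised-learning workflow just described: fit a model on records that already have the target variable, then predict the target for a new patient who was not in the trial. The attribute values, and the choice of a decision tree, are assumptions made for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical trial data: [age, BMI, years of education] and the observed target.
X_train = [[54, 27.1, 12], [61, 31.4, 9], [47, 24.8, 16], [39, 22.5, 14], [66, 29.9, 10]]
y_train = ["responded", "did not respond", "responded", "responded", "did not respond"]

# Fit a model on data where the target variable is known (the "supervisor").
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Predict the target for a newly admitted patient who did not take part in the trial.
new_patient = [[50, 26.0, 13]]
print(model.predict(new_patient))  # e.g. ['responded']
```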
In unsupervised learning, the model has no target variable. The inputs are analyzed and grouped or clustered based on the proximity or similarity of the input values to one another. Each group or cluster is given a label to indicate which group a record belongs to.
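As a minimal sketch of this kind of unsupervised grouping, the example below clusters invented patient attributes with k-means from scikit-learn; the number of clusters and the data are assumptions for illustration, and the resulting cluster labels still require human interpretation.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical patient attributes: [age, BMI]; note there is no target variable here.
X = [[54, 27.1], [61, 31.4], [47, 24.8], [39, 22.5], [66, 29.9], [42, 23.9]]

# Scale attributes so that age and BMI contribute comparably to the distance measure.
X_scaled = StandardScaler().fit_transform(X)

# Group the patients into two clusters based on similarity of their attribute values.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)  # e.g. [1 1 0 0 1 0]: each record gets a cluster label
```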
In addition to the typical statistical analysis tasks (like causal modelling), in the context of healthcare there are several analytical tasks from a Data Science point of view. These analytical tasks can be categorized as regression, classification, clustering, similarity matching (recommender systems), profiling, simulation and content analysis.
Regression tries to estimate or predict a target value for numerical variables. An example regression question would be: "How much will a given customer use the health insurance service?" The target variable to be predicted here is health insurance service usage, and a model could be generated by looking at other, similar individuals in the population (from a health condition and records point of view). A regression procedure produces a model that, given a set of inputs, estimates the value of the particular variable specific to that individual.
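A minimal regression sketch with scikit-learn follows; the inputs (age, number of chronic conditions) and the target (annual insurance usage) are invented for illustration, and a real model would use many more attributes and records.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical individuals: [age, number of chronic conditions] and their
# annual health insurance usage (the numerical target variable).
X_train = [[34, 0], [52, 1], [61, 2], [45, 1], [70, 3], [29, 0]]
y_train = [420.0, 980.0, 1850.0, 1100.0, 2600.0, 310.0]

# Fit a linear model relating the inputs to the numerical target.
model = LinearRegression().fit(X_train, y_train)

# Estimate expected usage for a new individual from their attributes.
print(model.predict([[58, 2]]))  # estimated usage for a 58-year-old with two conditions
```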
While regression algorithms are used to predict target variables with numerical outcomes, classification algorithms are utilized for predicting target variables with finite categories (classes). Classification and class probability estimation attempt to predict, for each individual in a population, which of a set of classes the individual belongs to. Usually the classes are mutually exclusive. An example classification question would be: "Among all the participants in a particular trial, which are likely to respond to a given drug?" In this example the two classes could be called "will respond" (or positive) and "will not respond" (or negative). For a classification task, the Data Science procedure produces a model that, given a new individual, determines which class that individual belongs to. A closely related task is scoring or class probability estimation. A scoring model applies to an individual and produces a score representing the probability that the individual belongs to each class. In the trial, a scoring model would be able to evaluate each individual participant and produce a score of how likely each is to respond to the drug. Both regression and classification algorithms are used for solving supervised learning problems, meaning that the data need to have target variables before the model building process begins. Regression is to some extent similar to classification, but the two are different. Informally, classification predicts whether something will happen, whereas regression predicts how much something will happen. Classification and regression compose the core of predictive analytics. Much work is now focusing on predictive analytics, especially in clinical settings, attempting to optimize health and financial outcomes [5].
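To illustrate scoring (class probability estimation) as distinct from a hard class prediction, the sketch below fits a logistic regression on the same kind of invented trial data and produces, for each new individual, a probability of belonging to the "will respond" class; all values are hypothetical.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical trial participants: [age, BMI] and whether they responded (1) or not (0).
X_train = [[54, 27.1], [61, 31.4], [47, 24.8], [39, 22.5], [66, 29.9], [42, 23.9]]
y_train = [1, 0, 1, 1, 0, 1]

model = LogisticRegression().fit(X_train, y_train)

# Scoring: a probability for each class, rather than only a hard yes/no prediction.
candidates = [[50, 26.0], [68, 32.0]]
scores = model.predict_proba(candidates)[:, 1]  # probability of the "responded" class
for patient, score in zip(candidates, scores):
    print(f"Patient {patient}: probability of responding = {score:.2f}")
```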
Clustering uses unsupervised learning to group data into distinct clusters or segments. In other words, clustering tries to find natural groupings in the data. An example clustering question would be: "Do the patients form natural groups or segments?" Clustering is useful in preliminary domain exploration to see which natural groups exist, because these groups in turn may suggest other Data Science tasks or approaches. A major difference between clustering and classification problems is that the outcome of clustering is unknown beforehand and needs human interpretation and further processing. In contrast, the outcome of classification for an observation is a membership, or probability of membership, in a certain class.

The fourth type of analytical task in Data Science is similarity matching. Similarity matching attempts to identify similar individuals based on available data. Similarity matching can be used directly to find similar entities based on criteria. For example, a health insurance company is interested in finding similar individuals in order to offer them the most efficient insurance policies. They use similarity matching based on data describing health characteristics of the individuals. Similarity matching is the basis for one of the most popular methods for creating recommendation engines or recommender systems. Recommendation engines have been used extensively by online retailers like Amazon.com to recommend products based on users' preferences and historical behavior (browsing behavior and past purchases). The same concepts and techniques can be used for recommending or improving healthcare services to patients. In this case, there are two broad approaches for the implementation of recommender systems. Collaborative filtering makes recommendations based on similarities between patients or the services (like treatments) they used. The second class of recommendation engines makes recommendations by analyzing the content of data related to each patient. In this case, text analytics or natural language processing techniques can be used on the electronic health reports/records of the patients after each visit to the hospital. Similar content types are grouped together automatically, and this can form the basis of recommendations of new treatments to new, similar patients.

Profiling (also known as behavior description) tries to characterize the typical behavior of an individual, group, or population. An example profiling question would be: "What is the typical health insurance usage of this patient segment (group)?" Behavior may not have a simple description. Behavior can be assigned generally over an entire population, or down to the level of small groups or even individuals. Profiling is often used to establish behavioral norms for anomaly detection applications such as fraud detection. For example, if we know what kind of medicine a patient typically has on his/her prescriptions, we can determine whether a new medicine on a new prescription fits that profile or not. We can use the degree of mismatch as a suspicion score and issue an alarm if it is too high. Profiling can also help address the challenge of health care hotspotting, which is finding people who use an excessive amount of health care resources.
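A minimal sketch of the prescription-profile idea above follows: a patient's historical prescriptions define a profile, and a new prescription that does not fit the profile receives a higher suspicion score. The medication names, threshold and scoring rule are invented for illustration; real anomaly detection systems use far richer profiles and statistical models.

```python
from collections import Counter

# Hypothetical prescription history for one patient (the patient's "profile").
history = ["metformin", "metformin", "lisinopril", "metformin", "lisinopril", "atorvastatin"]
profile = Counter(history)

def suspicion_score(new_medicine: str, profile: Counter) -> float:
    """Score 0.0 if the medicine is common in the profile, up to 1.0 if never seen."""
    total = sum(profile.values())
    return 1.0 - profile.get(new_medicine, 0) / total

ALARM_THRESHOLD = 0.9  # hypothetical cut-off

for medicine in ["metformin", "oxycodone"]:
    score = suspicion_score(medicine, profile)
    flag = "ALARM" if score > ALARM_THRESHOLD else "ok"
    print(f"{medicine}: suspicion {score:.2f} -> {flag}")
```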
Simulation techniques are widely used across many domains to model and optimize processes in the real world. Engineers have long used mathematical techniques to simulate, for example, evacuation planning of large buildings. Simulation saves engineering firms millions of dollars in research and development costs, since they no longer have to do all their testing with real physical models. In addition, simulation offers the opportunity to test many more scenarios by simply adjusting variables in their computer models. In healthcare, simulation can be used in a wide variety of applications, from modelling disease spread to optimizing wait times in healthcare settings.
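As a minimal example of simulation in a healthcare context, the sketch below steps a simple SIR (susceptible-infected-recovered) model of disease spread forward in time; the population size, transmission and recovery rates are invented, and real epidemiological simulations are considerably more detailed.

```python
# Simple discrete-time SIR model of disease spread (illustrative parameters only).
population = 10_000
susceptible, infected, recovered = population - 10, 10, 0
beta, gamma = 0.3, 0.1  # hypothetical transmission and recovery rates per day

for day in range(1, 121):
    new_infections = beta * susceptible * infected / population
    new_recoveries = gamma * infected
    susceptible -= new_infections
    infected += new_infections - new_recoveries
    recovered += new_recoveries
    if day % 30 == 0:
        print(f"Day {day}: infected = {infected:.0f}, recovered = {recovered:.0f}")
```

Adjusting variables such as the transmission rate lets many scenarios be tested cheaply, which is the point made in the paragraph above.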
Content analysis is used to extract useful information from unstructured data such as text files, images, and videos. In this context, text analytics or text mining uses statistical and linguistic analysis to understand the meaning of text, to summarize a long text, or to extract the sentiment of feedback (like online reviews for a healthcare service or center). In all these practical applications, simple keyword searching is too primitive and inefficient. For example, to detect an outbreak of a disease (like flu) from real-time feeds of a social media platform like Twitter with a simple keyword search, it is necessary to collect and store all relevant keywords about the disease (like symptoms, treatments, etc.) and their importance. This is a manual and laborious process. Even with all relevant keywords, simple keyword search cannot offer any useful information, since those keywords can be used in other contexts. In contrast to simple keyword search, techniques in text analytics and natural language processing can be used to filter out irrelevant content and infer the meaning of a group of words based on context. Machine learning, signal processing and computer vision also offer several tools for analyzing images and videos through pattern recognition. Through pattern recognition, known targets or patterns can be identified to aid the analysis of medical images.
Business Intelligence and Data Mining
In general, Data Science, analytics and even data mining are the same. Data Mining is considered the predecessor to Analytics and Data Science. Data Science has much in common with data mining, since the algorithms and approaches for the preparation of data and extracting useful insights from data are, in both, generally the same. Analytics, on the other hand, is more focused on the methods for finding and discovering useful patterns in data and has less coverage of data preparation [6, 7]. In this sense, Analytics is an important part of any Data Science procedure. However, one can argue that in order to do Analytics, data need to be collected and prepared before the modelling stage. In this context, Analytics is the same thing as Data Science. In this book, Data Science and Analytics are used interchangeably.
Data Science and statistics have considerable overlap with statisticians even arguingthat Data Science is an extension of statistical learning In fact, statistical learning andmachine learning methods are highly similar and in most cases the line between thesetwo has been blurred recently In a nutshell, differences between Data Science andstatistical learning are highly related to the mindset of analyst and their background.However, as the core of statistical learning, statistics is often used to performconfirmatory analysis where a hypothesis about a relationship between inputs and
an output is made, and the purpose of the analysis is to prove or reject the relationship and quantify the degree of that confirmation or denial using statistical tests [8]. In this context, many analyses are highly structured, such as determining if a drug is effective in reducing the incidence of a particular disease.
In statistics, controls are essential to ensure that bias is not introduced into the model, thus misleading the interpretation of the model. Most of the time, the interpretability of statistical models and their accuracy are important in understanding what the data are saying, and therefore great care is taken to transform the model inputs and outputs so that they comply with the assumptions of the modelling algorithms. In addition, much effort is put into interpreting the errors as well [9].
Data Science, on the other hand, often shows little concern for the final parameters in the models except in very general terms. The key is often the accuracy of the model and, therefore, the ability of the model to have a positive impact on the decision-making process [10]. In contrast to the structured problem being solved through confirmatory analysis using statistics, Data Science often attempts to solve less structured business problems using data that were not even collected for the purpose of building models; the data just happened to be around [1]. Controls are often not in place in the data, and therefore causality, very difficult to uncover even in structured problems, becomes exceedingly difficult to identify.
Data Scientists frequently approach problems in a more unstructured, even casual manner. The data, in whatever form they are found, drive the models. This is not a problem as long as the data continue to be collected in a manner consistent with the data used in the models; consistency in the data will increase the likelihood of consistency in the model's predictions, and therefore in how well the model supports decisions.
In summary, statistical learning is more focused on models, but in Data Science, data drive the modelling procedure [11].
Another field which has considerable overlap with Data Science is Business Intelligence (BI). The output of almost all BI analyses is visualizations, reports or dashboards that summarize interesting characteristics and metrics of the data, often described as Key Performance Indicators (KPIs). The KPI reports are user-driven and case-based, and are determined by domain experts to be used by decision makers. These reports can contain simple descriptive summaries or very complex, multidimensional measures about real-time events.
Both Data Science and BI use statistics as a computational framework. However, the focus of BI is to explain what has happened in the business or what is happening in the business. Based on these observations, decision makers can take appropriate actions.
Data Science also uses historic data or data that have already been collected. In contrast to BI, Data Science is focused more on finding patterns, in terms of models, for describing the target variable based on inputs. In other words, predictive analytics is not part of BI but is at the heart of Data Science. As a result, Data Science can provide more valuable insights for decision makers than BI can.
The procedure of a Data Science project needs to be structured and well defined in order to minimize risks. As mentioned before, the goal of Data Science is to find useful and meaningful insight from data. This is also the goal of the Knowledge Discovery in Databases (KDD) process. KDD is an iterative and interactive process
of discovering valid, novel, useful, and understandable knowledge (patterns, models, rules, etc.) in massive databases [12]. Fortunately, both Data Science and KDD have well-defined steps and tasks for conducting projects.
Like Data Science, KDD includes multidisciplinary activities. Activities in KDD entail integrating data from multiple sources, storing data in a single scalable system, preprocessing data, applying data mining methods, and visualizing and interpreting results. The following figure illustrates the steps involved in an entire KDD process.
As illustrated in Fig. 2.2, data warehousing, data mining, and data visualization are major components of a KDD process.
Similar to the KDD process, the CRISP-DM (CRoss-Industry Standard Process for Data Mining) process defines and describes the major steps in a Data Science process. CRISP-DM has been the most widely used data mining process model since its inception in the 1990s [13].
For Data Scientists, the step-by-step process provides a well-defined structure for analysis and not only reminds them of the steps that need to be accomplished, but also of the need for documentation and reporting throughout the process.
Fig 2.2 Knowledge Discovery in Databases (KDD) Process
The documentation in the Data Science process is highly valuable because of its multidisciplinary nature, as serious Data Science projects are done by a Data Science team composed of members with different backgrounds. In addition, CRISP-DM provides a common terminology for Data Science teams.
The six steps in the CRISP-DM process are shown in Fig. 2.3: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. These steps, and the sequence in which they appear in Fig. 2.3, represent the most common sequence in a Data Science project.
Data is at the core of the CRISP-DM process. In a nutshell, the process starts with some questions which need domain understanding to define the scope, goal and importance of the project. Then relevant data are collected and examined to identify potential problems in the data as well as to understand the characteristics of the data. Before doing any analytics, the data need to be prepared to identify and fix problems and issues in the data. At this stage the data are ready to be used in the Data Science process. Data Scientists often use various models for the same analytical task. So, based on the question and its required performance, models are generated and evaluated, and the expected effects and limitations of each model are documented. Finally, the best model based on the success criteria is deployed in a production environment to be used in real-world applications.
Note the feedback loops in the figure. These indicate the most common ways the typical Data Science process is modified based on the findings and results of each step during the project.
Fig 2.3 CRISP-DM Process
For example, once process objectives have been defined during business understanding, data are examined during data understanding. At this stage, if it turns out that there is insufficient data quantity or data quality to build predictive models and it is not feasible to collect more data of higher quality, the business objectives must be redefined with the available data before proceeding to data preparation and modeling. As another example, if the built models have insufficient performance, the data preparation task needs to be done again to create new derived variables, based on transformations of or interactions between existing variables, to improve the models' performance.
Every Data Science project needs objectives before any data collection, preparation, and modelling tasks. Domain experts who understand the needs, requirements, decisions and strategies, and who can understand the value of data, must define these objectives. Data Scientists themselves sometimes have this expertise, although most often managers and directors have a far better perspective on how models affect the organization [14].
In research settings, researchers always understand the problems, and therefore with enough domain knowledge they can define the objectives of a Data Science project. Domain knowledge in this step is very important. Without domain expertise, the definitions of what models should be built and how they should be assessed can lead to failed projects that don't address the key business concerns [1, 15].
Unfortunately, most data in the healthcare industry are not suitable for many kinds of analytical tasks. Often 90% of the work in a Data Science project (especially in healthcare) is getting the data into a form in which they can be used in analytical tasks. More specifically, there are two major issues associated with existing data in healthcare. First, a large number of medical records are still either hand-written or in digital formats that are only slightly better than hand-written records (such as photographs or scanned images of hand-written records, or even scanned images of printed reports). Getting medical records into a format that is computable is a prerequisite for almost any kind of progress in the current state of healthcare settings from an analytical point of view [16]. The second issue relates to the isolated state of the existing data sources. In other words, existing digital data sources cannot be combined and linked together. These two issues can be resolved with the concept of standard electronic health records, that is, patient data in a standard form that can be shared efficiently between various electronic systems and that can be moved from one location to another at the speed of the Internet [16]. While there are currently hundreds of different formats for electronic health records, the fact that they are electronic means that they can be
converted from one form into another. Standardizing on a single format would make things much easier, but just getting the data into some electronic form is the first step. Once all data are stored in electronic health records, it is feasible to link general practitioners' offices, labs, hospitals, and insurers into a data network, so that all patient data are immediately stored in a logical data store (but physically in multiple data stores). At this point the data are ready to be prepared for the analytical tasks.
Most analytical tasks need data in a two-dimensional format, composed of rows and columns. Each row represents what can be called a unit of analysis. This is slightly different from the unit of observations and measurements. Generally, data are collected from different sources with the unit of observation in mind and are then, in the data preparation step, transformed to units of analysis, as sketched below. In healthcare, a unit of analysis is typically a patient, or the test results for patients [17]. The unit of analysis is problem-specific and is therefore defined as part of the business understanding step of the Data Science process.
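As a minimal sketch of this reshaping step (the column names and values below are hypothetical, not from a real dataset), observation-level lab results can be aggregated into one row per patient with pandas:

```python
# Hypothetical observation-level data: one row per lab measurement.
import pandas as pd

observations = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2, 3],
    "test":       ["glucose", "hba1c", "glucose", "glucose", "hba1c", "glucose"],
    "value":      [5.4, 6.1, 7.8, 8.2, 7.0, 4.9],
})

# Reshape to the unit of analysis (one row per patient): the mean of each
# test becomes a feature column; tests never taken become NaN.
patients = observations.pivot_table(index="patient_id",
                                    columns="test",
                                    values="value",
                                    aggfunc="mean")
print(patients)
```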
Understanding the data entails generating lots of plots and examining the relationships between various attributes. Columns in the data are often called attributes, variables, fields, features, or just columns. Columns contain values for each unit of analysis (rows). For almost all Data Science methods, the number of columns and the order of the columns must be identical from row to row in the data. In the data understanding step, missing values and outliers need to be identified. Typically, if a feature has over 40% missing values, it can be removed from the dataset, unless the feature conveys critical information [18]. For example, there might be a strong bias in the demographics of who fills in the optional field of "age" in a survey, and this is an important piece of information. There are several ways of handling missing values. Typically, the missing values can be replaced with the average, the median or even some other computation based on values of the same feature in other records. This is called feature imputation; a brief sketch is given below. Some important models in Data Science (like tree-based ensemble models) can generally handle missing values. Similar to missing values, there are standard statistical ways of identifying and handling outliers in data. It is important that the identification and handling of missing values and outliers are documented in this step of the Data Science process. Also, the data type of the attributes determines the necessary steps in their preparation. For predictive modelling (supervised learning) it is necessary to identify one or more attributes as the target variable. Identification of the target variable is usually done in the first step of a Data Science process (business understanding). The target variable can be numeric or categorical depending on the type of model that will be built in the next step. At the end of this step, the data are ready to be used for building models and testing their performance.
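A minimal sketch of these two checks (the column names, values and the 40% threshold applied below are illustrative assumptions, not the authors' code) using pandas and scikit-learn:

```python
# Hypothetical patient-level table with missing values.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":     [54, None, 61, 47, None, 70],
    "glucose": [5.4, 7.8, None, 4.9, 6.2, 8.1],
    "smoker":  [None, None, None, None, 1, 0],   # mostly missing
})

# Drop features with more than 40% missing values (unless deemed critical).
keep = df.columns[df.isna().mean() <= 0.40]
df = df[keep]

# Impute the remaining gaps with the median of each feature.
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
```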
Based on the type of question, the analytical tasks of the Data Science project (classification, clustering, simulation, regression and so on) can be determined. For example, if there is a target variable in the question ("which participants are likely to respond to a given drug in a trial?"), the business question needs to be answered with a supervised learning task. If the target variable is of a categorical type, the learning problem is classification ("positive/negative response to the drug"). If the target variable is numeric, the learning problem is regression. There are many algorithms that can be used in classification or regression or in both. Each algorithm has its own assumptions. Since the most widely used types of Data Science tasks in healthcare are classification and regression [19], the following part of this section focuses on predictive analytics.
Regardless of the algorithm used for a predictive analytics task, the data are split into two sets: a training set and a test set. The training set is used for building the model (for example, finding the coefficients of features which best describe the variability in the training set). The test set is used for evaluating the performance of the built model. The percentage split of the data depends on the size of the data. If the dataset is large enough, the training and test sets can have a similar number of rows. Typically, 60–80% of the data is used for training the model.
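A minimal sketch of such a split with scikit-learn (the 70/30 ratio, the feature matrix `X` and the target `y` below are illustrative assumptions):

```python
# Split a dataset into a 70% training set and a 30% test set.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # hypothetical feature matrix
y = rng.integers(0, 2, size=100)       # hypothetical binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

print(X_train.shape, X_test.shape)     # (70, 4) (30, 4)
```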
As mentioned before, a predictive model is built with the values of the training set. For evaluating the performance of the model, the test set is used. In other words, the result of applying the model building step to the training set is a trained model which can be used for prediction. The test set is not used in the model building step. For evaluating the model performance, the test set is used as input for the model. After applying the model to the test set, the test set has two values (two columns) for the target variable: one is the actual value and the other is the result of applying the predictive model (the predicted value). At this stage (which is called scoring), the differences between actual and predicted values for the test set can be used for evaluating the performance of the model.
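A minimal sketch of this scoring step (the data are randomly generated and the choice of a random forest classifier is arbitrary, purely for illustration):

```python
# Score a held-out test set: one column of actual values, one of predictions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 5))                      # hypothetical features
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # hypothetical binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

scored = pd.DataFrame({"actual": y_test,
                       "predicted": model.predict(X_test)})
print((scored["actual"] == scored["predicted"]).mean())   # test-set accuracy
```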
Often, algorithms in Data Science have hyperparameters. The values of hyperparameters impact the model performance. In the Data Science process, finding a good value for a hyperparameter (model tuning) is done by examining different values for each hyperparameter and then calculating the model performance. Usually a range of values needs to be tested for the various hyperparameters (for example using exhaustive grid search or random search). This process of building a model is iterative (Fig. 2.4). This step typically results in evaluating many models based on their performance. However, the performance of the model is only one element of the success criterion of the Data Science process. Hyperparameters will be discussed later in the context of a regression task.
Most of the time, in Data Science projects, the success criterion is more important than the model assumptions. In other words, the determination of what is considered a good model depends on the particular interests of the project and is specified as the success criterion. The success criterion needs to be converted into a quantifiable metric so that the Data Scientist can use it for selecting models. Often the success criterion is a percentage improvement over a previous modelling process, like a 10% improvement in the prediction of malignant tumors with 30% less cost. Sometimes the success criterion is doing a task automatically using a Data Science method, and the success metric for that is whether the Data Science process is computationally and economically feasible.
If the purpose of the predictive model is to provide highly accurate predictions or decisions to be used by the decision makers, measures of accuracy (performance) will be used. If interpretation of the model is of most interest, accuracy measures will be used only for those models which are interpretable. In other words, not all models in Data Science have meaningful interpretations. In this case, higher-accuracy models with difficult (or no) interpretation will not be included in the final model evaluation if transparency and interpretation are more important than accuracy of prediction. In addition, subjective measures of what provides maximum insight may be most desirable. These subjective measures are often defined based on ease of implementation (from the points of view of development time, expense and migration of existing platforms) and ease of description of the model. Some projects may use a combination of both, so that the most accurate model is not selected if a less accurate but more transparent model with nearly the same acceptable accuracy is available.
Fig 2.4 Building model procedure

For classification problems, the most frequent metric used to assess model performance is the accuracy of the model, which is the percentage of correct classifications without regard to what kind of errors are made. In addition to the classification model itself, another result of applying a classification model is the confusion matrix. Figure 2.5 illustrates a confusion matrix for the detection of malignant tumors.
In this case the overall accuracy (or accuracy) of the model is (10 + 105)/(10 + 5 + 17 + 105) = 84%. In addition to overall accuracy, the confusion matrix can
provide different measures of performance, such as sensitivity, precision, fall-out and F1 score. Figure 2.6 illustrates the calculation of various performance measures based on the confusion matrix. The performance metrics from the confusion matrix are good when an entire population must be scored and acted on, for example, when making a decision about providing a customized service for all hospital visitors.
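A minimal sketch of deriving these metrics from the confusion matrix of the tumor-detection example above (the exact assignment of the four cells to TP/TN/FP/FN is an assumption here; the counts themselves come from the worked accuracy calculation):

```python
# Metrics from the confusion matrix of the tumor-detection example.
# Cell assignment is an assumption: 10 TP, 105 TN, 17 FP, 5 FN.
tp, tn, fp, fn = 10, 105, 17, 5

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 0.84 as in the text
sensitivity = tp / (tp + fn)                    # recall / true positive rate
precision   = tp / (tp + fp)
fall_out    = fp / (fp + tn)                    # false positive rate
f1          = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} "
      f"precision={precision:.2f} fall-out={fall_out:.2f} F1={f1:.2f}")
```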
If the classification model is intended for a subset of the population, for example by prioritizing patients, sorting the patients based on a model score and acting on only a portion of the selected patients, other performance metrics can be used, such as the ROC (Receiver Operating Characteristic) curve and the Area Under the Curve (AUC). ROC curves typically feature the true positive rate on the Y axis and the false positive rate on the X axis. This means that the top left corner of the plot is the ideal point for classification (a false positive rate of zero and a true positive rate of one). The area under the ROC curve is the AUC. A larger AUC usually means higher performance. The steepness of ROC curves is also important, since it is ideal to maximize the true positive rate while minimizing the false positive rate. Figure 2.7 shows the ROC diagram for the classification problem.
Fig 2.5 Confusion matrix in a classification problem
Fig 2.6 Various Performance Metrics based on Confusion Matrix
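A minimal sketch of computing an ROC curve and its AUC from model scores (the labels and scores below are hypothetical; the AUC of 0.83 quoted in Fig. 2.7 comes from the book's own data, not from this snippet):

```python
# ROC curve and AUC from predicted scores on a test set.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]                        # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.6, 0.3]   # model scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC =", roc_auc_score(y_true, y_score))

# The fpr/tpr pairs can be plotted (e.g. with matplotlib) to draw the ROC
# curve; the closer the curve hugs the top-left corner, the better the model.
```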
For regression problems, the model training and scoring methods are similar to those for classification problems. In the following paragraphs, model building, hyperparameter identification and performance metric calculation are described using simple linear regression and a powerful penalized linear regression model.
As mentioned before, regression problems are classified as supervised learning
or predictive analytics problems. In supervised learning, the initial dataset has labels or known values for a target variable. The initial dataset is usually divided into training and test datasets for fitting the model to the data and assessing the accuracy of prediction, respectively. Linear regression, or Ordinary Least Squares (OLS), is a very simple approach for predicting a quantitative response. Linear regression has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern approaches in Data Science, linear regression is still a useful and widely used statistical learning method. It assumes that there is approximately a linear relationship between X and Y. Mathematically, the relationship between X and Y can be written as Eq. 2.1. In Eq. 2.1, given a vector of features $X^T = (X_1, X_2, \ldots, X_p)$, the model predicts the output Y (also known as the response, dependent variable, outcome or target) via the model:
$$ Y = f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j \qquad (2.1) $$

Equation 2.1 Linear Regression Model (Ordinary Least Squares)
Fig 2.7 ROC curve for the tumor identification problem (AUC = 0.83)
The term $\beta_0$ is the intercept in statistical learning, or the bias in machine learning. The $\beta_j$'s are unknown parameters or coefficients. The $X_j$ are used for making predictions and are known as features, predictors, independent variables or inputs. The variables $X_j$ can be quantitative inputs (such as measurements or observations like brain tumor size, type, and symptoms), transformations of quantitative inputs (such as the log, square root or square of observed inputs), or basis expansions, such as $X_2 = X_1^2$, $X_3 = X_1^3$, leading to a polynomial representation, dummy variables for representing categorical data (like gender Male/Female), or interactions between variables, for example $X_3 = X_1 \cdot X_2$. Although it might seem that the model can be non-linear (by including $X_2$ or $X_3$), no matter the source of the $X_j$, the model is linear in the parameters [3, 9]. OLS is a widely used method for estimating the unknown parameters in a linear regression model by minimizing the differences between the target values in the training dataset and the target values predicted by the linear approximation function. In other words, the least squares approach chooses $\hat{\beta}_j$ to minimize the RSS (Residual Sum of Squares of errors).
$$ \mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \qquad (2.2) $$

Equation 2.2 Residual Sum of Squares of errors (n is the number of observations or rows in the training dataset)
In Eq. 2.2, $\hat{y}_i$ is the predicted (estimated) value for the feature vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$. The Residual Standard Error (RSE) is an estimate of the standard deviation of the errors. More specifically, it is the average amount that the response will deviate from the true regression line. It is computed using the following formula:
$$ \mathrm{RSE} = \sqrt{\frac{1}{n-2}\,\mathrm{RSS}} \qquad (2.3) $$

Equation 2.3 Residual Standard Error (RSE)
The RSE is considered a measure of the lack of fit of the model to the data. If the predictions obtained using the model are very close to the true outcome values, then the RSE will be small and it can be concluded that the model fits the data very well. On the other hand, if $\hat{y}_i$ is very far from $y_i$ for one or more observations, then the RSE may be quite large, indicating that the model doesn't fit the training data well. The RSE provides an absolute measure of the lack of fit of the model to the data. But since it is measured in the units of Y, it is not always clear what constitutes a good RSE, especially when comparing the performance of the same model on different datasets. The $R^2$ (R squared, or coefficient of determination) statistic provides an alternative measure of fit. It takes the form of a proportion and is independent of the scale of Y.
$$ R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \qquad (2.4) $$

Equation 2.4 $R^2$ statistic or coefficient of determination
In Eq. 2.4, TSS is the total sum of squares, which can be calculated with Eq. 2.5:
$$ \mathrm{TSS} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 \qquad (2.5) $$

Equation 2.5 Total Sum of Squares of errors (TSS)
TSS measures the total variance in the response Y, and is the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. Hence, TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and $R^2$ measures the proportion of variability in Y that can be explained using X [3]. An $R^2$ statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression. An $R^2$ near 0 indicates that the regression did not explain much of the variability in the response.
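A minimal sketch of fitting an OLS model and computing RSS, RSE and $R^2$ as defined in Eqs. 2.2–2.5 (the data here are randomly generated purely for illustration):

```python
# Fit ordinary least squares and compute RSS, RSE and R-squared (Eqs. 2.2-2.5).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                       # hypothetical features
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
y_hat = ols.predict(X)

n = len(y)
rss = np.sum((y - y_hat) ** 2)                      # Eq. 2.2
rse = np.sqrt(rss / (n - 2))                        # Eq. 2.3
tss = np.sum((y - y.mean()) ** 2)                   # Eq. 2.5
r2  = 1 - rss / tss                                 # Eq. 2.4

print(f"RSS={rss:.2f}  RSE={rse:.3f}  R^2={r2:.3f}")
```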
While the minimization problem of linear regression is easy to solve, it is very prone to overfitting (high variance). In order to overcome the overfitting potential of linear regression, in penalized linear regression an additional penalty term is added
to Eq. 2.1, which forces the problem to balance the conflicting goals of minimizing the squared errors and minimizing the penalty term. As an example of penalized linear regression, LASSO (Least Absolute Shrinkage and Selection Operator) adds a penalty term that is called the $\ell_1$ norm (Eq. 2.6). The penalty term is the sum of the absolute values of the coefficients. The $\ell_1$ norm provides variable selection and results in sparse coefficients [20] (some unimportant features might have a coefficient value of zero).
$$ \hat{\beta}^{\mathrm{lasso}} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\} \qquad (2.6) $$

Equation 2.6 LASSO penalized linear regression model
The LASSO algorithm is computationally efficient; calculating the full set of LASSO models requires the same order of computation as ordinary least squares, yet it provides higher accuracy than OLS regression [21]. In Eq. 2.6, $\lambda$ is a hyperparameter. As mentioned before, many algorithms in Data Science have hyperparameters. In order to find a good value for the hyperparameters, usually a range of values needs to be tested (for example using exhaustive grid search or random search). Scatter plots of metrics (like errors) against the values of a hyperparameter can be useful for identifying a potentially good range of hyperparameter values. Figure 2.8 shows the error plot for a hyperparameter of a LASSO model. As you can see in Fig. 2.8, values around 0.01 for $\lambda$ result in considerably lower RSS.
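A minimal sketch of this kind of hyperparameter search for LASSO (scikit-learn calls the $\lambda$ of Eq. 2.6 `alpha`; the data and the candidate grid below are illustrative assumptions, not the brain-cancer dataset behind Fig. 2.8):

```python
# Search a grid of lambda (alpha) values for LASSO and keep the best one.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))                           # hypothetical measurements
coef = np.array([3.0, -2.0, 0, 0, 1.5, 0, 0, 0, 0, 0])   # sparse "true" effects
y = X @ coef + rng.normal(scale=1.0, size=200)           # hypothetical response

search = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": np.logspace(-4, 1, 30)},        # candidate lambda values
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

print("best lambda:", search.best_params_["alpha"])
print("non-zero coefficients:", np.sum(search.best_estimator_.coef_ != 0))
```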
Once the best model based on the success criteria has been found (built), the final model has to be deployed in production, where it can be used by other applications to drive real decisions. It is worth noting that after tuning the model (in the previous step), all data are used for training when building the final model. In other words, for building the model and evaluating its performance the whole dataset needs to be divided into training and test sets; after identification of the best model (by building various models and assessing accuracy metrics such as $R^2$ for regression and accuracy for classification), the whole dataset is then used for building the final model.
Models can be deployed in many different ways depending on the hosting environment. In most cases, deploying a model involves implementing the data transformations and the predictive algorithm developed by the data scientist in order to integrate with an existing information management system or a decision support platform.
Fig 2.8 RSS for a regression problem. In this figure, a penalized regression model (LASSO) is used for estimating (predicting) survival rate based on tumor measurements in a certain type of brain cancer. Red dots show the values tested for the hyperparameter. The vertical blue line shows the minimum value of RSS and its corresponding $\lambda$
Model deployment is usually a cumbersome process for large projects. Developers are typically responsible for deploying the model and translating the Data Science pipeline into production-ready code. Since developers and Data Scientists usually work with different programming languages, development environments, coding lifecycles and mindsets, model deployment can be error-prone and cumbersome. It needs careful testing procedures to prevent wrong translation of a Data Science pipeline and at the same time to ensure the non-functional requirements of the system, like scalability, security and reliability.
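One lightweight way to hand a trained pipeline from the Data Scientist to developers is to persist the whole fitted pipeline and expose it behind a small web service. The sketch below is an illustrative approach assuming scikit-learn, joblib and Flask are available; it is not a prescription from the authors, and the dataset and endpoint name are hypothetical:

```python
# Persist a fitted scikit-learn pipeline and serve predictions over HTTP.
# Training side (run once by the data scientist):
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
joblib.dump(pipeline, "model.joblib")    # single artifact handed to developers

# Serving side (a minimal Flask app run by the engineering team):
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. [[0.1, 2.3, -1.0, 0.4]]
    return jsonify(prediction=model.predict(features).tolist())

if __name__ == "__main__":
    app.run(port=5000)
```

Persisting the whole pipeline (scaler plus model) rather than only the coefficients reduces the risk of the data transformations being re-implemented incorrectly in production.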
Recently, some cloud computing providers have extended their service offerings to Data Science. For example, Microsoft's Azure Machine Learning (AzureML) [22–24] dramatically simplifies model deployment by enabling data scientists to deploy their final models as web services that can be invoked from any application on any platform, including desktop, smartphone, mobile and wearable devices. Figure 2.9 summarizes the major steps and activities in the CRISP-DM process.
There are a large number of programming languages, software packages and platforms for performing the various tasks in a Data Science project. Based on O'Reilly's Data Science Survey 2015, Python, R, Microsoft Excel and Structured Query Language (SQL) are the most widely used tools among data scientists [25]. In addition to R and Python, other popular programming languages in Data Science projects are C#, Java, MATLAB, Perl, Scala and VB/VBA. Relational databases are the most common systems for storage, management and retrieval of data (using SQL or SQL-based languages like T-SQL). The most popular relational databases in Data Science are MySQL, MS SQL Server, PostgreSQL, Oracle and SQLite. In addition
Fig 2.9 CRISP-DM Steps and Tasks