
SPRINGER BRIEFS IN STATISTICS


More information about this series at http://www.springer.com/series/8921


Ton J. Cleophas • Aeilko H. Zwinderman

Machine Learning in Medicine—Cookbook Three


Academic Medical Center, Leiden, The Netherlands

ISSN 2191-544X ISSN 2191-5458 (electronic)

ISBN 978-3-319-12162-8 ISBN 978-3-319-12163-5 (eBook)

DOI 10.1007/978-3-319-12163-5

Library of Congress Control Number: 2013957369

Springer Cham Heidelberg New York Dordrecht London

© The Author(s) 2014

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Additional material to this book can be downloaded from http://extras.springer.com/


Preface

The amount of medical data is estimated to double every 20 months, and clinicians are at a loss to analyze them. Fortunately, user-friendly statistical software has been helpful for the past 30 years. However, traditional statistical methods have difficulty identifying outliers in large datasets and finding patterns in big data and data with multiple exposure/outcome variables. In addition, analysis rules for surveys and questionnaires, which are currently common methods of medical data collection, are, essentially, missing. Fortunately, a new discipline, machine learning, is able to cover all of these limitations. It involves computationally intensive methods like factor analysis, cluster analysis, and discriminant analysis. It is currently mainly the domain of computer scientists, and is already commonly used in social sciences, marketing research, operational research, and applied sciences. It is little used in medical research, probably due to the traditional belief of clinicians in clinical trials, where multiple variables even out by the randomization process and are not taken into account. In contrast, modern medical computer files often involve hundreds of variables like genes and other laboratory values, and computationally intensive methods are required.

In the past 2 years we have completed a series of three textbooks entitled Machine Learning in Medicine Part One, Two, and Three (ed. by Springer Heidelberg Germany, 2012–2013). Also, we produced two 100-page cookbooks, entitled Machine Learning in Medicine—Cookbook One and Two. These cookbooks were

(1) without background information and theoretical discussions,

(2) highlighting technical details,

(3) with data examples available at extras.springer.com for readers to perform their own analyses,

(4) with references to the above textbooks for those wishing background information.

The current volume, entitled Machine Learning in Medicine—Cookbook Three, was written in much the same way as the first two, and it reviews concise versions of the machine learning methods covered so far, like spectral plots, Bayesian networks, and support vector machines (Chaps. 9, 12, 13). Also, a first description is given of several new methods already employed by technical and market scientists, and of their suitability for clinical research, like ordinal scaling for inconsistent intervals, loglinear models for varying incident risks, and iteration methods for cross-validations (Chaps. 4–6, 16).

Additional new subjects are the following. Chapter 1 describes a novel method for data mining using visualization processes instead of calculus methods. Chapter 2 describes the use of trained clusters, a scientifically more appropriate alternative to traditional cluster analysis. Chapter 11 describes evolutionary operations (evops), and the evop calculators, already widely used in chemical and technical process improvements.

Similar to the first cookbook, the current work will assess, in a nonmathematical way, the stepwise analyses of 20 machine learning methods that are, likewise, based on three major machine learning methodologies:

Cluster Methodologies (Chaps. 1 and 2),

Linear Methodologies (Chaps. 3–8),

Rules Methodologies (Chaps. 9–20).

In extras.springer.com the data files of the examples (14 SPSS files) are given (both real and hypothesized data). Furthermore, 4 csv type Excel files are available for data analysis in the Konstanz Information Miner, a widely approved free machine learning software package available on the Internet since 2006.

The current 100-page book, entitled Machine Learning in Medicine—Cookbook Three, and its complementary "Cookbooks One and Two" are written as training companions for 60 important machine learning methods relevant to medicine. We should emphasize that all of the methods described have been successfully applied in the authors' own research.

Lyon, August 2014 Ton J. Cleophas

Aeilko H. Zwinderman


Contents of Previous Volumes

5 Generalized Linear Models for Outcome Prediction with Paired Data

6 Generalized Linear Models for Predicting Event-Rates

7 Factor Analysis and Partial Least Squares (PLS) for Complex-Data Reduction

8 Optimal Scaling of High-sensitivity Analysis of Health Predictors

9 Discriminant Analysis for Making a Diagnosis from Multiple Outcomes

10 Weighted Least Squares for Adjusting Efficacy Data with Inconsistent Spread

11 Partial Correlations for Removing Interaction Effects from Efficacy Data

12 Canonical Regression for Overall Statistics of Multivariate Data

Rules Models

13 Neural Networks for Assessing Relationships that are Typically Nonlinear

14 Complex Samples Methodologies for Unbiased Sampling

15 Correspondence Analysis for Identifying the Best of Multiple Treatments in Multiple Groups

16 Decision Trees for Decision Analysis

17 Multidimensional Scaling for Visualizing Experienced Drug Efficacies


18 Stochastic Processes for Long Term Predictions from Short Term Observations

19 Optimal Binning for Finding High Risk Cut-offs

20 Conjoint Analysis for Determining the Most Appreciated Properties of Medicines to be Developed

Cluster Models

1 Nearest Neighbors for Classifying New Medicines

2 Predicting High-Risk-Bin Memberships

3 Predicting Outlier Memberships

Linear Models

4 Polynomial Regression for Outcome Categories

5 Automatic Nonparametric Tests for Predictor Categories

6 Random Intercept Models for Both Outcome and Predictor Categories

7 Automatic Regression for Maximizing Linear Relationships

8 Simulation Models for Varying Predictors

9 Generalized Linear Mixed Models for Outcome Prediction from Mixed Data

10 Two Stage Least Squares for Linear Models with Problematic Predictors

11 Autoregressive Models for Longitudinal Data

Rules Models

12 Item Response Modeling for Analyzing Quality of Life with Better Precision

13 Survival Studies with Varying Risks of Dying

14 Fuzzy Logic for Improved Precision of Pharmacological Data Analysis

15 Automatic Data Mining for the Best Treatment of a Disease

16 Pareto Charts for Identifying the Main Factors of Multifactorial Outcomes

17 Radial Basis Neural Networks for Multidimensional Gaussian Data

18 Automatic Modeling for Drug Efficacy Prediction

19 Automatic Modeling for Clinical Event Prediction

20 Automatic Newton Modeling in Clinical Pharmacology


Contents

Part I Cluster Models

1 Data Mining for Visualization of Health Processes (150 Patients with Pneumonia) 3

1.1 General Purpose 3

1.2 Primary Scientific Question 3

1.3 Example 3

1.4 Knime Data Miner 5

1.5 Knime Workflow 6

1.6 Box and Whiskers Plots 7

1.7 Lift Chart 7

1.8 Histogram 8

1.9 Line Plot 9

1.10 Matrix of Scatter Plots 10

1.11 Parallel Coordinates 11

1.12 Hierarchical Cluster Analysis with SOTA (Self Organizing Tree Algorithm) 12

1.13 Conclusion 13

2 Training Decision Trees for a More Meaningful Accuracy (150 Patients with Pneumonia) 15

2.1 General Purpose 15

2.2 Primary Scientific Question 15

2.3 Example 16

2.4 Downloading the Knime Data Miner 17

2.5 Knime Workflow 18

2.6 Conclusion 20


Part II Linear Models

3 Variance Components for Assessing the Magnitude of Random Effects (40 Patients with Paroxysmal Tachycardias) 23

3.1 General Purpose 23

3.2 Primary Scientific Question 23

3.3 Example 23

3.4 Conclusion 27

4 Ordinal Scaling for Clinical Scores with Inconsistent Intervals (900 Patients with Different Levels of Health) 29

4.1 General Purpose 29

4.2 Primary Scientific Questions 29

4.3 Example 29

4.4 Conclusion 33

5 Loglinear Models for Assessing Incident Rates with Varying Incident Risks (12 Populations with Different Drink Consumption Patterns) 35

5.1 General Purpose 35

5.2 Primary Scientific Question 36

5.3 Example 36

5.4 Conclusion 38

6 Loglinear Modeling for Outcome Categories (Quality of Life Scores in 445 Patients with Different Personal Characteristics) 39

6.1 General Purpose 39

6.2 Primary Scientific Question 39

6.3 Example 39

6.4 Conclusion 43

7 Heterogeneity in Clinical Research: Mechanisms Responsible 45

7.1 General Purpose 45

7.2 Primary Scientific Question 45

7.3 Example 45

7.4 Conclusion 48

8 Performance Evaluation of Novel Diagnostic Tests (650 and 588 Patients with Peripheral Vascular Disease) 49

8.1 General Purpose 49

8.2 Primary Scientific Question 49


8.3 Example 49

8.3.1 Binary Logistic Regression 52

8.3.2 c-Statistics 53

8.4 Conclusion 55

Part III Rules Models

9 Spectral Plots for High Sensitivity Assessment of Periodicity (6 Years' Monthly C Reactive Protein Levels) 59

9.1 General Purpose 59

9.2 Specific Scientific Question 59

9.3 Example 59

9.4 Conclusion 63

10 Runs Test for Identifying Best Regression Models (21 Estimates of Quantity and Quality of Patient Care) 65

10.1 General Purpose 65

10.2 Primary Scientific Question 65

10.3 Example 65

10.4 Conclusion 69

11 Evolutionary Operations for Process Improvement (8 Operation Room Air Condition Settings) 71

11.1 General Purpose 71

11.2 Specific Scientific Question 71

11.3 Example 72

11.4 Conclusion 74

12 Bayesian Networks for Cause Effect Modeling (600 Patients Assessed for Longevity Factors) 75

12.1 General Purpose 75

12.2 Primary Scientific Question 75

12.3 Example 76

12.4 Binary Logistic Regression in SPSS 76

12.5 Konstanz Information Miner (Knime) 77

12.6 Knime Workflow 78

12.6.1 Confusion Matrix 79

12.6.2 Accuracy Statistics 79

12.7 Conclusion 80


13 Support Vector Machines for Imperfect Nonlinear Data (200 Patients with Sepsis) 81

13.1 General Purpose 81

13.2 Primary Scientific Question 81

13.3 Example 82

13.4 Knime Data Miner 82

13.5 Knime Workflow 83

13.6 File Reader Node 83

13.7 The Nodes X-Partitioner, SVM Learner, SVM Predictor, X-Aggregator 84

13.8 Error Rates 84

13.9 Prediction Table 85

13.10 Conclusion 85

14 Multiple Response Sets for Visualizing Clinical Data Trends (811 Patient Visits to General Practitioners) 87

14.1 General Purpose 87

14.2 Specific Scientific Question 87

14.3 Example 87

14.4 Conclusion 93

15 Protein and DNA Sequence Mining 95

15.1 General Purpose 95

15.2 Specific Scientific Question 96

15.3 Data Base Systems on the Internet 96

15.4 Example 1 97

15.5 Example 2 98

15.6 Example 3 98

15.7 Example 4 99

15.8 Conclusion 100

16 Iteration Methods for Crossvalidations (150 Patients with Pneumonia) 101

16.1 General Purpose 101

16.2 Primary Scientific Question 101

16.3 Example 101

16.4 Downloading the Knime Data Miner 102

16.5 Knime Workflow 103

16.6 Crossvalidation 104

16.7 Conclusion 105


17 Testing Parallel-Groups with Different Sample Sizes and Variances (5 Parallel-Group Studies) 107

17.1 General Purpose 107

17.2 Primary Scientific Question 107

17.3 Examples 108

17.4 Conclusion 109

18 Association Rules Between Exposure and Outcome (50 and 60 Patients with Coronary Risk Factors) 111

18.1 General Purpose 111

18.2 Primary Scientific Question 111

18.3 Example 111

18.3.1 Example One 113

18.3.2 Example Two 114

18.4 Conclusion 115

19 Confidence Intervals for Proportions and Differences in Proportions 117

19.1 General Purpose 117

19.2 Primary Scientific Question 117

19.3 Example 118

19.3.1 Confidence Intervals of Proportions 118

19.3.2 Confidence Intervals of Differences in Proportions 119

19.4 Conclusion 120

20 Ratio Statistics for Efficacy Analysis of New Drugs (50 Patients Treated for Hypercholesterolemia) 121

20.1 General Purpose 121

20.2 Primary Scientific Question 121

20.3 Example 122

20.4 Conclusion 125

Index 127


Part I

Cluster Models


Chapter 1

Data Mining for Visualization of Health Processes (150 Patients with Pneumonia)

1.1 General Purpose

Computer files of clinical data are often complex and multidimensional, and they are frequently hard to test statistically. Instead, visualization processes can be successfully used as an alternative approach to traditional statistical data analysis. For example, the Konstanz Information Miner (KNIME) software has been developed by computer scientists from Silicon Valley in collaboration with technicians from Konstanz University at the Bodensee in Switzerland, and it pays particular attention to visual data analysis. It has been available since 2006 as a free package through the Internet. So far, it is mainly used by chemists and pharmacists, but not by clinical investigators. This chapter assesses whether visual processing of clinical data may sometimes perform better than traditional statistical analysis.

1.2 Primary Scientific Question

Can visualization processes of clinical data provide insights that remain hidden with traditional statistical tests?

1.3 Example

Four inflammatory markers [C-reactive protein (CRP), erythrocyte sedimentation rate (ESR), leucocyte count (leucos), and fibrinogen] were measured in 150 patients with pneumonia. Based on the chest X-ray, clinical severity was classified as A (mild infection), B (medium severity), or C (severe infection). One scientific question was to assess whether the markers could adequately predict the severity of infection.

CRP      Leucos   Fibrinogen   ESR      X-ray severity
120.00   5.00     11.00        60.00    A

Leucos = leucocyte count (×10^9/l)
Fibrinogen = fibrinogen level (mg/l)
ESR = erythrocyte sedimentation rate (mm)
X-ray severity = chest X-ray pneumonia severity score (A–C = mild to severe)

The first 15 patients are given above. The entire data file is entitled "decision tree" and is available at http://extras.springer.com. Data analysis of these data in SPSS is rather limited. Start by opening the data file in SPSS statistical software.

Command:

click Graphs…Legacy Dialogs…Bar Charts…click Simple…click Define…Category Axis: enter "severity score"…Variable: enter CRP…mark Other statistics…click OK

After performing the same procedure for the other variables, four graphs are produced, as shown underneath. The mean levels of all of the inflammatory markers consistently tended to rise with increasing severities of infection. Univariate multinomial logistic regression with severity as outcome gives a significant effect of all of the markers. However, this effect is largely lost in the multiple multinomial logistic regression, probably due to interactions.
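For readers who prefer a scriptable check of the same idea, the sketch below fits univariate and multiple multinomial logistic regressions in Python (scikit-learn). It is not the authors' SPSS procedure: the file name decisiontree.csv and the column names CRP, Leucos, Fibrinogen, ESR, and severity are assumptions based on the data file described in this chapter, and training accuracy is used only as a rough stand-in for the significance tests reported by SPSS.

```python
# Minimal sketch (not the authors' SPSS procedure): univariate and multiple
# multinomial logistic regression on the four markers. The file name
# "decisiontree.csv" and the column names are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("decisiontree.csv")
markers = ["CRP", "Leucos", "Fibrinogen", "ESR"]
y = df["severity"]                      # A / B / C

# Univariate models: one marker at a time (the lbfgs solver fits a
# multinomial model for a three-class outcome).
for marker in markers:
    X = StandardScaler().fit_transform(df[[marker]])
    fit = LogisticRegression(max_iter=1000).fit(X, y)
    print(marker, "training accuracy:", round(fit.score(X, y), 3))

# Multiple model: all four markers together. Little gain over the best single
# marker is consistent with the interactions mentioned in the text.
X = StandardScaler().fit_transform(df[markers])
full = LogisticRegression(max_iter=1000).fit(X, y)
print("all four markers:", round(full.score(X, y), 3))
```

A marker whose univariate accuracy is clearly above the 33 % chance level behaves like the "significant" univariate effects described above, while a small gain from the full model over the best single marker mirrors the loss of effect in the multiple regression.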


We are interested in exploring these results for additional effects, for example hidden data effects, like different predictive effects and frequency distributions for different subgroups. For that purpose the KNIME data miner will be applied. SPSS data files cannot be read directly by the KNIME software, but Excel/csv files can, and SPSS data can be saved as such a file (the csv file type available on your computer must be used).

Command in SPSS:

click File…click Save as…in "Save as type": enter Comma Delimited (*.csv)…click Save

1.4 Knime Data Miner

In Google enter the term "knime". Click Download and follow the instructions. After completing the pretty easy download procedure, open the knime workbench by clicking the knime welcome screen. The center of the screen displays the workflow editor, like the canvas in SPSS Modeler. It is empty, and can be used to build a stream of nodes, called a workflow in knime. The node repository is in the left lower angle of the screen, and the nodes can be dragged to the workflow editor simply by left-clicking. The nodes are computer tools for data analysis like visualization and statistical processes. Node description is in the right upper angle of the screen. Before the nodes can be used, they have to be connected with the "file reader" node, and with one another, by arrows drawn, again, simply by left clicking the small triangles attached to the nodes. Right clicking on the file reader enables you to configure from your computer a requested data file…click Browse…and download from the appropriate folder a csv type Excel file. You are set for analysis now. For convenience, a CSV file entitled "decisiontree" has been made available at http://extras.springer.com.


1.6 Box and Whiskers Plots

In the node repository find the node Box Plot. First click the IO option (import/export option nodes). Then click "Read"; the File Reader node is displayed, and can be dragged by left clicking to the workflow editor. Enter the requested data file as described above. A node dialog is displayed underneath the node, entitled Node 1. Its light is orange at this stage, and should turn green before the node can be applied. If you right click the node's center, and then left click File Table, a preview of the data is supplied.

Now, in the search box of the node repository find and click Data Views…then "Box plot"…drag to the workflow editor…connect with an arrow to File Reader…right click File Reader…right click Execute…right click the Box Plot node…right click Configure…right click Execute and open view…

The above box plots with 95 % confidence intervals of the four variables are displayed. The ESR plot shows that outliers have also been displayed. The leucocyte count has the smallest confidence interval, and it may thus be the best predictor.
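A minimal Python equivalent of the box-and-whiskers view is sketched below, assuming the same decisiontree.csv export; note that standard pandas/matplotlib boxes show the median and interquartile range rather than the 95 % confidence intervals drawn by KNIME.

```python
# Minimal sketch of the box-and-whiskers view ("decisiontree.csv" layout assumed).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("decisiontree.csv")
markers = ["CRP", "Leucos", "Fibrinogen", "ESR"]

df[markers].boxplot()        # one box per inflammatory marker, outliers as points
plt.ylabel("value")
plt.show()
```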

1.7 Lift Chart

In the node repository click Lift Chart and drag to the workflow editor…connect with an arrow to File Reader…right click Execute…right click the Lift Chart node…right click Configure…right click Execute and open view…


The lift chart shows the predictive performance of the data, assuming that the four inflammatory markers are predictors and the severity score is the outcome. If the predictive performance is no better than random, the ratio of successful predictions with/without the model = 1.000 (the green line). The x-axis gives deciles (1 = 10 % of the entire sample, etc.). It can be observed that at 7 or more deciles the predictive performance starts to be pretty good (with ratios of 2.100–2.400). Logistic regression (here multinomial logistic regression) is used by Knime for making predictions.
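The decile-wise lift can also be reproduced outside KNIME. The sketch below is a simplification: it scores a single binarized outcome (severe pneumonia, class C) with ordinary logistic regression, whereas KNIME's chart is based on the multinomial model; the decisiontree.csv layout is again an assumption.

```python
# Simplified lift-by-decile computation for one outcome class (C = severe).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("decisiontree.csv")
X = df[["CRP", "Leucos", "Fibrinogen", "ESR"]]
y = (df["severity"] == "C").astype(int)            # 1 = severe pneumonia

prob = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
y_ranked = y.values[np.argsort(-prob)]             # cases ranked by predicted risk
overall_rate = y.mean()

for decile in range(1, 11):
    top = y_ranked[: int(len(y_ranked) * decile / 10)]
    lift = top.mean() / overall_rate               # 1.0 = no better than random
    print(f"top {decile * 10:3d} %  lift = {lift:.2f}")
```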

1.8 Histogram

In the node repository search box type "color"…click the Color Manager node and drag to the workflow editor…in the node repository click color…click the Esc button of your computer…click Data Views…select Interactive Histogram and transfer to the workflow editor…connect the Color Manager node with File Reader…connect the Color Manager with the Interactive Histogram node…right click Configure…right click Execute and open view…


Interactive histograms with bins of ESR values are given. The colors show the proportions of cases with mild severity (A, red), medium severity (B, green), and severe pneumonias (C, blue). It can be observed that many mild cases (red) are in the ESR 44–71 mm bins. Above an ESR of 80 mm, blue (severe pneumonia) is increasingly present. The software program has selected only the ESR values 44–134. Instead of histograms of ESR, those of the other predictor variables can be made.
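A comparable class-colored histogram of ESR can be drawn with matplotlib, again assuming the decisiontree.csv layout; the 10 mm bin width used here is an arbitrary choice, not the KNIME default.

```python
# Sketch of an ESR histogram split by severity class ("decisiontree.csv" assumed).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("decisiontree.csv")
bins = range(40, 150, 10)                        # ESR bins in mm

for label, color in [("A", "red"), ("B", "green"), ("C", "blue")]:
    plt.hist(df.loc[df["severity"] == label, "ESR"],
             bins=bins, alpha=0.5, color=color, label=f"severity {label}")
plt.xlabel("ESR (mm)")
plt.ylabel("number of patients")
plt.legend()
plt.show()
```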

1.9 Line Plot

In the node repository click Data Views…select the node Line Plot and transfer to the workflow editor…connect the Color Manager with the Line Plot node…right click Configure…right click Execute and open view…


The line plot gives the values of all cases along the x-axis. The upper curve gives the CRP values, the middle one the ESR values, and the lower part the leucos and fibrinogen values. The rows 0–50 are the cases with mild pneumonia, the rows 51–100 the medium severity cases, and the rows 101–150 the severe cases. It can be observed that particularly the CRP, fibrinogen, and leucos levels increase with increased severity of infection. This is not observed with the ESR levels.

1.10 Matrix of Scatter Plots

In the node repository click Data Views…select "Matrix of scatter plots" and transfer to the workflow editor…connect the Color Manager with "Matrix of scatter plots"…right click Configure…right click Execute and open view…


The above figure gives the results. The four predictor variables are plotted against one another. By the colors (blue for the severest, red for the mildest pneumonias) the fields show that the severest pneumonias are predominantly in the right upper quadrant, and the mildest in the left lower quadrant.

1.11 Parallel Coordinates

In the node repository click Data Views…select "Parallel coordinates" and transfer to the workflow editor…connect the Color Manager with "Parallel coordinates"…right click Configure…right click Execute and open view…click Appearance…click Draw (spline) Curves instead of lines…


The above figure is given. It shows that the leucocyte count and fibrinogen level are excellent predictors of infection severity. CRP and ESR are also adequate predictors of infections with mild and medium severity, but poor predictors of levels of severe infection.
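pandas ships a simple parallel-coordinates plot that gives a comparable view, again assuming the decisiontree.csv layout; the markers are min-max scaled here so that the four axes share one range, something KNIME handles internally.

```python
# Sketch of a parallel-coordinates view colored by severity class.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.read_csv("decisiontree.csv")
cols = ["CRP", "Leucos", "Fibrinogen", "ESR"]

# Scale each marker to 0-1 so the four axes are comparable.
scaled = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())
scaled["severity"] = df["severity"]

parallel_coordinates(scaled, class_column="severity",
                     color=["red", "green", "blue"])
plt.ylabel("scaled marker value")
plt.show()
```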

1.12 Hierarchical Cluster Analysis with SOTA (Self Organizing Tree Algorithm)

In the node repository click Mining…select the node Self Organizing Tree Algorithm (SOTA) Learner and transfer to the workflow editor…connect the Color Manager with "SOTA Learner"…right click Configure…right click Execute and open view…


SOTA learning is a modified hierarchical cluster analysis, and in this example it uses the between-case distances of fibrinogen as variable. On the y-axis are standardized distances of the cluster combinations. Clicking the small squares interactively demonstrates the row numbers of the individual cases. It can be observed at the bottom of the figure that the severity classes cluster very well, with the mild cases (red) left, medium severity (green) in the middle, and severe cases (blue) right.
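SOTA itself is not part of the common Python libraries, so the sketch below substitutes ordinary agglomerative hierarchical clustering (Ward linkage) on the fibrinogen values and draws the dendrogram; it illustrates the same clustering idea but is not the SOTA algorithm, and the decisiontree.csv layout is again assumed.

```python
# Stand-in for SOTA: ordinary hierarchical clustering of fibrinogen values.
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

df = pd.read_csv("decisiontree.csv")
Z = linkage(df[["Fibrinogen"]].values, method="ward")   # between-case distances

# Leaves are labeled with the severity class, so well-separated classes
# should appear as contiguous blocks at the bottom of the dendrogram.
dendrogram(Z, labels=df["severity"].values, leaf_font_size=6)
plt.ylabel("cluster distance")
plt.show()
```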

1.13 Conclusion

Clinical computer files are complex and hard to test statistically. Instead, visualization processes can be successfully used as an alternative approach to traditional statistical data analysis. For example, KNIME (Konstanz Information Miner) software, developed by computer scientists at the Konstanz University Technical Department at the Bodensee, although mainly used by chemists and pharmacists, is able to visualize multidimensional clinical data, and this approach may sometimes perform better than traditional statistical testing. In the current example it was able to demonstrate the clustering of inflammatory markers to identify different classes of pneumonia severity, and also to demonstrate that leucocyte count and fibrinogen were the best markers, and that ESR was a poor marker. For all of the markers the best predictive performance was obtained in the severest cases of disease. All of these observations remained unobserved in the traditional statistical analysis in SPSS.


More background, theoretical and mathematical information on splines and hierarchical cluster modeling is given in Machine Learning in Medicine Part One, Chap. 11, Non-linear modeling, pp 127–143, and Chap. 15, Hierarchical cluster analysis for unsupervised data, pp 183–195, Springer Heidelberg Germany, from the same authors.


Chapter 2

Training Decision Trees for a More Meaningful Accuracy (150 Patients with Pneumonia)

2.1 General Purpose

Traditionally, decision trees are used for finding the best predictors of health risks and improvements (Chap. 16 in Machine Learning in Medicine Cookbook One, pp 97–104, Decision trees for decision analysis, Springer Heidelberg Germany, 2014, from the same authors). However, this method is not entirely appropriate, because a decision tree is built from a data file, and, subsequently, the same data file is applied once more for computing the health risk probabilities from the built tree. Obviously, the accuracy must be close to 100 %, because the test sample is 100 % identical to the sample used for building the tree, and, therefore, this accuracy does not mean too much. With neural networks this problem of duplicate usage of the same data is solved by randomly splitting the data into two samples, a training sample and a test sample (Chap. 12 in Machine Learning in Medicine Part One, pp 145–156, Artificial intelligence, multilayer perceptron modeling, Springer Heidelberg Germany, 2013, from the same authors). The current chapter assesses whether the splitting methodology, otherwise called partitioning, is also feasible for decision trees, and what level of accuracy it maintains.

2.2 Primary Scientific Question

Can inflammatory markers adequately predict pneumonia severities with the help of a decision tree? Can partitioning of the data improve the methodology, and is sufficient accuracy of the methodology maintained?


2.3 Example

Four inflammatory markers [C-reactive protein (CRP), erythrocyte sedimentation rate (ESR), leucocyte count (leucos), and fibrinogen] were measured in 150 patients. Based on the chest X-ray, clinical severity was classified as A (mild infection), B (medium severity), or C (severe infection). A major scientific question was to assess which markers were the best predictors of the severity of infection.

CRP      Leucos   Fibrinogen   ESR      X-ray severity
120.00   5.00     11.00        60.00    A

Leucos = leucocyte count (×10^9/l)
Fibrinogen = fibrinogen level (mg/l)
ESR = erythrocyte sedimentation rate (mm)
X-ray severity = chest X-ray pneumonia severity score (A–C = mild to severe)

The first 16 patients are in the above table; the entire data file is in "decision tree" and can be obtained from http://extras.springer.com on the internet. We will start by opening the data file in SPSS.

Command:

click Classify…Tree…Dependent Variable: enter severity score…Independent Variables: enter CRP, Leucos, fibrinogen, ESR…Growing Method: select CHAID…click Output: mark Tree in table format…Criteria: Parent Node type 50, Child Node type 15…click Continue…click OK


The above decision tree is displayed. A fibrinogen level <17 is a 100 % predictor of severity score A (mild disease). Fibrinogen 17–44 gives a 93 % chance of severity B, fibrinogen 44–56 gives an 81 % chance of severity B, and fibrinogen >56 gives a 98 % chance of severity score C. The output also shows that the overall accuracy of the model is 94.7 %, but we have to take into account that this model is somewhat flawed, because all of the data are used twice: first for building the tree, and second for using the tree to make predictions.

2.4 Downloading the Knime Data Miner

In Google enter the term "knime". Click Download and follow the instructions. After completing the pretty easy download procedure, open the knime workbench by clicking the knime welcome screen. The center of the screen displays the workflow editor. Like the canvas in SPSS Modeler, it is empty, and can be used to build a stream of nodes, called a workflow in knime. The node repository is in the left lower angle of the screen, and the nodes can be dragged to the workflow editor simply by left-clicking. The nodes are computer tools for data analysis like visualization and statistical processes. Node description is in the right upper angle of the screen. Before the nodes can be used, they have to be connected with the "file reader" node, and with one another, by arrows drawn, again, simply by left clicking the small triangles attached to the nodes. Right clicking on the file reader enables you to configure from your computer a requested data file…click Browse…and download from the appropriate folder a csv type Excel file. You are set for analysis now.

Note: the above data file cannot be read by the file reader, and must first be saved as a csv type Excel file. For that purpose, command in SPSS: click File…click Save as…in "Save as type": enter Comma Delimited (*.csv)…click Save. For your convenience it has been made available at http://extras.springer.com, entitled "decisiontree".

The underneath decision tree comes up. It is pretty much similar to the above SPSS tree, although it does not use 150 cases but only 45 cases (the test sample). Fibrinogen is again the best predictor. A level <29 mg/l gives a 100 % chance of severity score A, a level 29–57.5 gives a 92.1 % chance of severity B, and a level over 57.5 gives a 100 % chance of severity C.


Right clicking the scorer node gives the accuracy statistics, and shows that the sensitivities of A, B, and C are respectively 100, 93.3, and 90.5 %, and that the overall accuracy is 94 %, slightly less than that of the SPSS tree (94.7 %), but still pretty good. In addition, the current analysis is appropriate, and does not use identical data twice.
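The partitioning idea translates directly to other tools. The sketch below, assuming the decisiontree.csv layout, grows a tree on roughly 70 % of the cases and scores it on the held-out 30 % (45 cases, as in the KNIME example); note that scikit-learn's CART trees are not the CHAID trees used by SPSS, so the exact splits and percentages will differ.

```python
# Train a decision tree on a training partition and score it on a held-out
# test partition, so that no case is used twice ("decisiontree.csv" assumed).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("decisiontree.csv")
X = df[["CRP", "Leucos", "Fibrinogen", "ESR"]]
y = df["severity"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=15).fit(X_train, y_train)

print("training accuracy:", round(tree.score(X_train, y_train), 3))   # optimistic
print("test accuracy    :", round(accuracy_score(y_test, tree.predict(X_test)), 3))
print(confusion_matrix(y_test, tree.predict(X_test)))
```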


2.6 Conclusion

Traditionally, decision trees are used for finding the best predictors of health risks and improvements. However, this method is not entirely appropriate, because a decision tree is built from a data file, and, subsequently, the same data file is applied once more for computing the health risk probabilities from the built tree. Obviously, the accuracy must be close to 100 %, because the test sample is 100 % identical to the sample used for building the tree, and, therefore, this accuracy does not mean too much. A decision tree with partitioning into a training and a test sample provides similar results, but is scientifically less flawed, because each datum is used only once. In spite of this, little accuracy is lost.

Note

More background, theoretical and mathematical information on decision trees and neural networks is given in Machine Learning in Medicine Cookbook One, Chap. 16, pp 97–104, Decision trees for decision analysis, Springer Heidelberg Germany, 2014, and in Machine Learning in Medicine Part One, Chap. 12, pp 145–156, Artificial intelligence, multilayer perceptron modeling, Springer Heidelberg Germany, 2013, both by the same authors.


Part II

Linear Models


Chapter 3

Variance Components for Assessing the Magnitude of Random Effects (40 Patients with Paroxysmal Tachycardias)

3.2 Primary Scientific Question

Can a variance components analysis, by including the random effect in the analysis, reduce the unexplained variance in a study and thus increase the accuracy of the analysis model used?

3.3 Example

Variables

PAT      Treat   Gender   cad
52.00    0.00    0.00     2.00
48.00    0.00    0.00     2.00
43.00    0.00    0.00     1.00
50.00    0.00    0.00     2.00
43.00    0.00    0.00     2.00
44.00    0.00    0.00     1.00
46.00    0.00    0.00     2.00
46.00    0.00    0.00     2.00
43.00    0.00    0.00     1.00
49.00    0.00    0.00     2.00
28.00    1.00    0.00     1.00
35.00    1.00    0.00     2.00

PAT = episodes of paroxysmal atrial tachycardia
Treat = treatment modality (0 = placebo treatment, 1 = active treatment)
Gender = gender (0 = female)
cad = presence of coronary artery disease (1 = no, 2 = yes)

The first 12 of a 40-patient parallel-group study of the treatment of paroxysmal tachycardia, with the number of episodes of PAT as outcome, are given above. The entire data file is in "variancecomponents", and is available at http://extras.springer.com.

We had reason to believe that the presence of coronary artery disease would affect the outcome, and, therefore, used this variable as a random rather than fixed variable. SPSS statistical software was used for data analysis. Start by opening the data file in SPSS.

Command:

Analyze…General Linear Model…Variance Components…Dependent Variable: enter "paroxtachyc"…Fixed Factor(s): enter "treat, gender"…Random Factor(s): enter "corartdisease"…Model: mark Custom…Model: enter "treat, gender, cad"…click Continue…click Options…mark ANOVA…mark Type III…mark Sums of squares…mark Expected mean squares…click Continue…click OK

The output sheets are given underneath. The Variance Estimates table gives the magnitude of the variance due to cad, and that due to residual error (unexplained variance, otherwise called Error). The ratio Var(cad)/[Var(Error) + Var(cad)] gives the proportion of variance in the data due to the random cad effect: 5.844/(28.426 + 5.844) = 0.206 = 20.6 %. This means that 79.4 % instead of 100 % of the error is now unexplained.


The underneath ANOVA table gives the sums of squares and mean squares of the different effects. E.g., the mean square of cad = 139.469, and that of the residual effect = 28.426.

The underneath Expected Mean Squares table gives the results of a special procedure, whereby the variances of best-fit quadratic functions of the variables are minimized to obtain the best unbiased estimate of the variance components. A little mental arithmetic is now required.


EMS (expected mean square) of cad (the random effect)
= 19 × Variance(cad) + Variance(Error),

so Variance(cad) = (139.469 − 28.426)/19 = 5.844 (compare with the results of the above Variance Estimates table).

It can thus be concluded that around 20 % of the uncertainty in the data is caused by the random effect.
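The same arithmetic can be scripted, for instance with statsmodels, assuming a CSV export variancecomponents.csv with columns paroxtachyc, treat, gender, and cad; the coefficient 19 is taken from the Expected Mean Squares output above and is specific to this design.

```python
# ANOVA mean squares, then the cad variance component from its expected
# mean square (file name and column names assumed from this chapter).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("variancecomponents.csv")

model = ols("paroxtachyc ~ C(treat) + C(gender) + C(cad)", data=df).fit()
anova = sm.stats.anova_lm(model, typ=3)

ms_cad   = anova.loc["C(cad)", "sum_sq"] / anova.loc["C(cad)", "df"]
ms_error = anova.loc["Residual", "sum_sq"] / anova.loc["Residual", "df"]

var_cad = (ms_cad - ms_error) / 19        # coefficient from the EMS table above
share = var_cad / (var_cad + ms_error)    # Var(cad) / [Var(Error) + Var(cad)]
print(f"Var(cad) = {var_cad:.3f}, share of total = {share:.1%}")
```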


More background, theoretical and mathematical information on random effects models is given in Machine Learning in Medicine Part Three, Chap. 9, Random effects, pp 81–94, 2013, Springer Heidelberg Germany, from the same authors.


Chapter 4

Ordinal Scaling for Clinical Scores with Inconsistent Intervals (900 Patients with Different Levels of Health)

4.1 General Purpose

Clinical studies often have categories as outcome, like various levels of health or disease. Multinomial regression is suitable for their analysis (see Chap. 4, Machine Learning in Medicine Cookbook Two, Polynomial regression for outcome categories, pp 23–25, Springer Heidelberg Germany, 2014, from the same authors). However, if one or two outcome categories in a study are severely underrepresented, polynomial regression is flawed, and ordinal regression including specific link functions may provide a better fit for the data.
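As a rough illustration of such an ordinal (cumulative link) model, the sketch below uses statsmodels' OrderedModel; the file name healthscores.csv, the ordered outcome health, and the predictor columns are purely illustrative placeholders, not the chapter's actual variables (which are introduced below).

```python
# Rough illustration of a cumulative-link (proportional odds) model.
# File name, outcome, and predictor columns are illustrative placeholders.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.read_csv("healthscores.csv")
y = df["health"].astype("category").cat.as_ordered()   # ordered outcome levels
X = df[["age", "gender", "treatment"]]                 # no intercept column needed

model = OrderedModel(y, X, distr="logit")    # "logit" = proportional odds model;
result = model.fit(method="bfgs", disp=False)          # "probit" also available
print(result.summary())
```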

4.2 Primary Scientific Questions

This chapter assesses how ordinal regression performs in studies where clinical scores have inconsistent intervals.

4.3 Example

In 900 patients the independent predictors for different degrees of feeling healthy were assessed. The predictors included were:
