
Machine Learning in Medicine - Cookbook Two




DOCUMENT INFORMATION

Basic information

Pages: 137
File size: 3.49 MB
Attachment: 59. Machine Learning in Medicine - Cookbook Two.rar (3 MB)


Structure

  • Preface

  • Contents

  • Part I Cluster Models

  • 1 Nearest Neighbors for Classifying New Medicines (2 New and 25 Old Opioids)

    • 1.1…General Purpose

    • 1.2…Specific Scientific Question

    • 1.3…Example

    • 1.4…Conclusion

    • 1.5…Note

  • 2 Predicting High-Risk-Bin Memberships (1,445 Families)

    • 2.1…General Purpose

    • 2.2…Specific Scientific Question

    • 2.3…Example

    • 2.4…Optimal Binning

    • 2.5…Conclusion

    • 2.6…Note

  • 3 Predicting Outlier Memberships (2,000 Patients)

    • 3.1…General Purpose

    • 3.2…Specific Scientific Question

    • 3.3…Example

    • 3.4…Conclusion

    • 3.5…Note

  • Part II Linear Models

  • 4 Polynomial Regression for Outcome Categories (55 Patients)

    • 4.1…General Purpose

    • 4.2…Specific Scientific Question

    • 4.3…The Computer Teaches Itself to Make Predictions

    • 4.4…Conclusion

    • 4.5…Note

  • 5 Automatic Nonparametric Tests for Predictor Categories (60 and 30 Patients)

    • 5.1…General Purpose

    • 5.2…Specific Scientific Questions

    • 5.3…Example 1

    • 5.4…Example 2

    • 5.5…Conclusion

    • 5.6…Note

  • 6 Random Intercept Models for Both Outcome and Predictor Categories (55 Patients)

    • 6.1…General Purpose

    • 6.2…Specific Scientific Question

    • 6.3…Example

    • 6.4…Conclusion

    • 6.5…Note

  • 7 Automatic Regression for Maximizing Linear Relationships (55 Patients)

    • 7.1…General Purpose

    • 7.2…Specific Scientific Question

    • 7.3…Data Example

    • 7.4…The Computer Teaches Itself to Make Predictions

    • 7.5…Conclusion

    • 7.6…Note

  • 8 Simulation Models for Varying Predictors (9,000 Patients)

    • 8.1…General Purpose

    • 8.2…Specific Scientific Question

    • 8.3…Conclusion

    • 8.4…Note

  • 9 Generalized Linear Mixed Models for Outcome Prediction from Mixed Data (20 Patients)

    • 9.1…General Purpose

    • 9.2…Specific Scientific Question

    • 9.3…Example

    • 9.4…Conclusion

    • 9.5…Note

  • 10 Two-stage Least Squares (35 Patients)

    • 10.1…General Purpose

    • 10.2…Primary Scientific Question

    • 10.3…Example

    • 10.4…Conclusion

    • 10.5…Note

  • 11 Autoregressive Models for Longitudinal Data (120 Mean Monthly Records of a Population of Diabetic Patients)

    • 11.1…General Purpose

    • 11.2…Specific Scientific Question

    • 11.3…Example

    • 11.4…Conclusion

    • 11.5…Note

  • Part III Rules Models

  • 12 Item Response Modeling for Analyzing Quality of Life with Better Precision (1,000 Patients)

    • 12.1…General Purpose

    • 12.2…Primary Scientific Question

    • 12.3…Example

    • 12.4…Conclusion

    • 12.5…Note

  • 13 Survival Studies with Varying Risks of Dying (50 and 60 Patients)

    • 13.1…General Purpose

    • 13.2…Primary Scientific Questions

    • 13.3…Examples

      • 13.3.1 Cox Regression with a Time-Dependent Predictor

      • 13.3.2 Cox Regression with a Segmented Time-Dependent Predictor

    • 13.4…Conclusion

    • 13.5…Note

  • 14 Fuzzy Logic for Improved Precision of Dose-Response Data

    • 14.1…General Purpose

    • 14.2…Specific Scientific Question

    • 14.3…Example

    • 14.4…Conclusion

    • 14.5…Note

  • 15 Automatic Data Mining for the Best Treatment of a Disease (90 Patients)

    • 15.1…General Purpose

    • 15.2…Specific Scientific Question

    • 15.3…Example

    • 15.4…Step 1 Open SPSS Modeler

    • 15.5…Step 2 The Distribution Node

    • 15.6…Step 3 The Data Audit Node

    • 15.7…Step 4 The Plot Node

    • 15.8…Step 5 The Web Node

    • 15.9…Step 6 The Type and C5.0 Nodes

    • 15.10…Step 7 The Output Node

    • 15.11…Conclusion

    • 15.12…Note

  • 16 Pareto Charts for Identifying the Main Factors of Multifactorial Outcomes

    • 16.1…General Purpose

    • 16.2…Primary Scientific Question

    • 16.3…Example

    • 16.4…Conclusion

    • 16.5…Note

  • 17 Radial Basis Neural Networks for Multidimensional Gaussian Data (90 Persons)

    • 17.1…General Purpose

    • 17.2…Specific Scientific Question

    • 17.3…Example

    • 17.4…The Computer Teaches Itself to Make Predictions

    • 17.5…Conclusion

    • 17.6…Note

  • 18 Automatic Modeling of Drug Efficacy Prediction (250 Patients)

    • 18.1…General Purpose

    • 18.2…Specific Scientific Question

    • 18.3…Example

    • 18.4…Step 1: Open SPSS Modeler (14.2)

    • 18.5…Step 2: The Statistics File Node

    • 18.6…Step 3: The Type Node

    • 18.7…Step 4: The Auto Numeric Node

    • 18.8…Step 5: The Expert Node

    • 18.9…Step 6: The Settings Tab

    • 18.10…Step 7: The Analysis Node

    • 18.11…Conclusion

    • 18.12…Note

  • 19 Automatic Modeling for Clinical Event Prediction (200 Patients)

    • 19.1…General Purpose

    • 19.2…Specific Scientific Question

    • 19.3…Example

    • 19.4…Step 1: Open SPSS Modeler (14.2)

    • 19.5…Step 2: The Statistics File Node

    • 19.6…Step 3: The Type Node

    • 19.7…Step 4: The Auto Classifier Node

    • 19.8…Step 5: The Expert Tab

    • 19.9…Step 6: The Settings Tab

    • 19.10…Step 7: The Analysis Node

    • 19.11…Conclusion

    • 19.12…Note

  • 20 Automatic Newton Modeling in Clinical Pharmacology (15 Alfentanil Dosages, 15 Quinidine Time-Concentration Relationships)

    • 20.1…General Purpose

    • 20.2…Specific Scientific Question

    • 20.3…Examples

      • 20.3.1 Dose-Effectiveness Study

      • 20.3.2 Time-Concentration Study

    • 20.4…Conclusion

    • 20.5…Note

  • Index

Content

Nearest Neighbors for Classifying New Medicines

General Purpose

The nearest neighbor methodology, with its historical roots in data imputation for demographic files, is being evaluated for its potential application in classifying new medicines.

Specific Scientific Question

For most diseases a whole class of drugs rather than a single compound is available. Nearest neighbor methods can be used for identifying the place of a new drug within its class.

Example

Two newly developed opioid compounds are evaluated for their similarities to standard opioids, aiming to identify their potential roles in therapeutic regimens. This assessment includes a detailed analysis of the characteristics of 25 established opioids alongside the two new compounds.


The analysis of various drugs reveals their scores across multiple categories. Buprenorphine scores 7.00 in analgesia and 4.00 in antitussive effects, with an elimination time of 9.00 hours. Butorphanol, with an analgesia score of 7.00, has a lower abuse score of 2.70 and an elimination time of 4.00 hours. Codeine shows a balanced profile with a 5.00 analgesia score and a higher antitussive score of 6.00, alongside a 7.00-hour elimination time. Heroin stands out with an 8.00 score in analgesia and an alarming abuse score of 10.00. Hydromorphone and Levorphanol both achieve an 8.00 analgesia score, while Methadone excels with a 9.00 analgesia score and a notable elimination time of 25.00 hours. Morphine maintains a consistent profile with an 8.00 analgesia score and a 5.00-hour elimination time. Oxycodone and Oxymorphine show moderate scores in various categories, while Nalbuphine and Pentazocine have lower abuse scores. Naloxone and Naltrexone, while effective in certain areas, exhibit low analgesia scores of 1.00. Fentanyl and Alfentanil present balanced scores, with Fentanyl achieving a 6.00 analgesia score and a 5.00-hour elimination time. This overview highlights the varying efficacy and safety profiles of these medications.


The table presents a comparative analysis of various drugs based on several key scores: analgesia, antitussive, constipation, respiratory depression, and abuse liability. Meptazinol scores 4.00 for analgesia and 5.00 for respiratory effects, with an elimination time of 2.00 hours. Norpropoxyphene shows higher scores, achieving 8.00 in analgesia and 7.00 in abuse liability, with a 4.00-hour elimination time. Sufentanil also ranks high with a 7.00 analgesia score and an 8.00 abuse score, having a longer duration of 5.00 hours. Newdrug1 presents a balanced profile with a 5.00 analgesia score but a lower constipation score of 4.00, while Newdrug2 excels with an 8.00 analgesia score and a 16.00-hour duration. These metrics are crucial for evaluating the therapeutic potential and safety profiles of these medications.

The data file titled "Chap1nearestneighbor" is available at extras.springer.com and is analyzed using SPSS statistical software. To begin, open the data file; it contains eight variables related to the drugs, and a ninth variable, "partition", must be added with the value 1 for the opioids 1-25 and 0 for the two new compounds (cases 26 and 27).

To conduct a nearest neighbor analysis, begin by entering the variable "drugsname" in the Target section. Next, enter the variables ranging from "analgesia" to "duration of analgesia" in the Features section. Then click Partitions, select the option to use a variable for assigning cases, and enter the partition variable.
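For readers who prefer a scriptable analogue, the same three-nearest-neighbor lookup can be sketched in Python with scikit-learn. The file and column names below are assumptions mirroring the SPSS file, not an actual export.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Assumed CSV export of "Chap1nearestneighbor": one row per opioid,
# seven feature scores, a drug name, and the partition flag.
df = pd.read_csv("chap1nearestneighbor.csv")
features = ["analgesia", "antitussive", "constipation",
            "respiratorydepression", "abuseliability",
            "eliminationtime", "durationofanalgesia"]

known = df[df["partition"] == 1]   # the 25 established opioids
new = df[df["partition"] == 0]     # the 2 new compounds (cases 26 and 27)

# Three nearest neighbors, matching the SPSS default.
nn = NearestNeighbors(n_neighbors=3).fit(known[features])
dist, idx = nn.kneighbors(new[features])

for row, (d, i) in enumerate(zip(dist, idx)):
    print(new["drugsname"].iloc[row], "->",
          known["drugsname"].iloc[i].tolist(), d.round(2))
```

Unlike the SPSS procedure, this sketch does not standardize the features first; add a StandardScaler step if the scales differ markedly.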


The figure illustrates the positioning of the two new compounds (represented as small triangles) in relation to the standard opioids, with lines connecting them to their three nearest neighbors. In SPSS, users can enhance the visualization by double-clicking the graph to access the "model viewer", allowing interactive rotation to better observe the distances. While SPSS defaults to displaying three nearest neighbors, this setting can be adjusted as needed. The compounds are listed in alphabetical order, and the plot displays only three of the seven variables.

In the initial figure, clicking on one of the small triangles reveals an auxiliary view that provides detailed analysis. The upper left graph indicates that opioids 21, 3, and 23 exhibit the best average nearest neighbor records for case 26 (new drug 1). Additionally, the seven accompanying figures illustrate the distances between these three opioids and case 26 for each of the seven features (otherwise called predictor variables).

Clicking on the triangle for case 27 (newdrug 2) reveals its connections with nearby drugs, as illustrated in the main view for drug 2. Repeating the same action opens the auxiliary view, which shows that opioids 3, 1, and 11 exhibit the best average nearest neighbor records for case 27 (new drug 2). Additionally, the seven figures displayed alongside and beneath this main figure illustrate the distances between these three cases and case 27 across each of the seven predictor variables. An auxiliary view is also provided below.

Conclusion

The nearest neighbor methodology effectively identifies the positions of new drugs within their respective classes. For instance, by comparing newly developed opioid compounds with established opioids, it becomes possible to assess their potential roles in therapeutic regimens.

Note

The nearest neighbor cluster methodology has a rich history, originally developed for imputing missing data in demographic datasets. This technique, detailed in "Statistics Applied to Clinical Studies, 5th Edition", has proven effective in addressing gaps in data, particularly in clinical research contexts.


Predicting High-Risk-Bin Memberships

General Purpose

Optimal bins transform continuous predictor variables into well-defined categories, enhancing predictive accuracy when assessing families at high risk of bank loan defaults. This method also aids in establishing health risk thresholds for individual families by analyzing their specific characteristics.

Specific Scientific Question

Can optimal binning also be applied for other medical purposes, e.g., for finding high risk cut-offs for overweight children in particular families?

Example

A study analyzed a data file of 1,445 families to identify optimal cut-off values for unhealthy lifestyle indicators, aiming to enhance the distinction between low- and high-risk overweight children. These identified cut-off values were then used to evaluate the risk profiles of individual families in the future.


[Table: Var 1-Var 5 values for the first 10 families of the learning data file]

Var 1 = fruitvegetables (times per week)
Var 2 = unhealthysnacks (times per week)
Var 3 = fastfoodmeal (times per week)
Var 4 = physicalactivities (times per week)

Only the first 10 families of the original learning data file are given; the entire data file is entitled "chap2optimalbinning" and can be found at extras.springer.com.

Optimal Binning

SPSS 19.0 is used for the analysis. Start by opening the data file.

To bin the variables fruitvegetables, unhealthysnacks, fastfoodmeal, and physicalactivities optimally against the outcome overweight children, first transform the variables into bins and optimize them accordingly. Once the bins are established, display key elements such as endpoints, descriptive statistics, and model entropy. Save the binned data by creating new variables, and store the binning rules in a syntax file: browse to the appropriate folder, name the file (e.g., "exportoptimalbinning"), and save your work.
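SPSS's optimal binning uses a supervised, entropy-based (MDLP) procedure. As a loose analogue, the sketch below uses a shallow decision tree per predictor to find entropy-based cut-points; the file and column names are assumptions, and the tree settings are illustrative rather than a reimplementation of MDLP.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("chap2optimalbinning.csv")   # assumed CSV export
predictors = ["fruitvegetables", "unhealthysnacks",
              "fastfoodmeal", "physicalactivities"]

cutoffs = {}
for col in predictors:
    # A shallow tree guided by the outcome yields entropy-based
    # cut-points, loosely analogous to MDLP optimal binning.
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=2,
                                  min_samples_leaf=50, random_state=1)
    tree.fit(df[[col]], df["overweightchildren"])
    t = tree.tree_
    cutoffs[col] = sorted(t.threshold[t.feature == 0].tolist())

print(cutoffs)   # e.g. {'fruitvegetables': [14.0], ...}
```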

[Binning tables: for each predictor (fruitvegetables/wk, unhealthysnacks/wk, fastfoodmeal/wk, physicalactivities/wk) the output lists each bin with its lower and upper end points and the number of cases by level of overweight children (No, Yes, Total). Each bin is computed as lower ≤ value < upper; the first and last bins are unbounded.]

The table presented in the output sheets highlights the high-risk cut-offs for overweight children based on the four predictive factors. For instance, among the 1,142 families consuming fewer than 14 units of fruits and vegetables per week, the proportion of overweight children is significantly higher at 30% (340 out of 1,142), compared to just 10% (29 out of 303) in families exceeding 14 units weekly. Similar high-risk thresholds are identified for unhealthy snacks (fewer than 12, 12-19, and over 19 per week), fast food meals (fewer than 2 and over 2 per week), and physical activity levels (fewer than 8 and over 8 per week). These cut-offs will serve as guidelines for the eleven future families below.

Var 1 = fruitvegetables (times per week)
Var 2 = unhealthysnacks (times per week)
Var 3 = fastfoodmeal (times per week)
Var 4 = physicalactivities (times per week)

To compute the predicted bins for the future families, use the saved syntax file named "exportoptimalbinning.sps". Create a new data file, e.g., "chap2optimalbinning2", by entering the values above and saving the file in the correct folder on your computer. Next, open the "exportoptimalbinning.sps" syntax file, subsequently click File…click Open…click Data…find the data file entitled "optimalbinning2"…click Open…click "exportoptimalbinning.sps" from the file palette at the bottom of the screen…click Run…click All.

When returning to the Data View of "chap2optimalbinning2", we will find the overview below of all of the bins selected for our eleven future families.

[Table: entered values (Fruit, Snacks, Fastfood, Physical) followed by the assigned bins (Fruit, Snacks, Fastfood, Physical) for the eleven future families]

This overview is relevant, since families in high risk bins would particularly qualify for counseling.
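In Python, the same scoring step can be sketched by applying the learned cut-offs to the new records with pd.cut. The bin edges below are placeholders standing in for whatever values the binning step produced.

```python
import numpy as np
import pandas as pd

# Assumed cut-offs, mirroring the thresholds reported above.
edges = {"fruitvegetables":    [-np.inf, 14, np.inf],
         "unhealthysnacks":    [-np.inf, 12, 19, np.inf],
         "fastfoodmeal":       [-np.inf, 2, np.inf],
         "physicalactivities": [-np.inf, 8, np.inf]}

future = pd.read_csv("chap2optimalbinning2.csv")   # the eleven new families
for col, e in edges.items():
    future[col + "_bin"] = pd.cut(future[col], bins=e, labels=False)

print(future)   # families falling in high-risk bins qualify for counseling
```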


Conclusion

Optimal bins effectively categorize continuous predictor variables for enhanced prediction accuracy, and SPSS statistical software can create a syntax file (SPS file) for applying the risk cut-offs to future families. This approach enables the easy identification of families at high risk of overweight children. Decision tree nodes serve comparable functions, but focus on subgroups of cases rather than multiple bins for individual cases, as discussed in Machine Learning in Medicine Cookbook One, Chap. 16, pp 97-104, Springer Heidelberg Germany, 2014.

Note

For a comprehensive understanding of optimal binning, refer to Machine Learning in Medicine Part Three, Chap. 5 (pp 37-48), Springer Heidelberg Germany, 2013, and Machine Learning in Medicine Cookbook One, Chap. 19 (pp 101-106), Springer Heidelberg Germany, 2014, both from the same authors. These works provide essential theoretical and mathematical background on the optimal binning process.

Predicting Outlier Memberships

General Purpose

In large data files, recognizing outliers requires advanced techniques beyond conventional data visualizations and regression methods. This chapter explores the effectiveness of BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) clustering in predicting outliers among future patients based on a known population.

Specific Scientific Question

Is the XML (eXtensible Markup Language) file from a 2,000-patient sample capable of making predictions about cluster memberships and outlierships in future patients from the target population?

Example

A study involving 2,000 hospital admissions identified 576 potentially iatrogenic cases. To analyze the data, a two-step BIRCH cluster analysis will be conducted based on patient age and the number of co-medications, utilizing SPSS version 19 or higher. The data for the first 10 patients are provided below, while the complete dataset can be accessed at extras.springer.com under the title "chap3outlierdetection".


Age Gender Admis Duration Mort Iatro Comorb Comed

1920.00 2.00 23.00 8.00 0.00 1.00 3.00 3.00

admis = admission indication code
duration = days of admission
mort = mortality
iatro = iatrogenic admission
comorb = number of comorbidities
comed = number of comedications

To begin the analysis, open the file and navigate to the Transform menu, then select Random Number Generators and set the Starting Point to a Fixed Value of 2,000,000. After clicking OK, proceed to Analyze and choose Two Step Cluster Analysis, entering age and co-medications as Continuous Variables. For the Distance Measure select Euclidean, and mark Schwarz's Bayesian Criterion as the Clustering Criterion. In the Options menu, enable noise handling with a percentage of 25 and assume standardized values for age and co-medications before clicking Continue. Mark Pivot tables and Charts and tables in the Model Viewer, then create a Cluster membership variable in the Working Data File. In the XML Files section, export the final model by browsing to the desired folder on your computer, naming the file (e.g., "exportanomalydetection"), and clicking Save, followed by Continue and OK.
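A rough Python analogue is sketched below with scikit-learn's Birch, which implements the clustering algorithm underlying SPSS's TwoStep procedure. Birch has no built-in noise-handling option, so flagging the 25 % most distant cases as outliers is an assumption made here to mimic the SPSS setting; file and column names are likewise assumed.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.cluster import Birch
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("chap3outlierdetection.csv")          # assumed CSV export
scaler = StandardScaler().fit(df[["age", "comed"]])    # standardized values
X = scaler.transform(df[["age", "comed"]])

birch = Birch(n_clusters=3).fit(X)                     # BIRCH clustering
df["cluster"] = birch.predict(X)

# Mimic the 25 % noise-handling option: cases farthest from any
# subcluster center are labeled -1, the outlier code SPSS uses.
d = cdist(X, birch.subcluster_centers_).min(axis=1)
df.loc[d > np.quantile(d, 0.75), "cluster"] = -1
```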

In the output sheets the underneath distribution of clusters is given.

As detailed in Machine Learning in Medicine Part Two, Chap. 10, Anomaly detection, a significant portion of the outliers consisted of patients across various ages with numerous co-medications. Upon accessing the Data View screen in SPSS, a new variable named "TSC_5980" has been generated, indicating the cluster memberships of the patients; those assigned the value -1 are identified as outliers. Using the Scoring Wizard along with the exported XML file titled "exportanomalydetection", predictions can be made regarding future patients' cluster memberships based on their age and number of co-medications, in line with the developed XML model.

1928.00 4.00

comed = number of co-medications

Enter the above data in a novel data file and command:

To begin, navigate to Utilities and select Scoring Wizard. Then click Browse to locate the folder containing the XML file named "exportanomalydetection". After selecting the file, click Next in the Scoring Wizard and mark the Predicted Value option to view your results.

Predicted Value = predicted cluster membership

The SPSS data file has successfully generated the new variable as requested, revealing that one patient belongs to cluster 1, two patients are categorized in cluster 3, and four patients are identified as part of the outlier cluster.
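Continuing the Python sketch above, the save-and-score workflow that SPSS implements with an exported XML file and the Scoring Wizard is commonly mirrored with joblib; the file name and the future patient's values are illustrative only.

```python
import joblib
import pandas as pd

# Persist the fitted scaler and Birch model from the sketch above.
joblib.dump({"scaler": scaler, "model": birch},
            "exportanomalydetection.joblib")

# Later, score future patients from age and number of co-medications.
bundle = joblib.load("exportanomalydetection.joblib")
future = pd.DataFrame({"age": [28.0], "comed": [4.0]})  # assumed values
Xf = bundle["scaler"].transform(future)
print(bundle["model"].predict(Xf))   # predicted cluster membership
```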

Conclusion

An XML (eXtensible Markup Language) file from a 2,000-patient sample is capable of making predictions about cluster memberships and outlierships in future patients from the same target population.

Linear Models

Polynomial Regression for Outcome Categories

General Purpose

To assess whether polynomial regression can be trained to make predictions about (1) patients being in a category and (2) the probability of it.

Specific Scientific Question

Patients from various hospital departments and age groups are evaluated for incidents of falling out of bed, categorized as 0 for no fall, 1 for a fall without injury, and 2 for a fall with injury. The fall-out-of-bed categories serve as the outcome variable, while department and age act as predictors. These data can be used to train a predictive model that forecasts the likelihood of future patients fitting into specific categories based on their characteristics.

Only the first 10 patients are given; the entire data file is entitled "chap4categoriesasoutcome" and can be found at extras.springer.com.


The Computer Teaches Itself to Make Predictions

SPSS versions 18 and newer are capable of generating an XML (eXtensible Markup Language) file for the prediction model based on the provided data. To begin, we will open the specified data file.

To analyze fall out of bed using multinomial logistic regression, start by selecting Transform and then Random Number Generators. Set the Starting Point to a Fixed Value of 2,000,000 and click OK. Next, navigate to Analyze, choose Regression, and select Multinomial Logistic Regression. Designate falloutofbed as the dependent variable, with department as the factor and age as the covariate. Under Save, mark Estimated response probability, Predicted category, Predicted category probability, and Actual category probability. Finally, click Browse to select the appropriate folder on your computer, enter "exportcategoriesasoutcome" in the File name field, click Save, click Continue, and click OK.

Fall with/out injury (a) | B | Std. error
(Department = 1.00) | 0 (b) | 0
a. The reference category is: 2
b. This parameter is set to zero because it is redundant

The analysis reveals the key independent predictors of fall out of bed: for each additional year of age there is a 0.943 decrease in the likelihood of "no fall out of bed" compared to "fall out of bed with injury", and department 0.00 shows a 0.143 reduction in fall out of bed with injury versus without injury, with p-values of 0.045 and 0.030, respectively. Furthermore, upon reviewing the main data view, SPSS has generated six novel variables for each patient, listed below.

1 EST1_1 estimated response probability (probability of the category 0 for each patient)

2 EST2_1 estimated response probability (probability of the category 1 for each patient)

3 EST3_1 estimated response probability (probability of the category 2 for each patient)

4 PRE_1 predicted category (category with highest probability score)

5 PCP_1 predicted category probability (the highest probability score predicted by the model)

6 ACP_1 actual category probability (the highest probability computed from the data).
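The same model can be sketched in Python with scikit-learn's multinomial logistic regression; column names are assumed, and department is entered here as a numeric covariate, whereas SPSS treats it as a factor.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("chap4categoriesasoutcome.csv")   # assumed CSV export
X = df[["department", "age"]]
y = df["falloutofbed"]                             # categories 0, 1, 2

model = LogisticRegression(multi_class="multinomial",
                           max_iter=1000).fit(X, y)

probs = model.predict_proba(X)    # analogues of EST1_1..EST3_1
pred = model.predict(X)           # PRE_1: best-fit category
pcp = probs.max(axis=1)           # PCP_1: probability of that category
```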

Using the Scoring Wizard and the XML file named "exportcategoriesasoutcome", we can predict the most likely category and its corresponding probability for future patients based on their department and age. The details for 12 new patients are provided for analysis.

Enter the above data in a novel data file and command:

To use the Scoring Wizard, first navigate to the Utilities menu and select Scoring Wizard. Then click Browse to open the folder containing the XML file named "exportcategoriesasoutcome". Select this file and click Next in the Scoring Wizard. Mark the options Predicted category and Probability of it before clicking Finish to complete the process.

Department Age Probability of being in predicted category Predicted category

The SPSS data file now includes the two new variables as requested. The first patient, aged 73 and from department 0.00, has a 48% probability of experiencing a fall out of bed without injury, while his/her chances of falling into either of the other two categories are less than 48%.

Conclusion

Multinomial or polynomial logistic regression can be readily trained to make predictions in future patients about their best fit category and the probability of being in it.

Note

More background theoretical and mathematical information on analyses with categories as outcome is available in Machine Learning in Medicine Part Two, Chap. 10, Anomaly detection, pp 93-103, Springer Heidelberg Germany, 2013.


Automatic Nonparametric Tests for Predictor Categories

General Purpose

Unlike continuous data, categorical data cannot be analyzed as a single stepping function. To perform regression analysis with categorical variables, they must be recoded into multiple binary (dummy) variables. Additionally, when the Gaussian distribution of the outcomes is uncertain, automatic non-parametric testing offers a modern and convenient alternative.

Specific Scientific Questions

1. Does race have an effect on physical strength (the variable race has a categorical rather than a linear pattern)?

2. Are the hours of sleep / levels of side effects different between categories treated with different sleeping pills?

Example 1

The study evaluated the impact of race and age on physical strength, measured on a scale of 0 to 100, among 60 participants from diverse racial backgrounds, including Hispanics, Blacks, Asians, and Whites. The detailed findings are documented in the first three columns of the data file titled "chap5categoriesaspredictor".


Patient number Physical strength Race Age

Only the first 10 patients are displayed above. The entire data file is at www.springer.com. For the analysis we will use multiple linear regression. Command:

Analyze….Regression….Linear….Dependent: physical strength score…. Independent: race, age,….OK.

Model | Unstandardized coefficients (B, Std. error) | Standardized coefficients (Beta) | t | Sig.
Race | -0.330 | 1.505 | -0.027 | -0.219 | 0.827
Age | -0.356 | 0.116 | -0.383 | -3.071 | 0.003
a Dependent variable: Strengthscore

The analysis indicates that age is a significant predictor, while race is not. However, this approach is inadequate, because race is treated as a stepping function from 1 to 4, which does not align with the linear regression assumption of a linear relationship between the outcome variable and the predictors. A more accurate approach is to recode the race variable into categorical (dummy) variables. The accompanying data overview shows in 4 columns how this is done manually.


We subsequently use linear regression again, but now with race analyzed categorically.

Command: click Transform…click Random Number Generators…click Set Starting Point…click Fixed Value (2000000)…click OK…click Analyze…Regression…Linear…Dependent: physical strength score…Independent: race 1, race 3, race 4, and age…click Save…mark Unstandardized…mark Export model information to XML file…click Browse…enter the file name "exportcategoriesaspredictor"…click Continue…click OK.

[Coefficients table: Model, Unstandardized coefficients (B, Std. error), Standardized coefficients (Beta), t, Sig.]

The above table is in the output. It shows that race 1, 3, and 4 are significant predictors of physical strength compared with race 2. The results can be interpreted as follows.

The following regression equation is used: y = a + b1x1 + b2x2 + b3x3 + b4x4, where a = intercept, b1 = regression coefficient for age, b2 = hispanics, …

If an individual is black (race 2), then x2, x3, and x4 will turn into 0, and the regression equation becomes y = a + b1x1.

So, e.g., the best predicted physical strength score of a white male of 25 years of age would equal y = ….270 + 0.20 × 25 − 8.811 × 1 = ….459.

Obviously, all of the races are negative predictors of physical strength, with the blacks scoring highest and the asians lowest. All of these results are adjusted for age.
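The dummy-coding-plus-regression workflow can be sketched in Python with pandas and statsmodels; the file and column names are assumptions mirroring the SPSS file.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("chap5categoriesaspredictor.csv")   # assumed CSV export

# Recode the stepping variable race (1-4) into dummy variables,
# keeping race 2 (blacks) as the reference category.
dummies = pd.get_dummies(df["race"].astype(int), prefix="race", dtype=float)
X = pd.concat([df["age"], dummies[["race_1", "race_3", "race_4"]]], axis=1)
X = sm.add_constant(X)                               # the intercept a

# y = a + b1*age + b2*race1 + b3*race3 + b4*race4
fit = sm.OLS(df["strengthscore"], X).fit()
print(fit.summary())
print(fit.predict(X))    # predicted strength scores (PRE_1 analogue)
```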

Upon revisiting the data file page, it is evident that SPSS has added a new variable named "PRE_1", which contains the predicted individual strength scores derived from the recoded linear model. These predicted scores closely resemble the actually measured values.

Using the Scoring Wizard and the exported XML file named "exportcategoriesaspredictor," we can now predict the strength scores of future patients based on their known race and age.

First, recode the stepping variable race into 4 categorical variables.


Race Age Race1 Race3 Race4

Then command: click Utilities….click Scoring Wizard….click Browse….click Select….Folder: enter the exportcategoriesaspredictor.xml file….click Select….in Scoring Wizard click Next….click Finish.

Race Age Race1 Race3 Race4 Predicted strength score

The above data file now gives the predicted strength scores of the 8 future patients as computed with the help of the XML file.

Logistic regression in SPSS allows for the convenient analysis of binary outcome variables without the need to manually convert quantitative estimators into categorical ones. By utilizing standard commands, researchers can effectively analyze categorical covariates within this framework.

To conduct a binary logistic regression analysis, begin by entering your dependent variable and independent variables. Next, open the dialog box labeled "Categorical Variables", select the relevant categorical variable, and transfer it to the Categorical Variables box. Finally, click Continue and then OK to obtain the results.
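In Python, the counterpart of the Categorical Variables dialog is a formula term that expands the covariate into dummies automatically; the file and column names below are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("examplebinary.csv")   # hypothetical data file

# C(race) dummy-codes the covariate, as SPSS's "Categorical
# Variables" dialog does for binary logistic regression.
fit = smf.logit("outcome ~ C(race) + age", data=df).fit()
print(fit.summary())
```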

Example 2

In cases where Gaussian distributions of the outcomes are uncertain, automatic non-parametric testing offers a modern and effective alternative. In a study involving three parallel groups, participants were treated with different sleeping pills, and both the total hours of sleep and the side effect scores were evaluated.

Group Efficacy Gender Comorbidity Side effect score

The first ten patients are presented above, while the complete data file can be accessed at extras.springer.com under the title "chap5categoriesaspredictor2". For the analysis, use the automatic non-parametric tests available in SPSS version 18 and up. Begin by opening the data file.

Analyze….Nonparametric Tests….Independent Samples….click Objective….mark Automatically compare distributions across groups….click Fields….in Test fields: enter "hours of sleep" and "side effect score"….in Groups: enter "group"….click Settings….Choose Tests….mark "Automatically choose the tests based on the data"….click Run.
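An analogous analysis in Python uses a Kruskal-Wallis test per outcome, followed by pairwise Mann-Whitney tests like the Model Viewer's pairwise comparisons; column names are assumed, and a multiplicity correction should be applied in practice.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("chap5categoriesaspredictor2.csv")   # assumed CSV export

for outcome in ["hoursofsleep", "sideeffectscore"]:   # assumed column names
    samples = [g[outcome].values for _, g in df.groupby("group")]
    h, p = stats.kruskal(*samples)                    # overall comparison
    print(outcome, "Kruskal-Wallis p =", round(p, 4))

    labels = sorted(df["group"].unique())
    for i, a in enumerate(labels):                    # pairwise comparisons
        for b in labels[i + 1:]:
            u, pp = stats.mannwhitneyu(df.loc[df["group"] == a, outcome],
                                       df.loc[df["group"] == b, outcome])
            print(f"  group {a} vs {b}: p = {pp:.4f}")
```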

The interactive output sheets display a table indicating significant differences in both hours of sleep and side effect scores across the three treatment categories. A traditional multivariate analysis of variance (MANOVA) could analyze these data with treatment category as the predictor and both outcomes as dependent variables, but the assumption of normal distributions is questionable here. Additionally, the potential correlation between the two outcome measures may compromise the sensitivity of the MANOVA results.

Automatic nonparametric tests, like discriminant analysis, assume orthogonality between the two outcomes, allowing them to be analyzed without considering their correlation. This simplifies the evaluation, which makes the method a valuable tool in various applications, including medical research.

By clicking the table you will obtain an interactive set of views of various details of the analysis, entitled the Model Viewer.

The box-and-whiskers graphs illustrate the medians, quartiles, and ranges of the hours of sleep across the three treatment groups. Group 0 appears to outperform the other two groups, but which of the differences are statistically significant remains unclear.

The box-and-whiskers graph of the side effect scores likewise reveals varying performance among the groups. However, it remains unclear whether the differences between scores of 0 versus 1, 1 versus 2, and 0 versus 2 are statistically significant.

In the auxiliary view of the Model Viewer, users can access various options, including Pairwise Comparisons. This feature reveals a distance network in which yellow lines indicate statistically significant differences and black lines insignificant ones. For the hours of sleep, the differences between group 1 and group 0 and between group 2 and group 0 are significant, and 1 versus 2 is not: group 0 had significantly more hours of sleep than the other two groups, with p = 0.044 and 0.0001.

For the side effect scores, the analysis reveals a statistically significant difference between group 2 and group 0, with group 0 exhibiting the better score (p = 0.035), while the comparisons 1 versus 0 and 1 versus 2 are not significantly different.


Conclusion

Before analyzing predictor variables in a regression model, categorical variables must be recoded rather than treated as linear. Furthermore, when the Gaussian distributions of the outcomes are uncertain, automatic non-parametric testing is a suitable and convenient alternative.

Note

For a deeper understanding of categories as predictors, refer to SPSS for Starters Part Two, Chap. 5, Categorical data (pp 21-24), and Statistics Applied to Clinical Studies, 5th Edition, Chap. 21, Races as a categorical variable (pp 244-252), both from the same authors, Springer Heidelberg Germany, 2012.

Random Intercept Models for Both Outcome and Predictor Categories

General Purpose

Categories play a significant role in medical research, encompassing aspects such as age groups, income levels, education, drug dosages, diagnosis classifications, and disease severity. Traditional statistical methods often struggle with categorical data, typically requiring binary or continuous variables. Polynomial regression can assess categories in outcomes, and automatic nonparametric tests can handle categorical predictors. For scenarios involving multiple categories, or categories in both the outcome and the predictors, random intercept models may provide more sensitive testing. These models allow slightly different intercepts (a values) for each predictor category, leading to a better fit of the outcome category.

We should add that, instead of the above linear equation, even better results were obtained with log-linear equations (log = natural logarithm): log y = a + b1x1 + b2x2 + …

Specific Scientific Question

A study examined the impact of three hospital departments (no surgery, minimal surgery, extensive surgery) and three patient age groups (young, middle-aged, elderly) on the risk of falling out of bed, categorized as no fall, fall without injury, and fall with injury. The research aimed to determine whether these predictor categories significantly influence the likelihood of falling out of bed, with or without injury, and assessed whether incorporating a random intercept improved the statistical outcomes.


Example

Department Falloutofbed Agecat Patient_id

Variable 1: department = department class (0 = no surgery, 1 = little surgery, 2 = lot of surgery)

Variable 2: falloutofbed = risk of falling out of bed (0 = fall out of bed no, 1 = yes but no injury, 2 = yes and injury)

Variable 3: agecat = patient age classes (young, middle, old)

Variable 4: patient_id = patient identification

Only the first 10 of the 55 patients are presented above; the complete dataset is available at extras.springer.com under the title "Chap6randomintercept.sav". The analysis requires SPSS version 20 or higher, and the initial step is a fixed intercept log-linear analysis.

To analyze the data, begin by clicking Analyze and selecting the Data Structure tab. Drag patient_id to the Subjects area on the canvas. Proceed to Fields and Effects and, under Target, choose "fall with/out injury". In the Fixed Effects section, drag agecat and department to the Effect Builder, making sure "Include intercept" is marked. Finally, click Run to execute the analysis.

The results below show that the various regression coefficients as well as the overall correlation coefficients between the predictors and the outcome are, generally, statistically significant.


Subsequently, a random intercept analysis is performed.

To perform it, begin by selecting the Data Structure tab and dragging patient_id to the Subjects section on the canvas. Next, navigate to Fields and Effects, choose Target, and select "fall with/out injury". In the Fixed Effects section, drag agecat and department to the Effect Builder and include the intercept. For Random Effects, add a block, include the intercept, and select patient_id as the Subject combination. Click OK, then open Model Options to save the fields, marking both Predicted Value and Predicted Probability. Finally, click Save and Run to execute the model.
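As a rough Python analogue, a random-intercept model can be sketched with statsmodels' MixedLM. Note the simplification: the ordinal fall score (0/1/2) is treated linearly here purely for illustration, whereas SPSS fits a generalized (multinomial) mixed model; the file name is an assumed export of the .sav file.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("chap6randomintercept.csv")   # assumed CSV export

# Fixed effects for the categorical predictors, plus one random
# intercept per patient_id (the Subject combination above).
model = smf.mixedlm("falloutofbed ~ C(agecat) + C(department)",
                    data=df, groups=df["patient_id"])
print(model.fit().summary())
```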

The results of the random intercept model demonstrate improved statistical significance, with p-values of 0.007 and 0.013 for age, and 0.001 and 0.004 for department. Additionally, the regression coefficients reveal p-values of 0.003 and 0.005 for the age class comparison of 0 versus 2, while age class 1 versus 2 shows p-values of 0.900 and 0.998. For the department comparisons, the p-values are 0.004 and 0.008 for department 0 versus 2, and 0.001 and 0.0002 for department 1 versus 2.

In the random intercept model we have also commanded predicted values (variable 7) and predicted probabilities of having the predicted values as computed by the software (variables 5 and 6).


Variable 5: predicted probability of predicted value of target accounting the department score only

Variable 6: predicted probability of predicted value of target accounting both department and agecat scores

Variable 7: predicted value of target

Random intercept models, like automatic linear regression and other generalized mixed linear models, allow XML files of the analysis to be created. These files can be used for predicting the likelihood of future patients falling out of bed. However, SPSS employs a different format for this purpose, specifically WinRAR ZIP files.

Shareware refers to software that requires a small fee for registration after a trial period, typically around 40 days. WinRAR uses the ZIP file format, which Microsoft has employed since 2006 for compressing data, particularly XML (eXtensible Markup Language) files.

Conclusion

Generalized linear mixed models are suitable for analyzing data with multiple categorical variables Random intercept versions of these models provide better sensitivity of testing than fixed intercept models.

Note

More information on statistical methods for analyzing data with categories is given in Chaps. 4 and 5 of this volume.

Automatic Regression for Maximizing Linear Relationships

General Purpose

Automatic linear regression is included in the Statistics Base add-on module of SPSS. This chapter evaluates its effectiveness in enhancing the precision of clinical trial analyses. SPSS employs various techniques, including automatic transformation of X-variables, rescaling of time and measurement values, outlier trimming, and category merging, to achieve a better data fit.

Specific Scientific Question

In a clinical crossover trial, researchers compared an old laxative with a new one, measuring the number of stools per month as the primary outcome. The study analyzed the effect of the old laxative and the patients' age as predictor variables. The aim was to determine whether automatic linear regression yields more accurate statistical results than traditional multiple linear regression for this dataset.

Data Example

Patno Newtreat Oldtreat Age categories


10.00 42.00 15.00 1.00

patno = patient number
newtreat = frequency of stools on a novel laxative
oldtreat = frequency of stools on an old laxative
agecategories = patients' age categories (1 = young, 2 = middle-age, 3 = old)

Only the first 10 of the 55 patients are shown above; the complete dataset is available at extras.springer.com under the title "chap7automaticlinreg". Initially, we will conduct a standard multiple linear regression.

Analyze…Regression…Linear…Dependent: enter newtreat…Independent: enter oldtreat and agecategories…click OK.

[Model summary and ANOVA tables: Std. Error of the Estimate; Sum of Squares, df, Mean Square, F, Sig.; Total sum of squares 3380.171 with df 34. a Dependent Variable: newtreat; b Predictors: (Constant), oldtreat, agecategories]


The output shows that traditional linear regression fails to demonstrate a significant effect of age, and yields only a borderline significance (p = 0.047) for the old laxative. Subsequently, an automatic linear regression analysis is conducted.

To perform the automatic linear regression, begin by navigating to the Transform menu and selecting Random Number Generators, then set the starting point to a fixed value of 2,000,000 and click OK. Next, proceed to the Analyze menu and select Regression, followed by Automatic Linear Regression. In the Fields section, drag newtreat to the Target area and patientno to the Analysis Weight, and drag oldtreat and agecategories to the Fields section. Click Build Options, mark the option to create a standard model, and in the Basics section ensure that "Automatically prepare data" is selected. Under Model Options, check the boxes to save predicted values to the dataset and to export the model, naming the file "exportautomaticlinreg" and saving it in your desired folder. Finally, click Run to execute the analysis.

The automatic linear regression results below show that the two predictors, agecategories and oldtreat, have been transformed into, respectively, merged categories and a variable with outliers trimmed.
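A rough Python analogue of this automatic data preparation is sketched below: trim outliers in the continuous predictor, merge sparse categories, then refit an ordinary regression. The percentile limits and the merged categories are assumptions for illustration, not the rules SPSS applied.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("chap7automaticlinreg.csv")     # assumed CSV export

# Outlier trimming: winsorize oldtreat at the 5th/95th percentiles,
# a crude stand-in for SPSS's automatic outlier handling.
lo, hi = df["oldtreat"].quantile([0.05, 0.95])
df["oldtreat_t"] = df["oldtreat"].clip(lo, hi)

# Category merging: pool age categories 2 and 3, loosely mirroring
# the merged categories reported by automatic linear regression.
df["agecat_m"] = df["agecategories"].replace({3: 2})

X = sm.add_constant(df[["oldtreat_t", "agecat_m"]])
print(sm.OLS(df["newtreat"], X).fit().summary())
```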

An interactive graph illustrates the predictors as lines, whose thickness indicates predictive strength, alongside the outcome displayed as a histogram with a Gaussian best fit. Both predictors demonstrate high statistical significance, with a correlation coefficient at p
