Background
Traditional statistical tests struggle with large variable sets. Add-up scores are a simplistic method for reduction, but they overlook the significance of the individual variables, their interactions, and differences in measurement units. In contrast, machine learning leverages training data processed by computers to enhance predictive capabilities. When dealing with complex datasets containing numerous variables, advanced computational methods are essential for effective analysis.
Objective and Methods
The current book, using real data examples as well as simulated data, reviews important methods relevant for health care and research, although little used in the field so far.
Results and Conclusions
1 Logistic regression is one of the earliest machine learning techniques applied in health research, used for health profiling by predicting the risk of medical events in individuals through the analysis of specific combinations of x-variables.
2 A wonderful method for analyzing imperfect data with multiple variables is optimal scaling (Chaps 3 and 4).
3 Partial correlations analysis is the best method for removing interaction effects from large clinical data sets (Chap 5).
4 Mixed linear modeling, binary partitioning, item response modeling, time-dependent predictor analysis, and autocorrelation are linear or log-linear regression methods suitable for evaluating data with, respectively, repeated measures, binary decision trees, exponential exposure-response relationships, values that differ across periods, and seasonal differences (Chaps 6-10).
5 Clinical data sets exhibiting non-linear relationships between exposure and outcome variables necessitate specialized analytical methods. These data sets can often be effectively analyzed with neural network techniques, such as multilayer perceptron networks and radial basis function networks.
6 Clinical data involving multiple exposure variables are typically assessed with analysis of (co)variance (AN(C)OVA). However, this approach fails to adequately address the relative significance of the variables and their interactions. In contrast, factor analysis and hierarchical cluster analysis overcome these limitations, providing a more comprehensive understanding of the data.
7 Data featuring multiple outcome variables are typically examined with multivariate analysis of (co)variance (MAN(C)OVA), which shares the limitations of ANOVA. To address them, methods such as partial least squares analysis, discriminant analysis, and canonical regression are recommended.
8 Fuzzy modeling is a method suitable for modeling soft data, like data that are partially true or response patterns that are different at different times (Chap 19).
Traditional statistical tests struggle with large variable sets, making it essential to reduce their complexity. While add-up scores are a straightforward approach to condensing multiple variables, they fail to consider the individual significance of each variable, their interactions, and the variations in measurement units.
Modern computational methods such as principal components analysis, partial least squares analysis, hierarchical cluster analysis, optimal scaling, and canonical regression are increasingly recognized as machine learning techniques. These methods involve complex computations that necessitate computer assistance and transform raw data into actionable insights, resembling a learning process in human terms.
The novel methods offer a significant advantage by addressing the limitations of traditional approaches. While they are commonly used in the behavioral sciences, social sciences, marketing, operational research, and applied sciences, their application in medicine remains largely untapped. This is unfortunate, especially considering the abundance of variables present in medical research, but wider adoption is probably only a matter of time, now that the methods are increasingly available in SPSS statistical software and many other packages.
This book begins with an exploration of logistic regression for health profiling, which uses individual combinations of x-variables to assess the risk of medical events in single patients. Subsequent chapters delve into optimal scaling, an effective technique for analyzing imperfect data, and partial correlations, which excel at eliminating interaction effects. Further chapters cover mixed linear modeling, binary partitioning, item response modeling, time-dependent predictor analysis, and autocorrelation: linear or log-linear regression methods suitable for assessing data with, respectively, repeated measures, binary decision trees, exponential exposure-response relationships, period effects, and seasonal differences.
The book also discusses statistical methods for analyzing data with complex relationships between exposure and outcome variables. It highlights the limitations of traditional analysis of variance (ANOVA) and multivariate analysis of variance (MANOVA) when dealing with multiple variables, particularly their loss of power and the numerical problems with higher-order calculations. Advanced techniques such as factor analysis, hierarchical cluster analysis, partial least squares analysis, discriminant analysis, and canonical regression address these challenges. Additionally, fuzzy modeling is introduced as an approach for handling soft data and response patterns that vary over time.
Modern methodologies offer significant advantages over traditional methods such as ANOVA and MANOVA, as they can effectively manage large datasets with multiple exposure and outcome variables while maintaining a relatively unbiased approach.
This book serves as an accessible introduction to machine learning methods in clinical research, specifically designed for clinicians and newcomers to the field. Drawing from their experience as master class professors, the authors emphasize the importance of mastering statistical software, providing detailed guidance on using SPSS from the initial login to the final results. The chapter concludes with essential machine learning terminology to enhance understanding.
Artificial Intelligence
Engineering method that simulates the structures and operating principles of the human brain.
Bootstraps
Machine learning methods require significant computational resources and often rely on bootstraps, a technique otherwise known as "random sampling from the data with replacement", to facilitate the calculations. It is classified as a Monte Carlo method.
Canonical Regression
ANOVA (analysis of variance) and ANCOVA (analysis of covariance) are used for data with multiple independent variables, while MANOVA (multivariate analysis of variance) and MANCOVA (multivariate analysis of covariance) are used for data with multiple dependent variables.
The challenge with these traditional methods is their diminishing statistical power as the number of variables increases, often combined with numerical problems in the higher-order calculations. Clinically, the focus is typically on the combined effect of a cluster of variables rather than on their individual effects. While composite variables can aggregate separate variables, they fail to account for the relative importance of, the interactions between, and the differences in units of the separate variables. Canonical analysis addresses these concerns: it provides overall test statistics for entire variable sets, alongside test statistics for the individual variables, and thus offers a more comprehensive analysis than MANCOVA.
Components
The term components is often used to indicate the factors in a factor analysis, e.g., in rotated component matrix and in principal component analysis.
Cronbach’s alpha
Cronbach's alpha = [K / (K − 1)] × (1 − Σ s_i² / s_T²), where K = number of original variables, s_i² = variance of the i-th original variable, and s_T² = variance of the total score of the factor obtained by summing up all of the original variables.
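A minimal sketch of this computation, assuming a pandas DataFrame whose columns are the original variables contributing to one factor (the data and variable names are illustrative, not from the book):

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]                          # K = number of original variables
    item_vars = items.var(axis=0, ddof=1)       # s_i^2 for each original variable
    total_var = items.sum(axis=1).var(ddof=1)   # s_T^2 of the summed total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# illustrative data: 5 correlated items measured in 100 subjects
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
items = pd.DataFrame(base + rng.normal(scale=0.5, size=(100, 5)))
print(round(cronbach_alpha(items), 2))          # close to 1 for internally consistent items
```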
Cross-Validation
Splitting the data into a k-fold scale and comparing it with a k − 1 fold scale, in order to assess the test-retest reliability of the factors identified in factor analysis, i.e., the internal consistency among the original variables contributing to each factor.
Data Dimension Reduction
Factor analysis term used to describe what it does with the data.
Data Mining
A field at the intersection of computer science and statistics that attempts to discover patterns in large data sets.
Discretization
Converting continuous variables into discretized values in a regression model.
Discriminant Analysis
A multivariate method, largely identical to factor analysis, but going one step further: a predictor variable, such as treatment modality, is included in the model, and its significance as a predictor of clinical improvement is tested by asking whether clinical improvement influences the likelihood of having received a particular treatment. Although this may initially appear counterintuitive, using outcomes for making predictions is mathematically valid, and the approach effectively exploits linear cause-effect relationships even when the outcome variables are complex.
Eigenvectors
Eigenvectors play a crucial role in factor analysis, representing the positions of the original variables in relation to the new factors. The scree plot uses the eigenvector values to illustrate the relative importance of the new factors alongside the original variables.
Elastic Net Regression
Shrinking procedure similar to lasso, but made suitable for larger numbers of predictors.
Factor Analysis
Two or three unmeasured factors are identified to explain a much larger number of measured variables.
Factor Analysis Theory
ALAT (alanine aminotransferase), ASAT (aspartate aminotransferase), and gamma-GT (gamma-glutamyl transferase) provide insights into liver function, while urea, creatinine, and creatinine clearance indicate renal function. To predict morbidity and mortality from these variables, multiple regression is often employed; however, it is only valid when there is little correlation among the variables. High correlation, known as collinearity or multicollinearity, prevents the simultaneous use of these variables in a regression model and necessitates alternative methods. Factor analysis offers a solution by replacing the original variables with a limited number of new variables, or factors, that have the highest possible correlation with the originals. This multivariate technique, akin to MANOVA, presents the new factors in an orthogonal manner, effectively mitigating the effects of collinearity by ensuring that the covariance between the new factors is zero. Ultimately, factor analysis transforms manifest predictor variables into latent predictor variables.
In this way it can be considered a univariate method, but mathematically it is a multivariate method, because multiple rather than single latent variables are constructed from the predictor data available.
Factor Loadings
Factor loadings represent the correlation coefficients between the original variables and the newly estimated latent factor, while accounting for all original variables and any differences in measurement units.
Fuzzy Memberships
The universal spaces are divided into equally sized parts called membership functions.
Fuzzy Modeling
A method for modeling soft data, like data that are partially true or response patterns that are different at different times.
Fuzzy Plots
Graphs summarizing the fuzzy memberships of (for example) the input values.
Generalization
Ability of a machine learning algorithm to perform accurately on future data.
Hierarchical Cluster Analysis
Hierarchical cluster analysis rests on the idea that patients with similar characteristics may also respond similarly to, for example, a drug. It relies on large data sets and is a computationally intensive method, often classified among advanced analytical techniques in healthcare research. As a form of explorative data mining it uses the patients themselves, rather than newly constructed characteristics, as the dependent variables, which may make it more relevant for drug-efficacy questions than techniques such as factor analysis.
Internal Consistency Between the Original Variables
Contributing to a Factor in Factor Analysis
A strong correlation among the responses to the questions contributing to a single factor is essential, as all questions should ideally predict the same outcome. This correlation is quantified with Cronbach's alpha, where a value of 0 indicates a poor relationship and 1 a perfect correlation. Additionally, the test-retest reliability of the original variables should be evaluated by omitting one variable at a time; at least 80% of the data files with one variable missing should yield results consistent with those of the complete data file, indicated by alphas greater than 0.80.
Iterations
Complex mathematical models can be challenging to process, even for modern computers. Current software packages use a method known as iterations: multiple calculations are performed, and the one that best fits the criteria is selected.
Lasso Regression
Shrinking procedure slightly different from ridge regression, because it shrinks the smallest b-values to 0.
Latent Factors
Latent factors refer to underlying variables identified in factor analysis that are not directly measured but are inferred from the original data.
Learning
This term would largely fit the term “fitting” in statistics.
Learning Sample
Previously observed outcome data which are used by a neural network to learn to predict future outcome data as close to the observed values as possible.
Linguistic Membership Names
Each fuzzy membership is given a name, otherwise called linguistic term.
Linguistic Rules
The relationships between the fuzzy memberships of the input data and those of the output data.
Logistic Regression
Logistic regression is akin to linear regression but differs in that its dependent variable is binary, for example responder versus non-responder. The binary outcome is analyzed through the log odds of the response variable.
Machine Learning
Knowledge for making predictions, obtained by processing training data through a computer. Particularly the modern, computationally intensive methods are increasingly used for this purpose.
Monte Carlo Methods
Iterative testing in order to find the best fit solution for a statistical problem.
Multicollinearity or Collinearity
There should not be a strong correlation between the different original variable values in a conventional linear regression. A correlation coefficient (R) > 0.80 indicates multicollinearity and, thus, a flawed multiple regression analysis.
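A brief illustration of such a screen, flagging predictor pairs with |R| > 0.80; the simulated predictors are assumptions for demonstration only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
predictors = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.95 + rng.normal(scale=0.2, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),                          # independent predictor
})

corr = predictors.corr()                                 # Pearson correlation matrix
high = [(i, j, round(corr.loc[i, j], 2))
        for i in corr.columns for j in corr.columns
        if i < j and abs(corr.loc[i, j]) > 0.80]
print(high)                                              # pairs that would flaw a multiple regression
```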
Multidimensional Modeling
In data visualization, the y- and x-axes represent two factors, while a third factor can be illustrated with a z-axis in a 3D graph. Additional factors can be incorporated into the model, but they cannot be represented visually in 2D or 3D. Software programs, however, can handle such multidimensional calculations efficiently, akin to those used in multiple regression modeling.
Multilayer Perceptron Model
Neural network consisting of multiple layers of artificial neurons; a neuron that receives a signal beyond some threshold propagates it forward to the next layer.
Multivariate Machine Learning Methods
The methods that always include multiple outcome variables. They include discriminant analysis, canonical regression, and partial least squares.
Multivariate Method
Statistical analysis method for data with multiple outcome variables.
Network
This term would largely fit the term “model” in statistics.
Neural Network
Distribution-free method for data modeling based on layers of artificial neurons that transmit the input information.
Optimal Scaling
The problem with linear regression is that consecutive levels of the variables are assumed to be equal, while in practice this is virtually never true. Optimal scaling is a method designed to maximize the relationship between a predictor and an outcome variable by adjusting their scales. It makes use of discretization and regularization methods (see there).
Overdispersion, Otherwise Called Overfitting
The phenomenon that the spread in the data is wider than compatible with Gaussian modeling. This phenomenon is particularly common with discretization of continuous variables.
Partial Correlation Analysis
A meaningful data analysis requires accounting for significant interactions between the independent variables; otherwise, the results may be misleading. One remedy is to repeat the study with the interacting variables held constant. Alternatively, partial correlation analysis can remove the interaction effects by controlling for the interacting variables.
Partial Least Squares
Partial least squares analysis is a multivariate method that, like factor analysis, identifies latent variables, but differs in that it uses a predetermined cluster of, say, 4 or 5 predictor variables instead of all available predictors. Unlike factor analysis, which ignores the response variables, partial least squares analysis incorporates them, resulting in a better fit for the response variable. In addition, it produces correlation coefficients from multivariate linear regression rather than correlation coefficients fitted along the x- and y-axes.
Pearson's Correlation Coefficient (R)
R is a statistical measure that quantifies the strength of the relationship between two variables. Its value ranges from −1 to +1, where 0 indicates no correlation, −1 a perfect negative correlation, and +1 a perfect positive correlation. The stronger the association, the better one variable predicts the other.
Principal Components Analysis
Radial Basis Functions
Symmetric functions around the origin, the equations of Gaussian curves are radial basis functions.
Radial Basis Function Network
Neural network, that, unlike the multilayer perceptron network, uses Gaussian instead of sigmoidal activation functions for transmission of signals.
Regularization
Correcting discretized variables for overfitting, otherwise called overdispersion.
Ridge Regression
Important method for shrinking b-values for the purpose of adjusting overdispersion.
Splines
Cut pieces of a non-linear graph, originally thin wooden strips for modeling cars and airplanes.
Supervised Learning
Machine learning using data that include both input and output data (exposure and outcome data).
Training Data
The output data of a supervised learning data set.
Triangular Fuzzy Sets
A common way of drawing the membership function, with the input values on the x-axis and the membership grade for each input value on the y-axis.
Universal Space
A term often used with fuzzy modeling: the defined range of input values and the defined range of output values.
Unsupervised Learning
Machine learning using data that include only input data (exposure data).
Varimax Rotation
A "two-factor" analysis demonstrates that slight rotations of the x- and y-axes can improve the fit of the model. This simultaneous rotation, known as varimax rotation, assumes that the two new factors are completely independent. If independence does not apply, the x- and y-axes can be rotated separately to achieve the best model fit for the data.
Weights
This term would largely fit the term “parameters” in statistics.
Machine learning continues to evolve and introduces a variety of novel terms that are essential for understanding the field. Additional terminology is explored in the upcoming chapters, and a comprehensive index provides a clearer overview of these concepts.
Sometimes machine learning is discussed as a discipline conflicting with statistics [1].
There are notable differences in terminology and approach between machine learning and statistics. Machine learning is primarily conducted by computer scientists, who often come from diverse fields such as psychology, biology, and economics rather than from traditional statistics with its strong mathematical foundations. Computer science is a modern field, with appealing terminology, lucrative job opportunities, and better income prospects than statistics. It typically deals with larger and more complex data sets and emphasizes prediction modeling over null-hypothesis testing. However, the limited mathematical training of computer scientists can leave them unaware of the limitations of the models they employ.
1 Machine learning is a powerful tool for making predictions based on the analysis of training data processed by computers. Given the complexity of data sets with multiple variables, advanced computational methods are essential for effective analysis. This book reviews important machine learning techniques applicable to health care and research, despite their limited use in the field to date.
2 Logistic regression is one of the earliest machine learning techniques applied in health research, used for health profiling by predicting the risk of medical events in individuals based on specific combinations of predictor variables.
3 A wonderful method for analyzing imperfect data with multiple variables is optimal scaling (Chaps 3 and 4).
4 Partial correlations analysis is the best method for removing interaction effects from large clinical datasets (Chap 5).
5 Mixed linear modeling, binary partitioning, item response modeling, time-dependent predictor analysis, and autocorrelation are linear or log-linear regression methods suitable for evaluating data with, respectively, repeated measures, binary decision trees, exponential exposure-response relationships, values that differ over time, and seasonal differences. These methodologies are discussed in Chapters 6 through 10.
6 Clinical data sets exhibiting non-linear relationships between exposure and outcome variables necessitate specialized analysis techniques. Neural network methods, such as multilayer perceptron networks and radial basis function networks, are often effective for this type of analysis.
7 Clinical data featuring multiple exposure variables are typically analyzed with analysis of (co)variance (AN(C)OVA); however, this method fails to sufficiently address the relative significance of the variables and their interactions. Factor analysis and hierarchical cluster analysis effectively overcome these limitations.
8 Data with multiple outcome variables are usually analyzed with multivariate analysis of (co)variance (MAN(C)OVA), which shares the limitations of ANOVA. Methods such as partial least squares analysis, discriminant analysis, and canonical regression overcome these limitations.
9 Fuzzy modeling is a method suitable for modeling soft data, like data that are partially true or response patterns that are different at different times (Chap 19).
1 O’Connor B (2012) Statistics versus machine learning, fight http://brenocon.com/blog/2008/12 Accessed 25 Aug 2012
Logistic regression can be used for predicting the probability of an event in subjects at risk.
Methods and Results
It uses log-linear models of the kind shown underneath (ln = natural logarithm, a = intercept, b = regression coefficient, x = predictor variable): ln odds = a + bx.
A study involving 1,000 participants of varying ages tracked myocardial infarction occurrences over a decade. These data enabled risk calculations for predicting future myocardial infarction events in individual subjects.
Conclusions
1 The methodology is currently an important way to determine, with limited health care sources, what individuals are at low risk and will, thus, be:
(3) given the assignment to be treated or not
(4) given the “do not resuscitate sticker”
Logistic Regression for Health Profiling
2 We must take into account that some of the predictor variables may be heavily correlated with one another, and the results may, therefore, be inflated
3 Also, the calculated risks may be true for subgroups, but for individuals less so, because of the random error
Logistic regression is a statistical method used to predict the likelihood of an event, such as an infarction. It calculates the odds of an infarction from data observed in a group of patients, allowing for informed decision-making in medical assessments.
The odds of an infarction in a group is correlated with age: the older the patient, the larger the odds.
According to Fig 2.1 the odds of infarction is correlated with age, but we may ask how?
According to Fig 2.2 the relationship is not linear, but after transformation of the odds values on the y-axis into log odds values the relationship is suddenly linear.
We will, therefore, transform the linear equation y = a + bx
Fig 2.1 In a group of multiple ages the number of patients at risk of infarction is given by the dotted line
into a log-linear equation (ln = natural logarithm): ln odds = a + bx (x = age).
Our group consists of 1,000 subjects of different ages who have been observed for 10 years for myocardial infarctions. Using SPSS statistical software, we command: binary logistic regression; dependent variable: infarction yes/no; independent variable: age.
The program produces a regression equation: ln odds = ln (patients with infarction / patients without infarction) = a + bx.
Age is, thus, a significant determinant of the odds of infarction (which can be used as a surrogate for the risk of infarction).
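For readers working outside SPSS, a minimal sketch of an analogous univariate fit in Python (statsmodels) is shown below; the simulated ages, outcomes, and coefficient values are illustrative assumptions, not the book's data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
age = rng.uniform(40, 80, n)                  # ages of the 1,000 subjects
true_lnodds = -9.2 + 0.12 * age               # assumed 'true' a and b, for simulation only
p = 1 / (1 + np.exp(-true_lnodds))            # logistic link: probability of infarction
infarct = rng.binomial(1, p)                  # observed yes/no outcome over 10 years

X = sm.add_constant(age)                      # adds the intercept a
fit = sm.Logit(infarct, X).fit(disp=False)    # binary logistic regression
print(fit.params)                             # a (const) and b (age) on the ln-odds scale
print(fit.pvalues)                            # significance of age as determinant of ln odds
```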
Fig 2.2 Relationships between the odds of infarction and age
Then, we can use the equation to predict the odds of infarction from a patient's age: ln odds = a + b × age, and odds = e^(ln odds).
The likelihood of infarction can be assessed more accurately by analyzing multiple independent variables. For instance, a study tracking 10,000 patients over a decade records both infarction occurrences and baseline characteristics, allowing for a comprehensive evaluation of the factors influencing the risk of heart attacks.
(predictors): gender, age, BMI (body mass index), systolic blood pressure, cholesterol, heart rate, diabetes, antihypertensives, previous heart infarct, smoker.
The data are entered in SPSS, and it produces the b-values (predictors of infarction) with their p-values.
It is decided to exclude predictors that have a p-value > 0.10
The regression equation underneath is used to calculate the best predictable y-value from every single combination of x-values: ln odds(infarct) = a + b1x1 + b2x2 + b3x3 + …
2 Logistic Regression for Health Profi ling
For instance, for a subject with a given set of characteristics (= predictor variables), for example smoker (x10), the calculated odds of having an infarction in the next 10 years follow from entering the subject's x-values and the corresponding b-values into the equation:
ln odds(infarct) = −0.5522, so odds(infarct) = e^(−0.5522) ≈ 0.58 = 58/100.
The odds is often interpreted as risk. However, the true risk is a bit smaller than the odds, and can be found from the equation: risk of event = 1 / (1 + 1/odds).
If odds of infarction = 0.58, then the true risk of infarction = 0.37
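A minimal sketch of this profiling step, with placeholder a-, b-, and x-values (the book's actual coefficients and patient characteristics are not shown here):

```python
import numpy as np

a = -4.0                                          # intercept (assumed)
b = np.array([0.5, 0.04, 0.02, 0.01, 0.3, 0.6])   # regression coefficients (assumed)
x = np.array([1, 65, 140, 5.5, 1, 1])             # one subject's predictor values (assumed)

ln_odds = a + b @ x                               # ln odds(infarct) = a + b1x1 + b2x2 + ...
odds = np.exp(ln_odds)                            # back-transform to the odds scale
risk = 1 / (1 + 1 / odds)                         # risk = 1 / (1 + 1/odds), i.e. odds/(1 + odds)
print(ln_odds, odds, risk)
```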
The above methodology is currently an important way to determine, with limited health care sources, what individuals will be:
3 given the assignment to be treated or not
4 given the “do not resuscitate sticker”
A comprehensive database is essential for accurately determining the b-values in logistic models that convert predictor variables into event probabilities for individual subjects. This approach is prevalent in medicine, exemplified by the TIMI (thrombolysis in myocardial infarction) risk score, and is also used in strategic management, psychological testing, and other fields. While linear regression typically uses the squared correlation coefficient (r²) to assess model fit, logistic models lack a direct equivalent. Nevertheless, pseudo-R² and similar metrics have been developed to evaluate the strength of association between predictors and events in these models.
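As one hedged illustration, McFadden's pseudo-R² compares the log-likelihood of the fitted model with that of an intercept-only (null) model; assuming a statsmodels Logit result such as `fit` from the earlier sketch, it can be obtained as follows:

```python
# McFadden's pseudo R-squared: 1 - lnL(model) / lnL(null model)
mcfadden = 1 - fit.llf / fit.llnull   # statsmodels also exposes this as fit.prsquared
print(round(mcfadden, 3))
```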
Logistic regression is a valuable tool for predicting the risk of events such as cardiovascular incidents, cancer, and death. However, its limitations must be recognized: the method relies on observational data, which can lead to misinterpretation of the results.
1 The assumption that baseline characteristics are independent of treatment efficacies may be wrong
2 Sensitivity of testing is jeopardized if the models do not fit the data well enough
3 Relevant clinical phenomena like unexpected toxicity effects and complete remissions can go unobserved
4 The inclusion of multiple variables in regression models raises the risk of clinically unrealistic results
A study was conducted to evaluate the risk factors for endometrial cancer among postmenopausal women, with the aim of identifying the determinants of this cancer in this group. A logistic regression model was employed in which the dependent variable was the natural logarithm of the odds of developing endometrial cancer, and one of the independent variables was short-term estrogen consumption.
Table 2.1 Examples of predictive models where multiple logistic regression has been applied
Dependent variable (odds of event) and independent variables (predictors):
1 Odds of infarction; predictors: age, comorbidity, comedication, risk factors
2 Car producer (strategic management research) [2]: odds of a successful car; predictors: cost, size, horse power, ancillary properties
3 Item response modeling (Rasch models for computer-adapted tests) [3]: odds of a correct answer to three questions of different difficulty; predictors: correct answers to three previous questions
Long-term estrogen consumption significantly impacts women's health through a series of interconnected risks. Specifically, it is associated with a low fertility index, obesity, hypertension, and early menopause, all of which contribute to an increased likelihood of developing endometrial cancer. The odds ratios show that consumers of estrogen have a higher chance of cancer than non-consumers, and that patients with low fertility and obesity also have elevated cancer risks. These findings underscore the importance of the regression coefficients, standard errors, and p-values when assessing the odds ratios of these risk factors.
The analysis revealed significant b-values indicating a heightened cancer risk associated with estrogen consumption, low fertility, obesity, and hypertension, with a risk estimate of up to 76-fold. However, this figure is probably exaggerated because of the strong correlations among these variables. While logistic regression is a valuable exploratory research tool, its findings should be interpreted cautiously, particularly when applied to individual health profiling: the calculated risks may hold true for subgroups, but less so for individuals, because of random error.
Logistic regression can be used for predicting the probability of an event. It uses log-linear models of the kind: ln odds = a + b1x1 + b2x2 + …
The methodology is currently an important way to determine, with limited health care sources, what individuals will be:
3 given the assignment to be treated or not
4 given the “do not resuscitate sticker”
It must be recognized that some of the predictor variables may be strongly correlated, leading to inflated results. Additionally, the calculated risks may apply to subgroups but less so to individuals, because of random error.
1 Antman EM, Cohen M, Bernink P, McCabe CH, Horacek T, Papuchis G, Mautner B, Corbalan R, Radley D, Braunwald E (2000) The TIMI risk score for unstable angina pectoris, a method for prognostication and therapeutic decision making J Am Med Assoc 284:835–842
2 Hoetner G (2007) The use of logit and probit models in strategic management research Strateg Manag J 28:331–343
3 Rudner LM Computer adaptive testing http://edres.org/scripts/cat/catdemo.htm Accessed 18 Dec 2012
In clinical trials, researchers frequently use multiple regression analysis to evaluate various variables related to their research questions. However, a key limitation of multiple regression is its assumption that the consecutive levels of these variables are equal, which is rarely the case in practice. Optimal scaling addresses this issue by maximizing the relationship between the predictor and outcome variables through adjustment of their scales.
A simulated drug efficacy trial involving 27 variables was conducted to compare optimal scaling with traditional multiple linear regression. The analysis used the Optimal Scaling module in SPSS.
Results
The two methods produced similarly sized results, with 7 versus 6 p-values < 0.10 and 3 versus 4 p-values < 0.010, respectively.
Conclusions
1 Optimal scaling using discretization is a method for analyzing clinical trials where the consecutive levels of the variables are unequal
2 In order to fully benefit from optimal scaling a regularization procedure for the purpose of correcting overdispersion is desirable
In clinical trials, research questions are often assessed with multiple variables, for example gene expressions to predict the efficacy of cytostatic treatment, repeated measurements in randomized longitudinal trials, and multi-item personal scores for evaluating antidepressants. Multiple linear regression analysis is commonly employed to evaluate the effects of the predictor variables on the outcome variables; a limitation of this method, however, is the assumption that consecutive levels of the predictor variables are equal, which is rarely the case in practice. For instance, a continuous predictor variable scored on a scale of 0-10 may have missing values at certain points, and various alternative scales, such as those with two or four parts, can be applied, highlighting the arbitrary nature of scale selection.
Table 3.1 illustrates that each scale yields distinct results, with scales 2 and 3 showing a gradual improvement in t-values and p-values. Optimal scaling, described by Gifi (Department of Data Theory, Leiden, 1990), aims to maximize the relationship between predictor and outcome variables with a computationally intensive method that uses quadratic approximation to identify the best scale for the data. The technique is extensively discussed in the statistical literature and is an important part of machine learning, the field concerned with enabling computers to make predictions from complex empirical data. Its application in clinical research remains limited: a Medline search identified only a few genetic [11, 12], epidemiological [13, 14], and psychological [15] studies, and very few therapeutic trials [16, 17], despite its pleasant property of improving the p-values of testing and, thus, turning negative results into positive ones.
This chapter presents a simulated example to evaluate the performance of optimal scaling in a multiple-variables model. Our aim is to stimulate clinical researchers to adopt this improved analytical approach for predictive trials.
Fig 3.1 Linear regression analysis. An example of a continuous predictor variable (x-variable) on a scale 0-10. Patients with the predictor values 0, 1, 5, 9 and 10 are missing
Table 3.1 Linear regression analysis of the data from Fig 3.1 using three different scales
With the scales 2 and 3 a gradual improvement of the t-values and p-values is observed
a Dependent variable: outcome
Cross-Validation
Splitting the data into a k-fold scale and comparing it with a k−1 fold scale.
Discretization
Converting continuous variables into discretized values in a regression model.
Elastic Net Regression
Shrinking procedure similar to lasso, but made suitable for larger numbers of predictors.
Lasso Regression
Shrinking procedure slightly different from ridge regression, because it shrinks the smallest b-values to 0.
Overdispersion, Otherwise Called Overfitting
The phenomenon that the spread in the data is wider than compatible with Gaussian modeling. This phenomenon is particularly common with discretization of continuous variables.
Monte Carlo Methods
Iterative testing in order to find the best fit solution for a statistical problem.
Regularization
Correcting discretized variables for overfitting, otherwise called overdispersion.
Ridge Regression
Important method for shrinking b-values for the purpose of adjusting overdispersion.
Splines
Cut pieces of a non-linear graph, originally thin wooden strips for modeling cars and airplanes.
Optimal scaling makes use of processes like discretization (converting continuous variables into discretized values) and regularization (correcting discretized variables for overfitting, otherwise called overdispersion) [8–11].
To convert continuous data into a discrete model, the quadratic approximation is an effective method; it is based on the expansion f(x) ≈ f(a) + f′(a)(x − a), where f′(a) denotes the first derivative of the function at the point a. The approximation relies on the principle that the quadratic model is the simplest alternative to a linear model, and the first derivative, which gives the slope of the function, captures the function's magnitude. The technique is useful for analyzing complex functions, such as standard errors, and for determining the optimal discretization distance between a given x-value and its nearest a-value, which serves as the best-fit scale for the data.
To enhance the best fit scale for a variable, SPSS offers the capability to split a linear variable into two segments, known as splines, which allows for the combination of two linear functions to effectively model non-linear patterns.
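A minimal sketch of the discretization step, loosely mirroring the SPSS option "Discretize: Method Grouping, Number categories 7" used below; it relies on scikit-learn's KBinsDiscretizer, and the simulated predictor is an illustrative assumption:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(3)
gene_expression = rng.uniform(0, 10, size=(250, 1))       # one predictor on the 0-10 scale

disc = KBinsDiscretizer(n_bins=7, encode="ordinal", strategy="quantile")
categories = disc.fit_transform(gene_expression)          # values 0..6, one group per bin
print(np.unique(categories, return_counts=True))          # roughly equal-sized groups
```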
The study analyzed a data file of 250 patients with 27 variables, comprising microarray gene expression levels and drug efficacy scores; the complete data file is given in the appendix. All variables were standardized on an 11-point linear scale ranging from 0 to 10. The genes 1-4, 16-19, and 24-27 were highly expressed. As outcome, composite scores of the variables 20-23 were used.
The analysis used SPSS statistical software to perform a traditional multiple linear regression, with the gene expression levels as predictors and the drug efficacy composite score as outcome variable. The results, given in Table 3.2, show a significant overall r-square value of 0.725. In order to improve the scaling of the linear regression model, the Optimal Scaling program of SPSS was used.
Command: Analyze….Regression….Optimal Scaling….Dependent Variable: Var 28 (Define Scale: mark spline ordinal 2.2)….Independent Variables: Var 1, 2, 3, 4, 16, 17, 18, 19, 24, 25, 26, 27 (all of them Define Scale: mark spline ordinal 2.2)….Discretize: Method Grouping, Number categories 7….OK
Table 3.3 presents the results, indicating that the intercept has been removed and the t-tests replaced with F-tests. The optimally scaled model, which does not use regularization, shows effects similar in size to those of the traditional multiple linear regression model, with a slightly higher overall R-squared value of 0.736 versus 0.725, and an increase from 3 to 4 in the number of p-values < 0.010. To fully benefit from optimal scaling, a regularization procedure to correct overdispersion is recommended; this is discussed in the next chapter.
Traditional linear regression struggles with multiple independent variables because of the unequal levels of these variables. Optimal scaling is an effective method to address this issue, but it may lose power as a result of overdispersion.
Table 3.2 Traditional multiple linear regression with the drug efficacy score (a composite score of the variables 20-23) as outcome and 12 gene expression levels as predictors
A sharp increase in the t-values for certain x-values can occur when other x-values are removed, a phenomenon known as unstable regression coefficients or "bouncing betas." This instability often arises from correlated predictors or from having too many predictors relative to the number of observations. Shrinking the regression coefficients has been shown to correct overdispersion and mitigate this instability.
Optimal scaling is a computationally intensive method often categorized as machine learning, because it generates predictive models through computer programs. Other machine learning techniques include factor analysis, partial least squares (PLS), canonical regression, item response modeling, and neural networks. Optimal scaling is not a competitor to these methods; it offers a distinct approach that is particularly advantageous when the scales of the variables lack homogeneity.
For those interested in analyzing the combined effects of particular subsets of variables, methods such as factor analysis [20] and partial least squares [21] are more appropriate. If the primary focus is to evaluate the collective effect of all predictor variables, canonical regression [22] is the most suitable choice.
The method has notable limitations. The scale intervals are assumed to be independent of the magnitude of the outcome variable, which may not always hold true. The spread of the data can be exaggerated by wide scale intervals. Also, the use of multiple alternative scales may yield better results.
The current chapter shows that, in order to fully benefit from optimal scaling, a regularization procedure is desirable; three such methods (ridge, lasso, and elastic net regression) are explained in the next chapter.
Table 3.3 Optimal scaling without regularization (columns: standardized coefficients, bootstrap (1,000) estimate of standard error, df, F, Sig.)
1 Optimal scaling using discretization is a method for analyzing clinical trials where the consecutive levels of the variables are unequal
2 In order to fully benefit from optimal scaling a regularization procedure for the purpose of correcting overdispersion is desirable
Appendix: Datafile of 250 Subjects Used as Example
1 Tsao DA, et al (2010) Gene expression profiles for predicting the efficacy of the anticancer drug 5-fluorouracil in breast cancer DNA Cell Biol
2 Latan et al (2012) Microemulsion-based tacrolimus cream suppresses cytokine gene expression and improves efficacy in atopic dermatitis Drug Deliv Transl Res
3 Albertin PS (1999) Longitudinal data analysis (repeated measures) in clinical trials Stat Med 18:2863–2870
4 Yang X, Shen Q, Xu H, Shoptaw S (2007) Functional regression analysis using an F test for longitudinal data with large numbers of repeated measures Stat Med 26:1552–1566
5 Sverdlov L (2001) The fastclus procedure as an effective way to analyze clinical data In: SUGI proceedings 26, paper 224, Long Beach, CA
6 Gifi A (1990) Non linear multivariate analysis Department of Data Theory, Leiden
7 Alpaydin E (2004) Introduction to machine learning http://books.google.com Accessed 25 June 2012
8 Van der Kooij AJ (2007) Prediction accuracy and stability of regression with optimal scaling transformations Ph.D thesis, Leiden University, Netherlands
9 Hojsgaard S, Halekoh U (2005) Overdispersion Danish Institute of Agricultural Sciences Copenhagen http://gbi.agrsci.dk/statistics/courses Accessed 18 Dec 2012
10 Wang L, Gordon MD, Zhu J (2006) Regularized least absolute deviations regression and an effi cient algorithm for parameter tuning In: Sixth international conference data mining 2006 doi: 10.1109/ICDM.2006.134
11 Waaijenberg S, Zwinderman AH (2007) Penalized canonical correlation analysis to quantify the association between gene expression and DNA markers BMC Proc 1(Suppl 1): S122–S125
12 Yoshiwara et al (2010) Gene expression profile for predicting survival of patients with advanced-stage serous ovarian cancer, analyzed in two independent datasets PLoS One
13 Gururajan R, Quaddus M, Xu J (2008) Clinical usefulness of handheld wireless technology in healthcare J Syst Info Technol 10:72–85
14 Kitsiou S, Manthou V, Vlachopoulou M, Markos A (2010) Adoption and sophistication of clinical information systems in Greek public hospitals 12th Med Conf Medical Biological Engineering 29:1011–1016
15 Hartmann A, Van der Kooij AJ, Zeeck A (2009) Models of clinical decision making by regression with optimal scaling Psychother Res 19:482–492
16 Triantafilidou K, Venetis G, Markos A (2012) Short term results of autologous blood injection for treatment of habitual TMJ luxation J Craniofac Surg 23(3):689–692
17 Li Y (2008) Statistical methods in surrogate marker research http://deepblue.lib.umich.edu/handle Accessed 18 Dec 2012
18 SPSS statistical software (2012) www.spss.com Accessed 12 June 2012
19 Cleophas TJ, Zwinderman AH (2011) Statistics applied to clinical studies Springer, New York
20 Bartholomew DJ (1995) Spearman and the origin and development of factor analysis Br J Math Stat Psychol 48:211–220
21 Wold H (1966) Estimation of principal components and related models by iterative least squares In: Krishnaiah PR (ed) Multivariate analysis Academic Press, New York, pp 391–420
22 Sun et al (2009) On the equivalence between canonical correlation analysis and orthonormalized partial least squares In: Proceedings of the IJCAI conference on artificial intelligence Morgan Kaufman Publishers, San Francisco
23 Sherlock C, Roberts G (2009) Optimal scaling of random walk Bernoulli 15:774–798
In the previous chapter we highlighted a significant limitation of linear regression in clinical research: it assumes equal intervals between consecutive levels of the predictor variables (x-variables), which rarely reflects real-world scenarios. This discrepancy affects the analysis of the effects of these predictors on the outcome variables (y-axis variables).
Objective
In the current chapter we will address the subject of regularization, a method for correcting discretized variables for overdispersion.
Methods
Ridge regression, lasso regression, and elastic net regression will be demonstrated using the example from the previous chapter once more.
Results
The ridge optimal scaling model produced eight p-values < 0.01, while traditional regression and unregularized optimal scaling produced only 3 and 2 p-values < 0.01, respectively.
Lasso optimal scaling eliminated 4 of 12 predictors from the analysis, while, of the remainder, only two were significant at p < 0.01. Similarly, elastic net optimal scaling did not provide additional benefit.
Conclusions
1 Optimal scaling shows similarly sized effects compared to traditional regression. In order to benefit from optimal scaling a regularization procedure for the purpose of correcting overdispersion is desirable
2 Ridge optimal scaling performed much better than did traditional regression, giving rise to many more statistically significant predictors
3 Lasso optimal scaling shrinks some b-values to zero, and is particularly suitable if you are looking for a limited number of strong predictors
4 Elastic net optimal scaling works better than lasso if the number of predictors is larger than the number of observations
In the previous chapter we noted that linear regression, frequently used in clinical research to assess the effect of predictor variables (x-variables) on outcome variables (y-axis variables), has a significant limitation: it assumes that consecutive levels of the predictor variables are equal, a condition that rarely holds true in practice.
Optimal scaling significantly enhances the sensitivity of testing, but the risk of overdispersion of the data must be addressed to fully benefit from the methodology. Robert Tibshirani, professor of statistics at Stanford University, proposed solutions in 1996 [1], showing that shrinking the regression coefficients can improve the fit of regression models.
In the current chapter, using the example from the previous chapter, we will address the subject of regularization, a method for correcting discretized variables for overfitting, otherwise called overdispersion.
Discretization
Converting continuous variables into discretized values in a regression model
Splines
Cut pieces of a non-linear graph, originally thin wooden strips for modeling cars and airplanes.
Overdispersion, Otherwise Called Overfitting
The phenomenon that the spread in the data is wider than compatible with Gaussian modeling. This phenomenon is particularly common with discretization of continuous variables.
Regularization
Correcting discretized variables for overfitting, otherwise called overdispersion.
Ridge Regression
Important method for shrinking b-values for the purpose of adjusting overdispersion
Iterative testing in order to find the best fit solution for a statistical problem.
Cross-Validation
Splitting the data into a k-fold scale and comparing it with a k-1 fold scale.
Lasso Regression
Shrinking procedure slightly different from ridge regression, because it shrinks the smallest b-values to 0
Elastic Net Regression
Shrinking procedure similar to lasso, but made suitable for larger numbers of predictors
Optimal scaling uses techniques such as discretization, which transforms continuous variables into discrete values, and regularization, which adjusts these discrete variables for overfitting, also known as overdispersion. Discretization was discussed in the previous chapter.
Regularization is needed to correct overdispersed models and typically increases the standard error. Various adjustment methods exist; Hojsgaard and Halekoh recommend the [chi-square/degrees of freedom] ratio. A common approach is ridge regression, which shrinks the regression coefficient b with a shrinking factor λ: b_ridge = b / (1 + λ). A suitable value of λ can yield a better-fitting scale model than the traditional linear model; a Monte Carlo approach allows multiple values to be tested to identify the best-fit scale, and k-fold cross-validation is the standard technique for this evaluation. In addition to ridge regression, SPSS provides lasso regression, which shrinks the smallest b-values to zero, improving prediction accuracy when only a few strong predictors are present. Conversely, ridge regression is preferable for complex models with many weak predictors. Elastic net regression combines features of lasso and ridge, and performs better when the number of predictors exceeds the number of observations.
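A minimal sketch comparing the three shrinkage procedures on simulated data with correlated predictors (scikit-learn); the data, penalty strengths, and variable layout are illustrative assumptions rather than the book's SPSS regularization paths:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n, p = 250, 12
latent = rng.normal(size=(n, 1))
X = latent + rng.normal(scale=0.7, size=(n, p))            # 12 correlated "gene expressions"
y = X[:, :4].sum(axis=1) + rng.normal(scale=2.0, size=n)   # composite outcome score

models = {
    "ridge":       Ridge(alpha=1.0),                        # shrinks b-values, never exactly to 0
    "lasso":       Lasso(alpha=0.5),                         # shrinks the smallest b-values to 0
    "elastic net": ElasticNet(alpha=0.5, l1_ratio=0.5),      # mixture of the two penalties
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()        # k-fold cross-validation (k = 5)
    coefs = model.fit(X, y).coef_
    print(f"{name}: cv R^2 = {score:.2f}, zero coefficients = {(coefs == 0).sum()}")
```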
The data file of 250 patients, described in the previous chapter and given in the appendix, contains 27 variables comprising microarray gene expression levels and drug efficacy scores. Each variable was standardized on an 11-point linear scale ranging from 0 to 10. The following genes were highly
expressed: the genes 1-4, 16-19, and 24-27. As outcome variable, composite scores of the variables 20-23 were used.
The optimally scaled model without regularization shows effects comparable to those of the traditional regression model, with a slightly higher overall R-squared value of 0.736. To fully benefit from optimal scaling, a regularization procedure is needed to correct overdispersion. First, a ridge path model is used, with SPSS statistical software for the analysis.
Command: Analyze….Regression….Optimal Scaling….Dependent Variable: Var 28 (Define Scale: mark spline ordinal 2.2)….Independent Variables: Var 1, 2, 3, 4, 16, 17, 18, 19, 24, 25, 26, 27 (all of them Define Scale: mark spline ordinal 2.2)….Discretize: Method Grouping, Number categories 7….click Regularization….mark Ridge….OK
Figure 4.1 presents the adjusted b-values of the optimal ridge scale model, which are detailed in Table 4.1. The graph shows that the b-values of the various predictors rise progressively as the shrinking factor λ decreases. The right vertical line indicates the situation in which the spread in the data has increased by one standard error beyond the best model represented by the left line.
Fig 4.1 Optimal scaling with ridge regression. The adjusted b-values of the best-fit scale model are given by the left vertical line and are detailed in Table 4.1. Going from left to right in the graph, the shrinking factor λ decreases and the b-values of the various predictors gradually increase. The right vertical line indicates the situation in which the spread in the data has risen by one standard error above the best model, with the fit of the model deteriorated correspondingly
The sensitivity of this model is much better than that of the traditional regression, with 8 p-values < 0.01, while the traditional and unregularized optimal scaling models produced only 3 and 2 p-values < 0.01, respectively.
Also the lasso regularization model is possible (Var = variable).
Command: Analyze….Regression….Optimal Scaling….Dependent Variable: Var 28 (Define Scale: mark spline ordinal 2.2)….Independent Variables: Var 1, 2, 3, 4, 16, 17, 18, 19, 24, 25, 26, 27 (all of them Define Scale: mark spline ordinal 2.2)….Discretize: Method Grouping, Number categories 7….click Regularization….mark Lasso….OK
The adjusted b-values of the optimal lasso scale model, illustrated in Fig 4.2 and detailed in Table 4.2, show that the b-values of the genes 1, 3, 25, and 27 have been shrunk to zero and thus removed from the analysis. Lasso regression is particularly effective for identifying a limited set of strong predictors, improving prediction accuracy by excluding the weaker ones.
Finally, the elastic net method is applied.
Command: Analyze….Regression….Optimal Scaling….Dependent Variable: Var 28 (Define Scale: mark spline ordinal 2.2)….Independent Variables: Var 1, 2, 3, 4, 16, 17, 18, 19, 24, 25, 26, 27 (all of them Define Scale: mark spline ordinal 2.2)….Discretize: Method Grouping, Number categories 7….click Regularization….mark Elastic Net….OK
Table 4.3 presents results that closely align with those obtained using lasso. In this example, elastic net does not offer additional benefit, but it outperforms lasso when the number of predictors exceeds the number of observations.
Table 4.1 Optimal scaling with ridge regression (columns: standardized coefficients, bootstrap (1,000) estimate of standard error, df, F, Sig.)
Table 4.2 Optimal scaling with lasso regression (columns: standardized coefficients, bootstrap (1,000) estimate of standard error, df, F, Sig.)
Fig 4.2 Optimal scaling with lasso regression. The adjusted b-values of the best-fit scale model are given by the left vertical line, with the corresponding values in Table 4.2. As the shrinking factor λ decreases from left to right in the graph, the b-values of the various predictors progressively increase. The right vertical line represents the situation in which the spread in the data has risen by one standard error above the best model, with a corresponding deterioration of the model
Traditional linear regression struggles with multiple independent variables and often produces unstable regression coefficients, commonly known as "bouncing betas." This instability occurs when predictors are correlated or when there are many predictors relative to the number of observations. Shrinking the regression coefficients corrects overdispersion and mitigates this instability.
Optimally scaled modeling showed effects comparable to traditional linear regression. To fully benefit from optimal scaling, a regularization procedure to correct overdispersion is recommended.
The ridge optimal scaling demonstrates superior sensitivity compared to traditional linear regression, resulting in a greater number of significant predictors in your model.
Lasso optimal scaling shrinks some variable b-values to zero, making it ideal for identifying a limited number of strong predictors. Elastic net optimal scaling outperforms lasso when the number of predictors exceeds the number of observations.
We hope this paper will stimulate clinical investigators to start using this optimized analysis method for predictive trials.
Table 4.3 Optimal scaling with elastic net (ridge 0.00 through 1.00) (columns: standardized coefficients, bootstrap (1,000) estimate of standard error, df, F, Sig.)
1 Optimally scaled modeling produces effects comparable to traditional linear regression, but to fully benefit from optimal scaling, a regularization procedure for correcting overdispersion is essential
2 Particularly, the sensitivity of the ridge optimal scaling may be better than that of traditional linear regression, giving rise to more significant predictors in the data
3 Lasso optimal scaling shrinks some variable b-values to zero, and is, therefore, particularly suitable if you are looking for a limited number of strong predictors
4 Elastic net optimal scaling works better than lasso if the number of predictors is larger than the number of observations
Appendix: Datafile of 250 Subjects Used as Example
1 Tibshirani R (1996) Regression shrinkage and selection via the lasso J R Stat Soc 58:267–288
2 Alpaydin E (2004) Introduction to machine learning http://books.google.com Accessed 25 June 2012
3 Van der Kooij AJ (2007) Prediction accuracy and stability of regression with optimal scaling transformations Ph.D thesis Leiden University, Netherlands
4 Hojsgaard S, Halekoh U (2005) Overdispersion Danish Institute of Agricultural Sciences, Copenhagen http://gbi.agrsci.dk/statistics/courses Accessed 18 Dec 2012
5 Wang L, Gordon MD, Zhu J (2006) Regularized least absolute deviations regression and an effi cient algorithm for parameter tuning 6th Int Conf Data Min doi: 10.1109/ICDM.2006.134
6 SPSS statistical software (2012) www.spss.com Accessed 12 June 2012
Clinical research outcomes are influenced by numerous interrelated factors, yet multiple regression analysis assumes that these factors operate independently. This raises the question of why these variables should not be allowed to affect one another.
To assess the performance of partial regression analysis for the assessment of clinical trials with interaction between predictor factors.
A simulated 64 patient study of the effects of exercise on weight loss with calorie intake as covariate and a significant interaction on the outcome between the covariates.
The simple linear correlations of weight loss versus exercise and versus calorie intake were respectively 0.41 (p = 0.001) and −0.30 (p = 0.015) Multiple linear regression adjusted for interaction showed that exercise was no longer a significant