Background
Traditional statistical tests struggle with large variable sets. Add-up scores are a simplistic method for reduction, but they overlook the significance of the individual variables, their interactions, and differences in measurement units. In contrast, machine learning leverages training data processed by computers to enhance predictive capabilities. When dealing with complex datasets containing numerous variables, advanced computational methods are essential for effective analysis.
Objective and Methods
The current book, using real data examples as well as simulated data, reviews important methods relevant for health care and research, although little used in the field so far.
Results and Conclusions
1 Logistic regression is one of the earliest machine learning techniques applied in health research, used for health profiling by predicting the risk of medical events in individuals through the analysis of specific combinations of x-variables.
2 A wonderful method for analyzing imperfect data with multiple variables is optimal scaling (Chaps 3 and 4).
3 Partial correlations analysis is the best method for removing interaction effects from large clinical data sets (Chap 5).
4 Mixed linear modeling, binary partitioning, item response modeling, time-dependent predictor analysis, and autocorrelation are linear or log-linear regression methods suitable for evaluating data with, respectively, repeated measures, binary decision trees, exponential exposure-response relationships, values that differ across periods, and seasonal differences (Chaps 6-10).
5 Clinical data sets exhibiting non-linear relationships between exposure and outcome variables necessitate specialized analytical methods. These data sets can often be effectively analyzed with neural network techniques, such as multilayer perceptron networks and radial basis function networks.
6 Clinical data involving multiple exposure variables are typically assessed with analysis of (co)variance (AN(C)OVA). However, this approach fails to adequately address the relative significance of the variables and their interactions. In contrast, factor analysis and hierarchical cluster analysis overcome these limitations, providing a more comprehensive understanding of the data.
7 Data featuring multiple outcome variables are typically examined with multivariate analysis of (co)variance (MAN(C)OVA), which shares the limitations of ANOVA. To address them, methods such as partial least squares analysis, discriminant analysis, and canonical regression are recommended.
8 Fuzzy modeling is a method suitable for modeling soft data, like data that are partially true or response patterns that are different at different times (Chap 19).
Traditional statistical tests struggle with large variable sets, making it essential to reduce their complexity. While add-up scores are a straightforward approach to condensing multiple variables, they fail to consider the individual significance of each variable, their interactions, and the variations in measurement units.
Modern computational methods such as principal components analysis, partial least squares analysis, hierarchical cluster analysis, optimal scaling, and canonical regression are increasingly recognized as machine learning techniques. These methods involve complex computations that necessitate computer assistance and transform raw data into actionable insights, resembling a learning process in human terms.
The novel methods offer a significant advantage by addressing the limitations of traditional approaches. While they are commonly used in the behavioral sciences, social sciences, marketing, operational research, and applied sciences, their application in medicine remains largely untapped. This is unfortunate, especially considering the abundance of variables present in medical research, but wider adoption is probably only a matter of time, now that the methods are increasingly available in SPSS statistical software and many other packages.
This book begins with an exploration of logistic regression for health profiling, which uses individual combinations of x-variables to assess the risk of medical events in single patients. Subsequent chapters delve into optimal scaling, an effective technique for analyzing imperfect data, and partial correlations, which excel at eliminating interaction effects. Further chapters cover mixed linear modeling, binary partitioning, item response modeling, time-dependent predictor analysis, and autocorrelation: linear or log-linear regression methods suitable for assessing data with, respectively, repeated measures, binary decision trees, exponential exposure-response relationships, period effects, and seasonal differences.
The book also discusses statistical methods for analyzing data with complex relationships between exposure and outcome variables. It highlights the limitations of traditional analysis of variance (ANOVA) and multivariate analysis of variance (MANOVA) when dealing with multiple variables, particularly their loss of power and the numerical problems with higher-order calculations. Advanced techniques such as factor analysis, hierarchical cluster analysis, partial least squares analysis, discriminant analysis, and canonical regression address these challenges. Additionally, fuzzy modeling is introduced as an approach for handling soft data and response patterns that vary over time.
Modern methodologies offer significant advantages over traditional methods such as ANOVA and MANOVA, as they can effectively manage large datasets with multiple exposure and outcome variables while maintaining a relatively unbiased approach.
This book serves as an accessible introduction to machine learning methods in clinical research, specifically designed for clinicians and newcomers to the field. Drawing from their experience as master class professors, the authors emphasize the importance of mastering statistical software, providing detailed guidance on using SPSS from the initial login to the final results. The chapter concludes with essential machine learning terminology to enhance understanding.
Artificial Intelligence
Engineering method that simulates the structures and operating principles of the human brain.
Bootstraps
Machine learning methods require significant computational resources and often rely on bootstraps, a technique otherwise known as "random sampling from the data with replacement", to facilitate the calculations. It is classified as a Monte Carlo method.
Canonical Regression
ANOVA (analysis of variance) and ANCOVA (analysis of covariance) are used for data with multiple independent variables, while MANOVA (multivariate analysis of variance) and MANCOVA (multivariate analysis of covariance) are used for data with multiple dependent variables.
The challenge with these traditional methods is their diminishing statistical power as the number of variables increases, often combined with numerical problems in the higher-order calculations. Clinically, the focus is typically on the combined effect of a cluster of variables rather than on their individual effects. While composite variables can aggregate separate variables, they fail to account for the relative importance of, the interactions between, and the differences in units of the separate variables. Canonical analysis addresses these concerns: it provides overall test statistics for entire variable sets, alongside test statistics for the individual variables, and thus offers a more comprehensive analysis than MANCOVA.
Components
The term components is often used to indicate the factors in a factor analysis, e.g., in rotated component matrix and in principal component analysis.
Cronbach’s alpha
Cronbach's alpha = [K / (K − 1)] × (1 − Σ s_i² / s_T²), where K = number of original variables, s_i² = variance of the i-th original variable, and s_T² = variance of the total score of the factor obtained by summing up all of the original variables.
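A minimal sketch of this computation, assuming a pandas DataFrame whose columns are the original variables contributing to one factor (the data and variable names are illustrative, not from the book):

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]                          # K = number of original variables
    item_vars = items.var(axis=0, ddof=1)       # s_i^2 for each original variable
    total_var = items.sum(axis=1).var(ddof=1)   # s_T^2 of the summed total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# illustrative data: 5 correlated items measured in 100 subjects
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
items = pd.DataFrame(base + rng.normal(scale=0.5, size=(100, 5)))
print(round(cronbach_alpha(items), 2))          # close to 1 for internally consistent items
```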
Cross-Validation
Splitting the data into a k-fold scale and comparing it with a k − 1 fold scale, in order to assess the test-retest reliability of the factors identified in factor analysis, i.e., the internal consistency among the original variables contributing to each factor.
Data Dimension Reduction
Factor analysis term used to describe what it does with the data.
Data Mining
A field at the intersection of computer science and statistics that attempts to discover patterns in large data sets.
Discretization
Converting continuous variables into discretized values in a regression model.
Discriminant Analysis
A multivariate method, largely identical to factor analysis, but going one step further: a predictor variable, such as treatment modality, is included in the model, and its significance as a predictor of clinical improvement is tested by asking whether clinical improvement influences the likelihood of having received a particular treatment. Although this may initially appear counterintuitive, using outcomes for making predictions is mathematically valid, and the approach effectively exploits linear cause-effect relationships even when the outcome variables are complex.
Eigenvectors
Eigenvectors play a crucial role in factor analysis, representing the positions of the original variables in relation to the new factors. The scree plot uses the eigenvector values to illustrate the relative importance of the new factors alongside the original variables.
Elastic Net Regression
Shrinking procedure similar to lasso, but made suitable for larger numbers of predictors.
Factor Analysis
Two or three unmeasured factors are identified to explain a much larger number of measured variables.
Factor Analysis Theory
ALAT (alanine aminotransferase), ASAT (aspartate aminotransferase), and gamma-GT (gamma-glutamyl transferase) provide insights into liver function, while urea, creatinine, and creatinine clearance indicate renal function. To predict morbidity and mortality from these variables, multiple regression is often employed; however, it is only valid when there is little correlation among the variables. High correlation, known as collinearity or multicollinearity, prevents the simultaneous use of these variables in a regression model and necessitates alternative methods. Factor analysis offers a solution by replacing the original variables with a limited number of new variables, or factors, that have the highest possible correlation with the originals. This multivariate technique, akin to MANOVA, presents the new factors in an orthogonal manner, effectively mitigating the effects of collinearity by ensuring that the covariance between the new factors is zero. Ultimately, factor analysis transforms manifest predictor variables into latent predictor variables.
In this way it can be considered a univariate method, but mathematically it is a multivariate method, because multiple rather than single latent variables are constructed from the predictor data available.
Factor Loadings
Factor loadings represent the correlation coefficients between the original variables and the newly estimated latent factor, while accounting for all original variables and any differences in measurement units.
Fuzzy Memberships
The universal spaces are divided into equally sized parts called membership functions.
Fuzzy Modeling
A method for modeling soft data, like data that are partially true or response patterns that are different at different times.
Fuzzy Plots
Graphs summarizing the fuzzy memberships of (for example) the input values.
Generalization
Ability of a machine learning algorithm to perform accurately on future data.
Hierarchical Cluster Analysis
Hierarchical cluster analysis rests on the idea that patients with similar characteristics may also respond similarly to, for example, a drug. It relies on large data sets and is a computationally intensive method, often classified among advanced analytical techniques in healthcare research. As a form of explorative data mining it uses the patients themselves, rather than newly constructed characteristics, as the dependent variables, which may make it more relevant for drug-efficacy questions than techniques such as factor analysis.
Internal Consistency Between the Original Variables
Contributing to a Factor in Factor Analysis
A strong correlation among the responses to the questions contributing to a single factor is essential, as all questions should ideally predict the same outcome. This correlation is quantified with Cronbach's alpha, where a value of 0 indicates a poor relationship and 1 a perfect correlation. Additionally, the test-retest reliability of the original variables should be evaluated by omitting one variable at a time; at least 80% of the data files with one variable missing should yield results consistent with those of the complete data file, indicated by alphas greater than 0.80.
Iterations
Complex mathematical models can be challenging to process, even for modern computers. Current software packages use a method known as iterations: multiple calculations are performed, and the one that best fits the criteria is selected.
Lasso Regression
Shrinking procedure slightly different from ridge regression, because it shrinks the smallest b-values to 0.
Latent Factors
Latent factors refer to underlying variables identified in factor analysis that are not directly measured but are inferred from the original data.
Learning
This term would largely fit the term “fitting” in statistics.
Learning Sample
Previously observed outcome data which are used by a neural network to learn to predict future outcome data as close to the observed values as possible.
Linguistic Membership Names
Each fuzzy membership is given a name, otherwise called linguistic term.
Linguistic Rules
The relationships between the fuzzy memberships of the input data and those of the output data.
Logistic Regression
Logistic regression is akin to linear regression but differs in that its dependent variable is binary, for example responder versus non-responder. The binary outcome is analyzed through the log odds of the response variable.
Machine Learning
Knowledge for making predictions, obtained by processing training data through a computer. Particularly the modern, computationally intensive methods are increasingly used for this purpose.
Monte Carlo Methods
Iterative testing in order to find the best fit solution for a statistical problem.
Multicollinearity or Collinearity
There should not be a strong correlation between the different original variable values in a conventional linear regression. A correlation coefficient (R) > 0.80 indicates multicollinearity and, thus, a flawed multiple regression analysis.
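A brief illustration of such a screen, flagging predictor pairs with |R| > 0.80; the simulated predictors are assumptions for demonstration only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
predictors = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.95 + rng.normal(scale=0.2, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),                          # independent predictor
})

corr = predictors.corr()                                 # Pearson correlation matrix
high = [(i, j, round(corr.loc[i, j], 2))
        for i in corr.columns for j in corr.columns
        if i < j and abs(corr.loc[i, j]) > 0.80]
print(high)                                              # pairs that would flaw a multiple regression
```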
Multidimensional Modeling
In data visualization, the y- and x-axes represent two factors, while a third factor can be illustrated with a z-axis in a 3D graph. Additional factors can be incorporated into the model, but they cannot be represented visually in 2D or 3D. Software programs, however, can handle such multidimensional calculations efficiently, akin to those used in multiple regression modeling.
Multilayer Perceptron Model
Neural network consisting of multiple layers of artificial neurons; a neuron that receives a signal beyond some threshold propagates it forward to the next layer.
Multivariate Machine Learning Methods
The methods that always include multiple outcome variables. They include discriminant analysis, canonical regression, and partial least squares.
Multivariate Method
Statistical analysis method for data with multiple outcome variables.
Network
This term would largely fit the term “model” in statistics.
Neural Network
Distribution-free method for data modeling based on layers of artificial neurons that transmit the input information.
Optimal Scaling
The problem with linear regression is that consecutive levels of the variables are assumed to be equal, while in practice this is virtually never true. Optimal scaling is a method designed to maximize the relationship between a predictor and an outcome variable by adjusting their scales. It makes use of discretization and regularization methods (see there).
Overdispersion, Otherwise Called Overfitting
The phenomenon that the spread in the data is wider than compatible with Gaussian modeling. This phenomenon is particularly common with discretization of continuous variables.
Partial Correlation Analysis
A meaningful data analysis requires accounting for significant interactions between the independent variables; otherwise, the results may be misleading. One remedy is to repeat the study with the interacting variables held constant. Alternatively, partial correlation analysis can remove the interaction effects by controlling for the interacting variables.
Partial Least Squares
Partial least squares analysis is a multivariate method that, like factor analysis, identifies latent variables, but differs in that it uses a predetermined cluster of, say, 4 or 5 predictor variables instead of all available predictors. Unlike factor analysis, which ignores the response variables, partial least squares analysis incorporates them, resulting in a better fit for the response variable. In addition, it produces correlation coefficients from multivariate linear regression rather than correlation coefficients fitted along the x- and y-axes.
Pearson's Correlation Coefficient (R)
R is a statistical measure that quantifies the strength of the relationship between two variables. Its value ranges from −1 to +1, where 0 indicates no correlation, −1 a perfect negative correlation, and +1 a perfect positive correlation. The stronger the association, the better one variable predicts the other.
Principal Components Analysis
Radial Basis Functions
Symmetric functions around the origin, the equations of Gaussian curves are radial basis functions.
Radial Basis Function Network
Neural network, that, unlike the multilayer perceptron network, uses Gaussian instead of sigmoidal activation functions for transmission of signals.
Regularization
Correcting discretized variables for overfitting, otherwise called overdispersion.
Ridge Regression
Important method for shrinking b-values for the purpose of adjusting overdispersion.
Splines
Cut pieces of a non-linear graph, originally thin wooden strips for modeling cars and airplanes.
Supervised Learning
Machine learning using data that include both input and output data (exposure and outcome data).
Training Data
The output data of a supervised learning data set.
Triangular Fuzzy Sets
A common way of drawing the membership function, with the input values on the x-axis and the membership grade for each input value on the y-axis.
Universal Space
A term often used with fuzzy modeling: the defined range of input values and the defined range of output values.
Unsupervised Learning
Machine learning using data that include only input data (exposure data).
Varimax Rotation
A "two-factor" analysis demonstrates that slight rotations of the x- and y-axes can improve the fit of the model. This simultaneous rotation, known as varimax rotation, assumes that the two new factors are completely independent. If independence does not apply, the x- and y-axes can be rotated separately to achieve the best model fit for the data.
Weights
This term would largely fit the term “parameters” in statistics.
Machine learning continues to evolve and introduces a variety of novel terms that are essential for understanding the field. Additional terminology is explored in the upcoming chapters, and a comprehensive index provides a clearer overview of these concepts.
Sometimes machine learning is discussed as a discipline conflicting with statistics [1].
There are notable differences in terminology and approach between machine learning and statistics. Machine learning is primarily conducted by computer scientists, who often come from diverse fields such as psychology, biology, and economics rather than from traditional statistics with its strong mathematical foundations. Computer science is a modern field, with appealing terminology, lucrative job opportunities, and better income prospects than statistics. It typically deals with larger and more complex data sets and emphasizes prediction modeling over null-hypothesis testing. However, the limited mathematical training of computer scientists can leave them unaware of the limitations of the models they employ.
1 Machine learning is a powerful tool for making predictions based on the analysis of training data processed by computers. Given the complexity of data sets with multiple variables, advanced computational methods are essential for effective analysis. This book reviews important machine learning techniques applicable to health care and research, despite their limited use in the field to date.
2 Logistic regression is one of the earliest machine learning techniques applied in health research, used for health profiling by predicting the risk of medical events in individuals based on specific combinations of predictor variables.
3 A wonderful method for analyzing imperfect data with multiple variables is optimal scaling (Chaps 3 and 4).
4 Partial correlations analysis is the best method for removing interaction effects from large clinical datasets (Chap 5).
5 Mixed linear modeling, binary partitioning, item response modeling, time-dependent predictor analysis, and autocorrelation are linear or log-linear regression methods suitable for evaluating data with, respectively, repeated measures, binary decision trees, exponential exposure-response relationships, values that differ over time, and seasonal differences. These methodologies are discussed in Chapters 6 through 10.
6 Clinical data sets exhibiting non-linear relationships between exposure and outcome variables necessitate specialized analysis techniques. Neural network methods, such as multilayer perceptron networks and radial basis function networks, are often effective for this type of analysis.
7 Clinical data featuring multiple exposure variables are typically analyzed with analysis of (co)variance (AN(C)OVA); however, this method fails to sufficiently address the relative significance of the variables and their interactions. Factor analysis and hierarchical cluster analysis effectively overcome these limitations.
8 Data with multiple outcome variables are usually analyzed with multivariate analysis of (co)variance (MAN(C)OVA), which shares the limitations of ANOVA. Methods such as partial least squares analysis, discriminant analysis, and canonical regression overcome these limitations.
9 Fuzzy modeling is a method suitable for modeling soft data, like data that are partially true or response patterns that are different at different times (Chap 19).
1 O’Connor B (2012) Statistics versus machine learning, fight http://brenocon.com/blog/2008/12 Accessed 25 Aug 2012
Logistic regression can be used for predicting the probability of an event in subjects at risk.
Methods and Results
It uses log-linear models of the kind shown underneath (ln = natural logarithm, a = intercept, b = regression coefficient, x = predictor variable): ln odds = a + bx.
A study involving 1,000 participants of varying ages tracked myocardial infarction occurrences over a decade. These data enabled risk calculations for predicting future myocardial infarction events in individual subjects.
Conclusions
1 The methodology is currently an important way to determine, with limited health care sources, what individuals are at low risk and will, thus, be:
(3) given the assignment to be treated or not
(4) given the “do not resuscitate sticker”
Logistic Regression for Health Profiling
2 We must take into account that some of the predictor variables may be heavily correlated with one another, and the results may, therefore, be inflated
3 Also, the calculated risks may be true for subgroups, but for individuals less so, because of the random error
Logistic regression is a statistical method used to predict the likelihood of an event, such as an infarction. It calculates the odds of an infarction from data observed in a group of patients, allowing for informed decision-making in medical assessments.
The odds of an infarction in a group is correlated with age: the older the patient, the larger the odds.
According to Fig 2.1 the odds of infarction is correlated with age, but we may ask how?
According to Fig 2.2 the relationship is not linear, but after transformation of the odds values on the y-axis into log odds values the relationship is suddenly linear.
We will, therefore, transform the linear equation y = a + bx
Fig 2.1 In a group of multiple ages the number of patients at risk of infarction is given by the dotted line
into a log-linear equation (ln = natural logarithm): ln odds = a + bx (x = age).
Our group consists of 1,000 subjects of different ages who have been observed for 10 years for myocardial infarctions. Using SPSS statistical software, we command: binary logistic regression; dependent variable: infarction yes/no; independent variable: age.
The program produces a regression equation: ln odds = ln (patients with infarction / patients without infarction) = a + bx.
Age is, thus, a significant determinant of the odds of infarction (which can be used as a surrogate for the risk of infarction).
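For readers working outside SPSS, a minimal sketch of an analogous univariate fit in Python (statsmodels) is shown below; the simulated ages, outcomes, and coefficient values are illustrative assumptions, not the book's data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
age = rng.uniform(40, 80, n)                  # ages of the 1,000 subjects
true_lnodds = -9.2 + 0.12 * age               # assumed 'true' a and b, for simulation only
p = 1 / (1 + np.exp(-true_lnodds))            # logistic link: probability of infarction
infarct = rng.binomial(1, p)                  # observed yes/no outcome over 10 years

X = sm.add_constant(age)                      # adds the intercept a
fit = sm.Logit(infarct, X).fit(disp=False)    # binary logistic regression
print(fit.params)                             # a (const) and b (age) on the ln-odds scale
print(fit.pvalues)                            # significance of age as determinant of ln odds
```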
Fig 2.2 Relationships between the odds of infarction and age
Then, we can use the equation to predict the odds of infarction from a patient's age: ln odds = a + b × age, and odds = e^(ln odds).
The likelihood of infarction can be assessed more accurately by analyzing multiple independent variables. For instance, a study tracking 10,000 patients over a decade records both infarction occurrences and baseline characteristics, allowing for a comprehensive evaluation of the factors influencing the risk of heart attacks.
(predictors): gender, age, BMI (body mass index), systolic blood pressure, cholesterol, heart rate, diabetes, antihypertensives, previous heart infarct, smoker.
The data are entered in SPSS, and it produces the b-values (predictors of infarction) with their p-values.
It is decided to exclude predictors that have a p-value > 0.10
The regression equation underneath is used to calculate the best predictable y-value from every single combination of x-values: ln odds(infarct) = a + b1x1 + b2x2 + b3x3 + …
2 Logistic Regression for Health Profi ling
For instance, for a subject with a given set of characteristics (= predictor variables), for example smoker (x10), the calculated odds of having an infarction in the next 10 years follow from entering the subject's x-values and the corresponding b-values into the equation:
ln odds(infarct) = −0.5522, so odds(infarct) = e^(−0.5522) ≈ 0.58 = 58/100.
The odds is often interpreted as risk. However, the true risk is a bit smaller than the odds, and can be found from the equation: risk of event = 1 / (1 + 1/odds).
If odds of infarction = 0.58, then the true risk of infarction = 0.37
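A minimal sketch of this profiling step, with placeholder a-, b-, and x-values (the book's actual coefficients and patient characteristics are not shown here):

```python
import numpy as np

a = -4.0                                          # intercept (assumed)
b = np.array([0.5, 0.04, 0.02, 0.01, 0.3, 0.6])   # regression coefficients (assumed)
x = np.array([1, 65, 140, 5.5, 1, 1])             # one subject's predictor values (assumed)

ln_odds = a + b @ x                               # ln odds(infarct) = a + b1x1 + b2x2 + ...
odds = np.exp(ln_odds)                            # back-transform to the odds scale
risk = 1 / (1 + 1 / odds)                         # risk = 1 / (1 + 1/odds), i.e. odds/(1 + odds)
print(ln_odds, odds, risk)
```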
The above methodology is currently an important way to determine, with limited health care sources, what individuals will be:
3 given the assignment to be treated or not
4 given the “do not resuscitate sticker”
A comprehensive database is essential for accurately determining the b-values in logistic models that convert predictor variables into event probabilities for individual subjects. This approach is prevalent in medicine, exemplified by the TIMI (thrombolysis in myocardial infarction) risk score, and is also used in strategic management, psychological testing, and other fields. While linear regression typically uses the squared correlation coefficient (r²) to assess model fit, logistic models lack a direct equivalent. Nevertheless, pseudo-R² and similar metrics have been developed to evaluate the strength of association between predictors and events in these models.
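As one hedged illustration, McFadden's pseudo-R² compares the log-likelihood of the fitted model with that of an intercept-only (null) model; assuming a statsmodels Logit result such as `fit` from the earlier sketch, it can be obtained as follows:

```python
# McFadden's pseudo R-squared: 1 - lnL(model) / lnL(null model)
mcfadden = 1 - fit.llf / fit.llnull   # statsmodels also exposes this as fit.prsquared
print(round(mcfadden, 3))
```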
Logistic regression is a valuable tool for predicting the risk of events such as cardiovascular incidents, cancer, and death. However, its limitations must be recognized: the method relies on observational data, which can lead to misinterpretation of the results.
1 The assumption that baseline characteristics are independent of treatment efficacies may be wrong
2 Sensitivity of testing is jeopardized if the models do not fit the data well enough
3 Relevant clinical phenomena like unexpected toxicity effects and complete remissions can go unobserved
4 The inclusion of multiple variables in regression models raises the risk of clinically unrealistic results
A study was conducted to evaluate the risk factors for endometrial cancer among postmenopausal women, with the aim of identifying the determinants of this cancer in this group. A logistic regression model was employed in which the dependent variable was the natural logarithm of the odds of developing endometrial cancer, and one of the independent variables was short-term estrogen consumption.
Table 2.1 Examples of predictive models where multiple logistic regression has been applied
Dependent variable (odds of event) and independent variables (predictors):
1 Odds of infarction; predictors: age, comorbidity, comedication, risk factors
2 Car producer (strategic management research) [2]: odds of a successful car; predictors: cost, size, horse power, ancillary properties
3 Item response modeling (Rasch models for computer-adapted tests) [3]: odds of a correct answer to three questions of different difficulty; predictors: correct answers to three previous questions
Long-term estrogen consumption significantly impacts women's health through a series of interconnected risks. Specifically, it is associated with a low fertility index, obesity, hypertension, and early menopause, all of which contribute to an increased likelihood of developing endometrial cancer. The odds ratios show that consumers of estrogen have a higher chance of cancer than non-consumers, and that patients with low fertility and obesity also have elevated cancer risks. These findings underscore the importance of the regression coefficients, standard errors, and p-values when assessing the odds ratios of these risk factors.
The analysis revealed significant b-values indicating a heightened cancer risk associated with estrogen consumption, low fertility, obesity, and hypertension, with a risk estimate of up to 76-fold. However, this figure is probably exaggerated because of the strong correlations among these variables. While logistic regression is a valuable exploratory research tool, its findings should be interpreted cautiously, particularly when applied to individual health profiling: the calculated risks may hold true for subgroups, but less so for individuals, because of random error.
Logistic regression can be used for predicting the probability of an event. It uses log-linear models of the kind: ln odds = a + b1x1 + b2x2 + …
The methodology is currently an important way to determine, with limited health care sources, what individuals will be:
3 given the assignment to be treated or not
4 given the “do not resuscitate sticker”
It must be recognized that some of the predictor variables may be strongly correlated, leading to inflated results. Additionally, the calculated risks may apply to subgroups but less so to individuals, because of random error.
1 Antman EM, Cohen M, Bernink P, McCabe CH, Horacek T, Papuchis G, Mautner B, Corbalan R, Radley D, Braunwald E (2000) The TIMI risk score for unstable angina pectoris, a method for prognostication and therapeutic decision making J Am Med Assoc 284:835–842
2 Hoetner G (2007) The use of logit and probit models in strategic management research Strateg Manag J 28:331–343
3 Rudner LM Computer adaptive testing http://edres.org/scripts/cat/catdemo.htm Accessed 18 Dec 2012
In clinical trials, researchers frequently use multiple regression analysis to evaluate various variables related to their research questions. However, a key limitation of multiple regression is its assumption that the consecutive levels of these variables are equal, which is rarely the case in practice. Optimal scaling addresses this issue by maximizing the relationship between the predictor and outcome variables through adjustment of their scales.
A simulated drug efficacy trial involving 27 variables was conducted to compare optimal scaling with traditional multiple linear regression. The analysis used the Optimal Scaling module in SPSS.
Results
The two methods produced similarly sized results, with 7 versus 6 p-values < 0.10 and 3 versus 4 p-values < 0.010, respectively.
Conclusions
1 Optimal scaling using discretization is a method for analyzing clinical trials where the consecutive levels of the variables are unequal
2 In order to fully benefit from optimal scaling a regularization procedure for the purpose of correcting overdispersion is desirable
In clinical trials, research questions are often assessed with multiple variables, for example gene expressions to predict the efficacy of cytostatic treatment, repeated measurements in randomized longitudinal trials, and multi-item personal scores for evaluating antidepressants. Multiple linear regression analysis is commonly employed to evaluate the effects of the predictor variables on the outcome variables; a limitation of this method, however, is the assumption that consecutive levels of the predictor variables are equal, which is rarely the case in practice. For instance, a continuous predictor variable scored on a scale of 0-10 may have missing values at certain points, and various alternative scales, such as those with two or four parts, can be applied, highlighting the arbitrary nature of scale selection.
Table 3.1 illustrates that each scale yields distinct results, with scales 2 and 3 showing a gradual improvement in t-values and p-values. Optimal scaling, described by Gifi (Department of Data Theory, Leiden, 1990), aims to maximize the relationship between predictor and outcome variables with a computationally intensive method that uses quadratic approximation to identify the best scale for the data. The technique is extensively discussed in the statistical literature and is an important part of machine learning, the field concerned with enabling computers to make predictions from complex empirical data. Its application in clinical research remains limited: a Medline search identified only a few genetic [11, 12], epidemiological [13, 14], and psychological [15] studies, and very few therapeutic trials [16, 17], despite its pleasant property of improving the p-values of testing and, thus, turning negative results into positive ones.
This chapter presents a simulated example to evaluate the performance of optimal scaling in a multiple-variables model. Our aim is to stimulate clinical researchers to adopt this improved analytical approach for predictive trials.
Fig 3.1 Linear regression analysis. An example of a continuous predictor variable (x-variable) on a scale 0-10. Patients with the predictor values 0, 1, 5, 9 and 10 are missing
Table 3.1 Linear regression analysis of the data from Fig 3.1 using three different scales
With the scales 2 and 3 a gradual improvement of the t-values and p-values is observed
a Dependent variable: outcome
Cross-Validation
Splitting the data into a k-fold scale and comparing it with a k−1 fold scale.
Discretization
Converting continuous variables into discretized values in a regression model.
Elastic Net Regression
Shrinking procedure similar to lasso, but made suitable for larger numbers of predictors.
Lasso Regression
Shrinking procedure slightly different from ridge regression, because it shrinks the smallest b-values to 0.
Overdispersion, Otherwise Called Overfitting
The phenomenon that the spread in the data is wider than compatible with Gaussian modeling. This phenomenon is particularly common with discretization of continuous variables.
Monte Carlo Methods
Iterative testing in order to find the best fit solution for a statistical problem.
Regularization
Correcting discretized variables for overfitting, otherwise called overdispersion.
Ridge Regression
Important method for shrinking b-values for the purpose of adjusting overdispersion.
Splines
Cut pieces of a non-linear graph, originally thin wooden strips for modeling cars and airplanes.
Optimal scaling makes use of processes like discretization (converting continuous variables into discretized values) and regularization (correcting discretized variables for overfitting, otherwise called overdispersion) [8–11].
To convert continuous data into a discrete model, the quadratic approximation is an effective method; it is based on the expansion f(x) ≈ f(a) + f′(a)(x − a), where f′(a) denotes the first derivative of the function at the point a. The approximation relies on the principle that the quadratic model is the simplest alternative to a linear model, and the first derivative, which gives the slope of the function, captures the function's magnitude. The technique is useful for analyzing complex functions, such as standard errors, and for determining the optimal discretization distance between a given x-value and its nearest a-value, which serves as the best-fit scale for the data.
To enhance the best fit scale for a variable, SPSS offers the capability to split a linear variable into two segments, known as splines, which allows for the combination of two linear functions to effectively model non-linear patterns.
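A minimal sketch of the discretization step, loosely mirroring the SPSS option "Discretize: Method Grouping, Number categories 7" used below; it relies on scikit-learn's KBinsDiscretizer, and the simulated predictor is an illustrative assumption:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(3)
gene_expression = rng.uniform(0, 10, size=(250, 1))       # one predictor on the 0-10 scale

disc = KBinsDiscretizer(n_bins=7, encode="ordinal", strategy="quantile")
categories = disc.fit_transform(gene_expression)          # values 0..6, one group per bin
print(np.unique(categories, return_counts=True))          # roughly equal-sized groups
```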
The study analyzed a data file of 250 patients with 27 variables, comprising microarray gene expression levels and drug efficacy scores; the complete data file is given in the appendix. All variables were standardized on an 11-point linear scale ranging from 0 to 10. The genes 1-4, 16-19, and 24-27 were highly expressed. As outcome, composite scores of the variables 20-23 were used.
The analysis used SPSS statistical software to perform a traditional multiple linear regression, with the gene expression levels as predictors and the drug efficacy composite score as outcome variable. The results, given in Table 3.2, show a significant overall r-square value of 0.725. In order to improve the scaling of the linear regression model, the Optimal Scaling program of SPSS was used.
Command: Analyze….Regression….Optimal Scaling….Dependent Variable: Var 28 (Define Scale: mark spline ordinal 2.2)….Independent Variables: Var 1, 2, 3, 4, 16, 17, 18, 19, 24, 25, 26, 27 (all of them Define Scale: mark spline ordinal 2.2)….Discretize: Method Grouping, Number categories 7….OK
Table 3.3 presents the results, indicating that the intercept has been removed and the t-tests replaced with F-tests. The optimally scaled model, which does not use regularization, shows effects similar in size to those of the traditional multiple linear regression model, with a slightly higher overall R-squared value of 0.736 versus 0.725, and an increase from 3 to 4 in the number of p-values < 0.010. To fully benefit from optimal scaling, a regularization procedure to correct overdispersion is recommended; this is discussed in the next chapter.
Traditional linear regression struggles with multiple independent variables because of the unequal levels of these variables. Optimal scaling is an effective method to address this issue, but it may lose power as a result of overdispersion.
Table 3.2 Traditional multiple linear regression with the drug efficacy score (a composite score of the variables 20-23) as outcome and 12 gene expression levels as predictors
A sharp increase in the t-values for certain x-values can occur when other x-values are removed, a phenomenon known as unstable regression coefficients or "bouncing betas." This instability often arises from correlated predictors or from having too many predictors relative to the number of observations. Shrinking the regression coefficients has been shown to correct overdispersion and mitigate this instability.
Optimal scaling is a computationally intensive method often categorized as machine learning, because it generates predictive models through computer programs. Other machine learning techniques include factor analysis, partial least squares (PLS), canonical regression, item response modeling, and neural networks. Optimal scaling is not a competitor to these methods; it offers a distinct approach that is particularly advantageous when the scales of the variables lack homogeneity.
For those interested in analyzing the combined effects of particular subsets of variables, methods such as factor analysis [20] and partial least squares [21] are more appropriate. If the primary focus is to evaluate the collective effect of all predictor variables, canonical regression [22] is the most suitable choice.
The method has notable limitations. The scale intervals are assumed to be independent of the magnitude of the outcome variable, which may not always hold true. The spread of the data can be exaggerated by wide scale intervals. Also, the use of multiple alternative scales may yield better results.
The current chapter shows that, in order to fully benefit from optimal scaling, a regularization procedure is desirable; three such methods (ridge, lasso, and elastic net regression) are explained in the next chapter.
Table 3.3 Optimal scaling without regularization (columns: standardized coefficients, bootstrap (1,000) estimate of standard error, df, F, Sig.)
1 Optimal scaling using discretization is a method for analyzing clinical trials where the consecutive levels of the variables are unequal
2 In order to fully benefit from optimal scaling a regularization procedure for the purpose of correcting overdispersion is desirable
Appendix: Datafile of 250 Subjects Used as Example
1 Tsao DA, et al (2010) Gene expression profiles for predicting the efficacy of the anticancer drug 5-fluorouracil in breast cancer DNA Cell Biol
2 Latan et al (2012) Microemulsion-based tacrolimus cream suppresses cytokine gene expression and improves efficacy in atopic dermatitis Drug Deliv Transl Res
3 Albertin PS (1999) Longitudinal data analysis (repeated measures) in clinical trials Stat Med 18:2863–2870
4 Yang X, Shen Q, Xu H, Shoptaw S (2007) Functional regression analysis using an F test for longitudinal data with large numbers of repeated measures Stat Med 26:1552–1566
5 Sverdlov L (2001) The fastclus procedure as an effective way to analyze clinical data In: SUGI proceedings 26, paper 224, Long Beach, CA
6 Gifi A (1990) Non linear multivariate analysis Department of Data Theory, Leiden
7 Alpaydin E (2004) Introduction to machine learning http://books.google.com Accessed 25 June 2012
8 Van der Kooij AJ (2007) Prediction accuracy and stability of regression with optimal scaling transformations Ph.D thesis, Leiden University, Netherlands
9 Hojsgaard S, Halekoh U (2005) Overdispersion Danish Institute of Agricultural Sciences Copenhagen http://gbi.agrsci.dk/statistics/courses Accessed 18 Dec 2012
10 Wang L, Gordon MD, Zhu J (2006) Regularized least absolute deviations regression and an effi cient algorithm for parameter tuning In: Sixth international conference data mining 2006 doi: 10.1109/ICDM.2006.134
11 Waaijenberg S, Zwinderman AH (2007) Penalized canonical correlation analysis to quantify the association between gene expression and DNA markers BMC Proc 1(Suppl 1): S122–S125
12 Yoshiwara et al (2010) Gene expression profile for predicting survival of patients with advanced-stage serous ovarian cancer, analyzed in two independent datasets PLoS One
13 Gururajan R, Quaddus M, Xu J (2008) Clinical usefulness of handheld wireless technology in healthcare J Syst Info Technol 10:72–85
14 Kitsiou S, Manthou V, Vlachopoulou M, Markos A (2010) Adoption and sophistication of clinical information systems in Greek public hospitals 12th Med Conf Medical Biological Engineering 29:1011–1016
15 Hartmann A, Van der Kooij AJ, Zeeck A (2009) Models of clinical decision making by regression with optimal scaling Psychother Res 19:482–492
16 Triantafilidou K, Venetis G, Markos A (2012) Short term results of autologous blood injection for treatment of habitual TMJ luxation J Craniofac Surg 23(3):689–692
17 Li Y (2008) Statistical methods in surrogate marker research http://deepblue.lib.umich.edu/handle Accessed 18 Dec 2012
18 SPSS statistical software (2012) www.spss.com Accessed 12 June 2012
19 Cleophas TJ, Zwinderman AH (2011) Statistics applied to clinical studies Springer, New York
20 Bartholomew DJ (1995) Spearman and the origin and development of factor analysis Br J Math Stat Psychol 48:211–220
21 Wold H (1966) Estimation of principal components and related models by iterative least squares In: Krishnaiah PR (ed) Multivariate analysis Academic Press, New York, pp 391–420
22 Sun et al (2009) On the equivalence between canonical correlation analysis and orthonormalized partial least squares In: Proceedings of the IJCAI conference on artificial intelligence Morgan Kaufman Publishers, San Francisco
23 Sherlock C, Roberts G (2009) Optimal scaling of random walk Bernoulli 15:774–798
In the previous chapter we highlighted a significant limitation of linear regression in clinical research: it assumes equal intervals between consecutive levels of the predictor variables (x-variables), which rarely reflects real-world scenarios. This discrepancy affects the analysis of the effects of these predictors on the outcome variables (y-axis variables).
Objective
In the current chapter we will address the subject of regularization, a method for correcting discretized variables for overdispersion.
Methods
Ridge regression, lasso regression, and elastic net regression will be demonstrated using the example from the previous chapter once more.
Results
The ridge optimal scaling model produced eight p-values < 0.01, while traditional regression and unregularized optimal scaling produced only 3 and 2 p-values < 0.01, respectively.
Lasso optimal scaling eliminated 4 of 12 predictors from the analysis, while, of the remainder, only two were significant at p < 0.01. Similarly, elastic net optimal scaling did not provide additional benefit.
Conclusions
1 Optimal scaling shows similarly sized effects compared to traditional regression. In order to benefit from optimal scaling a regularization procedure for the purpose of correcting overdispersion is desirable
2 Ridge optimal scaling performed much better than did traditional regression, giving rise to many more statistically significant predictors
3 Lasso optimal scaling shrinks some b-values to zero, and is particularly suitable if you are looking for a limited number of strong predictors
4 Elastic net optimal scaling works better than lasso if the number of predictors is larger than the number of observations
In the previous chapter we noted that linear regression, frequently used in clinical research to assess the effect of predictor variables (x-variables) on outcome variables (y-axis variables), has a significant limitation: it assumes that consecutive levels of the predictor variables are equal, a condition that rarely holds true in practice.
Optimal scaling significantly enhances the sensitivity of testing, but the risk of overdispersion of the data must be addressed to fully benefit from the methodology. Robert Tibshirani, professor of statistics at Stanford University, proposed solutions in 1996 [1], showing that shrinking the regression coefficients can improve the fit of regression models.
In the current chapter, using the example from the previous chapter, we will address the subject of regularization, a method for correcting discretized variables for overfitting, otherwise called overdispersion.
Discretization
Converting continuous variables into discretized values in a regression model
Splines
Cut pieces of a non-linear graph, originally thin wooden strips for modeling cars and airplanes.
Overdispersion, Otherwise Called Overfitting
The phenomenon that the spread in the data is wider than compatible with Gaussian modeling. This phenomenon is particularly common with discretization of continuous variables.
Regularization
Correcting discretized variables for overfitting, otherwise called overdispersion.
Ridge Regression
Important method for shrinking b-values for the purpose of adjusting overdispersion
Iterative testing in order to find the best fit solution for a statistical problem.
Cross-Validation
Splitting the data into a k-fold scale and comparing it with a k-1 fold scale.
Lasso Regression
Shrinking procedure slightly different from ridge regression, because it shrinks the smallest b-values to 0
Elastic Net Regression
Shrinking procedure similar to lasso, but made suitable for larger numbers of predictors
Optimal scaling uses techniques such as discretization, which transforms continuous variables into discrete values, and regularization, which adjusts these discrete variables for overfitting, also known as overdispersion. Discretization was discussed in the previous chapter.
Regularization is needed to correct overdispersed models and typically increases the standard error. Various adjustment methods exist; Hojsgaard and Halekoh recommend the [chi-square/degrees of freedom] ratio. A common approach is ridge regression, which shrinks the regression coefficient b with a shrinking factor λ: b_ridge = b / (1 + λ). A suitable value of λ can yield a better-fitting scale model than the traditional linear model; a Monte Carlo approach allows multiple values to be tested to identify the best-fit scale, and k-fold cross-validation is the standard technique for this evaluation. In addition to ridge regression, SPSS provides lasso regression, which shrinks the smallest b-values to zero, improving prediction accuracy when only a few strong predictors are present. Conversely, ridge regression is preferable for complex models with many weak predictors. Elastic net regression combines features of lasso and ridge, and performs better when the number of predictors exceeds the number of observations.
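A minimal sketch comparing the three shrinkage procedures on simulated data with correlated predictors (scikit-learn); the data, penalty strengths, and variable layout are illustrative assumptions rather than the book's SPSS regularization paths:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n, p = 250, 12
latent = rng.normal(size=(n, 1))
X = latent + rng.normal(scale=0.7, size=(n, p))            # 12 correlated "gene expressions"
y = X[:, :4].sum(axis=1) + rng.normal(scale=2.0, size=n)   # composite outcome score

models = {
    "ridge":       Ridge(alpha=1.0),                        # shrinks b-values, never exactly to 0
    "lasso":       Lasso(alpha=0.5),                         # shrinks the smallest b-values to 0
    "elastic net": ElasticNet(alpha=0.5, l1_ratio=0.5),      # mixture of the two penalties
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()        # k-fold cross-validation (k = 5)
    coefs = model.fit(X, y).coef_
    print(f"{name}: cv R^2 = {score:.2f}, zero coefficients = {(coefs == 0).sum()}")
```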
The data file of 250 patients, described in the previous chapter and given in the appendix, contains 27 variables comprising microarray gene expression levels and drug efficacy scores. Each variable was standardized on an 11-point linear scale ranging from 0 to 10. The following genes were highly
expressed: the genes 1-4, 16-19, and 24-27. As outcome variable, composite scores of the variables 20-23 were used.
The optimally scaled model without regularization shows effects comparable to those of the traditional regression model, with a slightly higher overall R-squared value of 0.736. To fully benefit from optimal scaling, a regularization procedure is needed to correct overdispersion. First, a ridge path model is used, with SPSS statistical software for the analysis.
Command: Analyze….Regression….Optimal Scaling….Dependent Variable: Var 28 (Define Scale: mark spline ordinal 2.2)….Independent Variables: Var 1, 2, 3, 4, 16, 17, 18, 19, 24, 25, 26, 27 (all of them Define Scale: mark spline ordinal 2.2)….Discretize: Method Grouping, Number categories 7….click Regularization….mark Ridge….OK
Figure 4.1 presents the adjusted b-values of the optimal ridge scale model, which are detailed in Table 4.1. The graph shows that the b-values of the various predictors rise progressively as the shrinking factor λ decreases. The right vertical line indicates the situation in which the spread in the data has increased by one standard error beyond the best model represented by the left line.
Fig 4.1 Optimal scaling with ridge regression. The adjusted b-values of the best-fit scale model are given by the left vertical line and are detailed in Table 4.1. Going from left to right in the graph, the shrinking factor λ decreases and the b-values of the various predictors gradually increase. The right vertical line indicates the situation in which the spread in the data has risen by one standard error above the best model, with the fit of the model deteriorated correspondingly
The sensitivity of this model is much better than that of the traditional regression, with 8 p-values < 0.01, while the traditional and unregularized optimal scaling models produced only 3 and 2 p-values < 0.01, respectively.
Also the lasso regularization model is possible (Var = variable).
Command: Analyze….Regression….Optimal Scaling….Dependent Variable: Var 28 (Define Scale: mark spline ordinal 2.2)….Independent Variables: Var 1, 2, 3, 4, 16, 17, 18, 19, 24, 25, 26, 27 (all of them Define Scale: mark spline ordinal 2.2)….Discretize: Method Grouping, Number categories 7….click Regularization….mark Lasso….OK
The adjusted b-values of the optimal lasso scale model, illustrated in Fig 4.2 and detailed in Table 4.2, show that the b-values of the genes 1, 3, 25, and 27 have been shrunk to zero and thus removed from the analysis. Lasso regression is particularly effective for identifying a limited set of strong predictors, improving prediction accuracy by excluding the weaker ones.
Finally, the elastic net method is applied.
Command: Analyze….Regression….Optimal Scaling….Dependent Variable: Var 28 (Define Scale: mark spline ordinal 2.2)….Independent Variables: Var 1, 2, 3, 4, 16, 17, 18, 19, 24, 25, 26, 27 (all of them Define Scale: mark spline ordinal 2.2)….Discretize: Method Grouping, Number categories 7….click Regularization….mark Elastic Net….OK
Table 4.3 presents results that closely align with those obtained using lasso. In this example, elastic net does not offer additional benefit, but it outperforms lasso when the number of predictors exceeds the number of observations.
Table 4.1 Optimal scaling with ridge regression (columns: standardized coefficients, bootstrap (1,000) estimate of standard error, df, F, Sig.)
Table 4.2 Optimal scaling with lasso regression (columns: standardized coefficients, bootstrap (1,000) estimate of standard error, df, F, Sig.)
Fig 4.2 Optimal scaling with lasso regression. The adjusted b-values of the best-fit scale model are given by the left vertical line, with the corresponding values in Table 4.2. As the shrinking factor λ decreases from left to right in the graph, the b-values of the various predictors progressively increase. The right vertical line represents the situation in which the spread in the data has risen by one standard error above the best model, with a corresponding deterioration of the model
Traditional linear regression struggles with multiple independent variables and often produces unstable regression coefficients, commonly known as "bouncing betas." This instability occurs when predictors are correlated or when there are many predictors relative to the number of observations. Shrinking the regression coefficients corrects overdispersion and mitigates this instability.
Optimally scaled modeling showed effects comparable to traditional linear regression. To fully benefit from optimal scaling, a regularization procedure to correct overdispersion is recommended.
The ridge optimal scaling demonstrates superior sensitivity compared to traditional linear regression, resulting in a greater number of significant predictors in your model.
Lasso optimal scaling shrinks some variable b-values to zero, making it ideal for identifying a limited number of strong predictors. Elastic net optimal scaling outperforms lasso when the number of predictors exceeds the number of observations.
We hope this paper will stimulate clinical investigators to start using this optimized analysis method for predictive trials.
Table 4.3 Optimal scaling with elastic net (ridge 0.00 through 1.00) (columns: standardized coefficients, bootstrap (1,000) estimate of standard error, df, F, Sig.)
1 Optimally scaled modeling produces effects comparable to traditional linear regression, but to fully benefit from optimal scaling, a regularization procedure for correcting overdispersion is essential
2 Particularly, the sensitivity of the ridge optimal scaling may be better than that of traditional linear regression, giving rise to more significant predictors in the data
3 Lasso optimal scaling shrinks some variable b-values to zero, and is, therefore, particularly suitable if you are looking for a limited number of strong predictors
4 Elastic net optimal scaling works better than lasso if the number of predictors is larger than the number of observations
Appendix: Datafile of 250 Subjects Used as Example
1 Tibshirani R (1996) Regression shrinkage and selection via the lasso J R Stat Soc 58:267–288
2 Alpaydin E (2004) Introduction to machine learning http://books.google.com Accessed 25 June 2012
3 Van der Kooij AJ (2007) Prediction accuracy and stability of regression with optimal scaling transformations Ph.D thesis Leiden University, Netherlands
4 Hojsgaard S, Halekoh U (2005) Overdispersion Danish Institute of Agricultural Sciences, Copenhagen http://gbi.agrsci.dk/statistics/courses Accessed 18 Dec 2012
5 Wang L, Gordon MD, Zhu J (2006) Regularized least absolute deviations regression and an effi cient algorithm for parameter tuning 6th Int Conf Data Min doi: 10.1109/ICDM.2006.134
6 SPSS statistical software (2012) www.spss.com Accessed 12 June 2012
Clinical research outcomes are influenced by numerous interrelated factors, yet multiple regression analysis assumes that these factors operate independently. This raises the question of why these variables should not be allowed to affect one another.
To assess the performance of partial regression analysis for the assessment of clinical trials with interaction between predictor factors.
A simulated 64 patient study of the effects of exercise on weight loss with calorie intake as covariate and a significant interaction on the outcome between the covariates.
The simple linear correlations of weight loss versus exercise and versus calorie intake were respectively 0.41 (p = 0.001) and −0.30 (p = 0.015) Multiple linear regression adjusted for interaction showed that exercise was no longer a significant