One method examines, for each predictor variable $X_j$, its relationship with the response variable individually. However, this method is at a disadvantage
when there are interactions present. Another method is best subset selection, which looks at the change in predictive accuracy for each subset of predictors. When the number of parameters becomes large, examining each possible subset becomes computationally infeasible. Methods such as forward selection and backwards elimination are also not likely to yield the optimal subset in this case. The third method uses all of the X's to generate a model and then uses the model to examine the relative importance of each variable in the model. Random Forests and its derivatives are machine learning tools that were created primarily as a predictive model and secondarily as a way to rank the variables in terms of their importance to the model. Random Forests are growing increasingly popular in genetics and bioinformatics research. They are applicable in small n, large p problems and can deal with high-order interactions and non-linear relationships. Although there are many machine learning techniques that are applicable for data of this type and can give measures of variable importance, such as Support Vector Machines (Vapnik 1998; Rakotomamonjy 2003), neural networks (Bishop 1995), Bayesian variable selection (George and McCulloch 1993; George and McCulloch 1997; Kuo and Mallick 1999; Kitchen et al., 2007) and k-nearest neighbors (Dasarathy 1991), we will concentrate on Random Forests because of their relative ease of use, popularity and computational efficiency.
2 Trees and Random Forests
Classification and regression trees (Breiman et al., 1984) are flexible, nonlinear and nonparametric. They produce easily interpretable binary decision trees but can also overfit and become unstable (Breiman 1996; Breiman 2001). To overcome this problem several advances have been suggested. It has been shown that for some splitting criteria, recursive binary partitioning can induce a selection bias towards covariates with many possible splits (Loh and Shih 1997; Loh 2002; Hothorn et al., 2006). The key to producing unbiasedness is to separate the variable selection and the splitting procedure (Loh and Shih 1997; Loh 2002; Hothorn et al., 2006). The conditional inference trees framework was first developed by Hothorn et al. (Hothorn et al., 2006). These trees select variables in an unbiased way and are not prone to overfitting. Let $w = (w_1, \ldots, w_n)$ be a vector of non-negative integer-valued case weights, where the weights are non-zero when the corresponding observations are included
in the node and 0 otherwise. The algorithm is as follows:
1. At each node, test the null hypothesis of independence between any of the X's and the response Y, that is, test $P(Y \mid X_j) = P(Y)$ for all $j = 1, \ldots, p$. If the null hypothesis cannot be rejected at some pre-specified alpha level, then the algorithm terminates. If the null hypothesis of independence is rejected, then the covariate with the strongest association to Y is selected (that is, the $X_j$ with the lowest p-value).
2. Split the covariate into two disjoint sets using a permutation test to find the optimal binary split with the maximum discrepancy between the samples. Note that other splitting criteria could be used.
3. Repeat the steps recursively.
Hothorn asserts that compared to GUIDE (Loh 2002) and QUEST (Loh and Shih 1997), other unbiased methods for classification trees, conditional inference trees have similar prediction accuracy, but conditional inference trees are intuitively more appealing, as alpha has the more familiar interpretation of a type I error rate instead of being used solely as a tuning parameter, although it could be used as such.
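As an illustration, a conditional inference tree of this kind can be grown in R with the party package of Hothorn et al.; the following minimal sketch uses the built-in iris data purely for illustration, and the settings shown are not taken from the text:

library(party)

data(iris)
# mincriterion = 1 - alpha: a node is split only when the null hypothesis of
# independence can be rejected at the pre-specified alpha level (here 0.05)
ct <- ctree(Species ~ ., data = iris,
            controls = ctree_control(mincriterion = 0.95))
plot(ct)                            # inspect the fitted binary tree
table(predict(ct), iris$Species)    # resubstitution confusion table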
Much of the recent work on extending classification and regression trees has been on growing ensembles of trees. Bagging, short for bootstrap aggregation, in which many bootstrapped samples are generated from a dataset and a separate tree is grown for each sample, was proposed by Breiman in 1996. This technique has been shown to reduce the variance of the estimator (Breiman 1996). The random split selection proposed by Dietterich also grows multiple trees, but the splits are chosen uniformly at random from among the K best splits (Dietterich 2000). This method can be used either with or without pruning the trees. Random split selection has better predictive accuracy than bagging (Dietterich 2000). Boosting, another competitor to bagging, which iteratively weights the outputs with weights inversely proportional to their accuracy, has excellent predictive accuracy but can degenerate if there is noise in the labels. Ho suggested growing multiple trees where each tree is grown using a fixed subset
of variables (Ho 1998). Predictions were made by averaging the votes across the trees. The predictive ability of the ensemble depends, in part, on low correlation between the trees. Random Forests extends the random subspace method of Ho (1998). Random Forests belong to a class of algorithms called weak learners and are characterized by low bias and high variance. They are an ensemble of simple trees that are allowed to grow unpruned and were introduced by Breiman (Breiman 2001). Random Forests are widely applicable, nonlinear, non-parametric and able to handle mixed data types (Breiman 2001; Strobl et al., 2007; Nicodemus et al., 2010). They are faster than bagging and boosting and are easily parallelized. Further, they are robust to missing values, scale invariant, resistant to over-fitting and have high predictive accuracy (Breiman 2001). Random Forests also provide a ranking of the predictor variables in terms of their relative importance to the model. A single tree is unstable, producing different trees for mild changes within the data. Together, bagging, predictor subsampling and averaging across all trees help to prevent over-fitting and increase stability. Briefly, Random Forests can be described by the following algorithm:
1. Draw a large number of bootstrapped samples from the original sample (the number of trees in the forest will equal the number of bootstrapped samples).
2. Fit a classification or regression tree on each bootstrapped sample. Each tree is maximally grown without any pruning, where at each node a randomly selected subset of size mtry of the p possible predictors is selected (where mtry < p) and the best split is calculated only from this subset. If mtry = p, then the procedure is termed bagging and is not considered a Random Forest. Note, one could also use a random linear combination of the subset of inputs for splitting as well.
3. Prediction is based on the out-of-bag (OOB) average across all trees. The OOB samples are the observations that are not used to grow a given tree (roughly 1/3 of the observations) and can be used to test the tree grown. That is, for each pair $(x_i, y_i)$ in the training sample, select only the trees that were not trained on the pair and average across these trees.
The additional randomness added by selecting a subset of parameters at random instead of splitting on all possible parameters releases Random Forests from the small n, large p problem (Strobl et al., 2007), allows the algorithm to be adaptive to the data and reduces correlation among the trees in the forest (Ishwaran 2007). The accuracy of a Random Forest depends on the strength of the individual trees and the level of correlation between the trees (Breiman 2001). Averaging across all trees in the forest allows for good predictive accuracy and low generalization error.
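The algorithm above maps directly onto the R randomForest package; the following minimal sketch, again using the built-in iris data purely for illustration, shows where ntree (the number of bootstrapped trees) and mtry (the size of the random predictor subset) enter:

library(randomForest)

set.seed(42)
data(iris)
p <- ncol(iris) - 1                       # number of potential predictors
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 5000,              # number of bootstrapped trees
                   mtry = floor(sqrt(p)),     # random subset size at each node
                   importance = TRUE)         # also compute variable importance
print(rf)   # the printout includes the OOB estimate of the error rate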
3 Use in biomedical applications
Random Forests are increasingly popular in the biomedical community and enjoy good predictive success, even against other machine learning algorithms, in a wide variety of applications (Lunetta et al., 2004; Segal et al., 2004; Bureau et al., 2005; Diaz-Uriarte and Alvarez de Andes 2006; Qi, Bar-Joseph and Klein-Seetharaman 2006; Xu et al., 2007; Archer and Kimes 2008; Pers et al., 2009; Tuv et al., 2009; Dybowski, Heider and Hoffman 2010; Genuer et al., 2010). Random Forests have been used in HIV disease to examine phenotypic properties of the virus. Segal et al. used Random Forests to examine the role of mutations in polymerase in HIV-1 in viral replication capacity (Segal et al., 2004). Random Forests have also been used to predict HIV-1 coreceptor usage from sequence data (Xu et al., 2007; Dybowski et al., 2010). Qi et al. found that Random Forests had excellent predictive capabilities in the prediction of protein interactions compared to six other machine learning methods (Qi et al., 2006). Random Forests have also been found to have favorable predictive characteristics in microarray and genomic data (Lunetta et al., 2004; Bureau et al., 2005; Lee et al., 2005; Diaz-Uriarte and Alvarez de Andes 2006). These applications, in particular, use Random Forests as a prediction method and as a filtering method (Breiman 2001; Lunetta et al., 2004; Bureau et al., 2005; Diaz-Uriarte and Alvarez de Andes 2006). To compare several machine learning algorithms without bias, a game was devised where bootstrapped samples from a dataset were given to players who used different machine learning strategies, specifically Support Vector Machines, LASSO, and Random Forests, to predict an outcome. Model performance was gauged by a separate referee using a strictly proper scoring rule. In this setup, Pers et al. found that Random Forests had the lowest bootstrap cross-validation error compared to the other algorithms (Pers et al., 2009).
4 Variable importance in Random Forests
While variable importance in a general setting has been studied (van der Laan 2006), we will examine it in the specific framework of Random Forests. In the original formulation of CART, variable importance was defined in terms of surrogate variables, where the variable importance looks at the relative improvement, summed over all of the nodes, of the primary variable versus its surrogate. There are a number of variable importance definitions for Random Forests. One could simply count the number of times a variable appears in the forest, as important variables should be in many of the trees. But this would be a naïve estimator, because the information about the hierarchy of the tree, where naturally the most important variables are placed higher, is lost. On the other hand, one could look only at the primary splitters of each tree in the forest and count the number of times that a variable is the primary splitter. A more common variable importance measure is Gini Variable Importance (GVI), which is the sum of the Gini impurity decrease for a particular variable over all trees. That is, Gini variable importance is a weighted average of a particular variable's improvement of the tree using the Gini criterion across all trees. Let $N$ be the number of observations at node $j$, let $N_R$ and $N_L$ be the number of observations of the right and left daughter nodes after splitting, and let $d_{ij}$ be the decrease in impurity produced by variable $X_i$ at the $j$th node of the $t$th tree. If $Y$ is categorical, then the Gini index is given by $\hat{G} = 2\hat{p}(1-\hat{p})$, where $\hat{p}$ is the proportion of 1's in the sample. So in this case,
$$d_{ij} = \hat{G} - \frac{N_L}{N}\hat{G}_L - \frac{N_R}{N}\hat{G}_R,$$
where $\hat{G}_L$ and $\hat{G}_R$ are the Gini indexes of the left and right nodes, respectively. The Gini Variable Importance of variable $X_i$ is defined as
$$GVI(X_i) = \frac{1}{T}\sum_{t=1}^{T}\sum_{j} d_{ij} I_{ij},$$
where $I_{ij}$ is an indicator variable for whether the $i$th variable was used to split node $j$. That is, it is the average of the Gini importance over all trees, $T$.
Permutation variable importance (PVI) is the difference in predictive accuracy using the original variable and a randomly permuted version of the variable. That is, for variable $X_i$, count the number of correct votes using the out-of-bag cases, then randomly permute the same variable and count the number of correct votes using the out-of-bag cases. The difference between the number of correct votes for the unpermuted and permuted variables, averaged across all trees, is the measure of importance:
$$PVI(X_i) = \frac{1}{T}\sum_{t=1}^{T}\left(C_t - \tilde{C}_{t,i}\right),$$
where $C_t$ is the number of correct out-of-bag votes for tree $t$ and $\tilde{C}_{t,i}$ is the number of correct out-of-bag votes after permuting the $X_i$ variable for tree $t$.
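Both measures can be extracted from a fitted randomForest object; the sketch below continues the illustrative example from Section 2 (the object rf is from that example):

# type = 1: permutation variable importance (mean decrease in accuracy)
pvi <- importance(rf, type = 1)
# type = 2: Gini variable importance (mean decrease in node impurity)
gvi <- importance(rf, type = 2)
varImpPlot(rf)   # plots both rankings side by side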
Strobl et al. (Strobl et al., 2008) suggested a conditional permutation variable importance measure for when variables are highly correlated. Realizing that if there exists correlation within the X's, the variable importance for these variables could be inflated, as the construction of variable importance measures departs from independence of the variable $X_i$ from the outcome $Y$ and also from the remaining predictor variables $X_{(-i)}$, they devised a new conditional permutation variable importance measure. Here $X_{(-i)}$ denotes the remaining covariates, not including $X_i$; in other words, $X_{(-i)} = \{X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_p\}$. The new measure is obtained by conditionally permuting values of $X_i$ within groups of covariates $X_{(-i)}$, which are held fixed. One could use any partition for conditioning, or use the partition already generated by the recursive partitioning procedure. Further, one could condition on all variables $X_{(-i)}$ or include only those variables whose correlation with $X_i$ exceeds a certain threshold. The main drawback of this variable importance scheme is its computational burden. Ishwaran (Ishwaran 2007) carefully studied variable importance with highly correlated variables using a simpler definition of variable importance. Variable importance was defined as the difference in prediction error using the original variable and a random node assignment after the variable is encountered. Two-way interactions were examined via jointly permuted variable importance. This method allows for the explicit ranking of the interactions in relation to all other variables in terms of their relative importance, even in the face of correlation. However, for large p, examining all two-way variable importance measures would be computationally infeasible. Tuv et al. (Tuv et al., 2009) take a random permutation of each potential predictor, generate a Random Forest from the augmented data, and compare the variable importance scores to the original scores via the t-test. Surrogate variables are eliminated by the generation of gradient boosted trees. Then, by iteratively selecting the top variables on the variable importance list and re-running Random Forests, they were able to obtain smaller and smaller numbers of predictors.
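The artificial-contrast idea can be sketched as follows: augment the data with a randomly permuted "shadow" copy of each predictor, fit a forest, and test whether a real variable's importance exceeds that of the shadows across replicated forests. This is a simplified illustration of the idea only, not Tuv et al.'s full procedure (which also uses gradient boosted trees for redundancy elimination); the function and variable names are hypothetical:

library(randomForest)

shadow_importance <- function(x, y, nrep = 10, ntree = 1000) {
  sapply(seq_len(nrep), function(r) {
    shadows <- as.data.frame(lapply(x, sample))     # permute each column
    names(shadows) <- paste0("shadow_", names(x))
    rf <- randomForest(cbind(x, shadows), y, ntree = ntree, importance = TRUE)
    importance(rf, type = 1)[, 1]                   # permutation importance
  })
}

set.seed(1)
imp <- shadow_importance(iris[, 1:4], iris$Species)
shadow_rows <- grep("^shadow_", rownames(imp))
ref <- apply(imp[shadow_rows, ], 2, max)   # per-replicate maximum shadow score
for (v in rownames(imp)[-shadow_rows])     # one-sided t-test per real variable
  print(t.test(imp[v, ], ref, alternative = "greater")$p.value)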
5 Other issues in variable importance in Random Forests
Because Random Forests are often used as a screening tool based on the results of the variable importance ranking, it is important to consider some of the properties of the variable importance measures, especially under various assumptions.
5.1 Different measurement scales
In the original implementation of CART, Breiman noted that the Gini index was biased towards variables with more possible splits (Breiman et al., 1984). When data types are measured on different scales, such as when some variables are continuous while others are categorical, it has been found that Gini importance is biased (Strobl et al., 2008; Breiman et al., 1984; White and Liu 1994; Hothorn et al., 2006; Strobl et al., 2007; Sandri and Zuccolotto 2008). In some cases the importance of suboptimal variables could be artificially inflated in these scenarios. Strobl et al. found that using permutation variable importance with subsampling without replacement provided unbiased variable selection (Strobl et al., 2007). In simulation studies, Strobl (Strobl et al., 2007) shows that the Gini criterion is strongly biased with mixed data types and proposed using a conditional inference framework for constructing forests. Further, they show that under the original implementation of Random Forests, permutation importance is also biased. This bias was diminished when using conditional inference forests and when subsampling was performed without replacement. Because of this bias, permutation importance is now the default importance measure in the random forest package in R (Breiman 2002).
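In R, the conditional inference forests of Strobl et al. are available through the party package; the following is a minimal sketch of the recommended unbiased setup (subsampling without replacement is the default of cforest_unbiased; the data and settings are illustrative only):

library(party)

set.seed(7)
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 500, mtry = 2))
vi <- varimp(cf)               # (unconditional) permutation variable importance
sort(vi, decreasing = TRUE)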
5.2 Correlated predictors
Permutation variable importance rankings have been found to be unstable when filtering Single Nucleotide Polymorphisms (SNPs) by variable importance (Nicodemus et al., 2007; Calle and Urrea 2010). The notion of stability, in this case, is that the genes on the "important" lists remain constant throughout multiple runs of the Random Forests. Genomic data such as microarray data and sequence data often have high correlation among the potential predictor variables. Several studies have shown that high correlation among the potential predictor X's poses problems for variable importance measures in Random Forests (Strobl et al., 2008; Nicodemus and Malley 2009; Nicodemus et al., 2010). Nicodemus found that there is a bias towards uncorrelated predictors and that there is a dependence on the size of the subset sample mtry (Nicodemus and Malley 2009). Computer simulations have found that surrogates (highly correlated variables) are often within the set of highly ranked important variables but that these variables are unlikely to be on the same tree. In a sense, these variables compete for selection into a tree. This competition diminishes their impact on the variable importance scores. The ranking procedure based on Gini and permutation importance cannot distinguish between the correlated predictors. In simulations, when the correlation between variables is less than 0.4, any variable importance measure appears to work well, with the true variables being among the top listed variables in the variable
importance ranking with multiple runs of the Random Forest. Using Gini variable importance, variables with correlations less than 0.5 appear to have minimal impact on the size of the variable importance ranking list that includes the variables that are truly related to the outcome. The graph below shows how large the variable importance list has to be to recover 10 true variables among 100 total variables, 90 of which are random noise and independent of the outcome variables, under various levels of correlation among the predictors, using Gini variable importance (GVI) and permutation variable importance (PVI).
This result is similar to that found by Archer and Kimes, showing that Gini variable importance is stable under moderate correlation, in that the true predictor may not be the highest listed among the most important variables but will be among the set of high-valued variables (Archer and Kimes 2008). This result is also consistent with the findings of Nonyane and Foulkes (Nonyane and Foulkes 2008), who compared Random Forests and Multivariate Adaptive Regression Splines (MARS) in simulated genetic data with one true effect, $X_1$, seven correlated but uninformative variables and one covariate Z under six different model structures. They define the true discovery rate as whether $X_1$, the true variable, is listed first, or second to Z, in the variable importance ranking using the Gini variable importance measure. They found that for correlation less than 0.5, the true discovery rate is relatively stable regardless of how one handles the covariate.
Several solutions for correlated variables have been proposed. Sandri and Zuccolotto proposed the use of pseudovariables as a correction for the bias in Gini importance (Sandri and Zuccolotto 2008). In a study of SNPs in linkage disequilibrium, Meng et al. restricted the tree-building algorithm to disallow correlated predictors in the same tree (Meng et al., 2009).
They found that the stronger the degree of association of the predictor to the response, the stronger the effect the correlation has on the performance of the forest. Strobl also found that under strong correlation, conditional inference trees using permutation variable importance had a bias in variable selection (Strobl et al., 2008). To overcome this bias they developed a conditional permutation scheme where the variable to be permuted is permuted conditional on the other correlated variables, which are held fixed. In this set up one can use any partition of the feature space, such as a binary partition learned from a tree, to condition on. Use the recursive partitioning to define the partition and then: 1) compute the OOB prediction accuracy for each tree; 2) for all variables Z to be conditioned on, create a grid; 3) permute $X_i$ within the grid and compute the OOB prediction accuracy; 4) take the difference in accuracy averaged across all trees. Z could be all other variables besides $X_i$, or all variables correlated with $X_i$ with a correlation coefficient higher than a set threshold. Similar to Nicodemus and Malley, they found that permutation variable importance was biased when there exists correlation among the X variables, and this was especially true with small values of mtry (Nicodemus and Malley 2009). They also found that while bias decreases with larger values of mtry, variability increases. In simulations, conditional permutation variable importance still had a preference for highly correlated variables, but less so than standard permutation variable importance. The authors suggest using different values of mtry and a large number of trees so that results with different seeds do not vary systematically. A sketch of this conditional measure as implemented in the party package is given below.
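A minimal sketch of conditional permutation importance as implemented in party::varimp follows; the data, threshold and other settings are illustrative, and the package documentation should be consulted for the exact definition of the conditioning threshold:

library(party)

set.seed(7)
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 500, mtry = 2))
# conditional = TRUE permutes each variable within a grid defined by the
# covariates selected via `threshold`; this is the computationally costly variant
cvi <- varimp(cf, conditional = TRUE, threshold = 0.2)
sort(cvi, decreasing = TRUE)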
In another study, Nicodemus found that permutation variable importance had a preference for uncorrelated variables because correlated variables compete with each other (Nicodemus et al., 2010). They also found that large values of mtry can inflate the importance of correlated predictors for permutation variable importance. They found the opposite effect for conditional variable importance. Further, they found that conditional variable importance measures from Conditional Inference Forests inflated uncorrelated, strongly associated variables relative to correlated, strongly associated variables. They also found that conditional permutation importance was computationally intractable for large datasets. The authors were only able to calculate this measure for n = 500 and for only 12 predictors. They conclude that conditional variable importance is useful for small studies where the goal is to identify the set of true predictors among a set of correlated predictors. In studies such as genetic association studies, where the set of predictors is large, the original permutation-based variable importance may be better suited.
In genomic association studies, often one wants to find the smallest set of non-related genes that are potentially related to the outcome for further study. One method is to select an arbitrary threshold and list the top h variables in the variable importance list. Another approach is to use Random Forests iteratively, feeding in the top variables from the variable importance list as potential predictors and selecting the final model as the one with the smallest error rate given a subset of genes (Diaz-Uriarte and Alvarez de Andes 2006). Genuer et al. used a similar two-stage approach with highly correlated variables, where one first eliminates the lowest variables ranked by importance and then tests nested models in a stepwise fashion, selecting the most parsimonious model with the minimum OOB error rate (Genuer et al., 2010). They found that under high correlation there was high variance in variable importance lists. They proposed that mtry be drawn from the variable ranking distribution and not uniformly across all variables, although this was not specifically
tested. Meng et al. also used an iterative machine learning scheme where the top ranked important variables were assessed using Random Forests and then used as predictors in a separate prediction algorithm (Meng et al., 2007). Specifically, Random Forests was used to narrow the parameter space, and then the top ranked variables were used in a Bayesian network for prediction. They found that using the top 50 SNPs in the variable importance list
as the predictors for a second Random Forest resulted in good variable selection in their simulations, although the generalizability is not known (Meng et al., 2007). A rough sketch of such an iterative scheme is shown below.
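The following sketch keeps the top half of predictors by permutation importance at each pass and tracks the OOB error (classification case). The keep fraction, stopping rule and all names are illustrative choices, not a published algorithm:

library(randomForest)

iterative_rf <- function(x, y, keep = 0.5, min_vars = 2, ntree = 2000) {
  vars <- names(x); history <- list()
  repeat {
    rf <- randomForest(x[, vars, drop = FALSE], y,
                       ntree = ntree, importance = TRUE)
    history[[length(history) + 1]] <-
      list(vars = vars, oob = rf$err.rate[ntree, "OOB"])   # final OOB error
    imp <- sort(importance(rf, type = 1)[, 1], decreasing = TRUE)
    n_keep <- max(min_vars, floor(keep * length(vars)))
    if (n_keep >= length(vars)) break      # no further reduction possible
    vars <- names(imp)[seq_len(n_keep)]    # keep the top-ranked variables
  }
  history   # choose the smallest subset with (near-)minimal OOB error
}

res <- iterative_rf(iris[, 1:4], iris$Species)
sapply(res, function(h) h$oob)   # OOB error at each iteration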
For all Random Forest implementations it is recommended that one:
1. Grow a large forest with a large number of trees (ntree at least 5000).
2. Use a large terminal node size.
3. Try different values of mtry and seeds. Try setting mtry = √mdim as an initial starting value for mtry, where mdim is the number of potential predictors.
4. Run the algorithm repeatedly. That is, create several random forests until the variable importance list appears stable; a sketch of such a stability check follows this list.
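One way to carry out the stability check in recommendation 4 is to grow several forests with different seeds and measure how much the top-h importance lists overlap; h, the number of runs and the data below are illustrative choices:

library(randomForest)

top_h <- function(seed, x, y, h, mtry = floor(sqrt(ncol(x)))) {
  set.seed(seed)
  rf <- randomForest(x, y, ntree = 5000, mtry = mtry, importance = TRUE)
  names(sort(importance(rf, type = 1)[, 1], decreasing = TRUE))[seq_len(h)]
}

lists <- lapply(1:5, top_h, x = iris[, 1:4], y = iris$Species, h = 3)
# fraction of variables common to every run's top-h list; near 1 means stable
length(Reduce(intersect, lists)) / 3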
In using Random Forests for variable selection we can make several recommendations. These recommendations vary by the nature of the data. It is well known that Gini variable importance is biased in its variable selection, thus for most instances we recommend permutation variable importance. Indeed, this is the default in the R package randomForest. If the predictors are all measured on the same scale and are independent, then this default should be sufficient. If the data are of mixed type (measured on different scales), then use Conditional Inference Forests with permutation variable importance, with subsampling without replacement instead of the default bootstrap sampling, as suggested by Strobl 2007. All measures of variable importance are biased under strong correlation, so it is important to test whether the variables are correlated. If there is correlation, then one must assess the goal of the study. If there is high correlation among the X's, p is small and the goal of the study is to find the set of true predictors, then using conditional inference trees and conditional permutation variable importance is a good solution. However, if p is large, using conditional permutation importance may be computationally infeasible and some parameter space reduction will be necessary. In that case, using permutation importance with Random Forests or iterative Random Forests may be better suited for creating a list of important variables.
If there are highly correlated variables and p or n is large, then one can use Random Forests iteratively with permutation variable importance. In this case one selects the top h variables in the variable importance ranking list as predictors for another Random Forest, where h is selected by the user; Meng et al. used the top 50 percent of the predictors. This scenario works best when there is a strong association of the predictors to the outcome (Meng et al., 2007).
7 References
Archer, K and R Kimes (2008) "Empirical characterization of random forest variable
importance measures." Computational Statistics and Data Analysis 52(4):
2249-2260
Bishop, C (1995) Neural networks for pattern recognition Oxford, Clarendon Press
Breiman, L (1996) "Bagging predictors." Machine Learning 24(2): 123-140
Breiman, L (2001) "Random Forests." Machine Learning 45: 5-32
Breiman, L (2001) "Statistical modeling: the two cultures." Stat Science 16: 199-231
Breiman, L (2002) "Manual on setting up, using, and understanding Random Forests V3.1."
Technical Report
Breiman, L., J Friedman, R Olshen and C Stone (1984) Classification and Regression Trees
Belmont, CA, Wadsworth International Group
Bureau, A., J Dupuis, K Falls, K L Lunetta, L B Hayward, T P Keith and P V
Eerdewegh (2005) "Identifying SNPs predictive of phenotype using random forests." Genetic Epidemiology 28: 171-182
Calle, M and V Urrea (2010) "Letter to the editor: stability of random forest importance
measures." Briefings in Bioinformatics 2010
Dasarathy, B (1991) Nearest-neighbor pattern classification techniques Los Alamitos, IEEE
Computer Society Press
Diaz-Uriarte, R and S Alvarez de Andes (2006) "Gene selection and classification of
microarray data using random forests." BMC Bioinformatics 7: 3
Dietterich, T (2000) "An experimental comparison of three methods for constructing
ensembles of decision trees: bagging, boosting and randomization." Machine Learning 40: 139-158
Dybowski, J., D Heider and D Hoffman (2010) "Prediction of co-receptor usage of HIV-1
from genotype." PLOS Computational Biology 6(4): e1000743
Genuer, R., J Poggi and C Tuleau-Malot (2010) "Variable selection using random forests."
Pattern Recognition Letters 31: 2225-2236
George, E I and R E McCulloch (1993) "Variable selection via Gibbs sampling." Journal of
the American Statistical Association 88: 881-889
George, E I and R E McCulloch (1997) "Approaches for Bayesian variable selection."
Statistica Sinica 7: 339-373
Ho, T K (1998) "The random subspace method for constructing decision forests." IEEE
Transactions on Pattern Analysis and Machine Intelligence 20(8): 832-844
Hothorn, T., K Hornik and A Zeileis (2006) "Unbiased recursive partitioning: a conditional
inference framework." Journal of Computational and Graphical Statistics 15(3):
651-674
Ishwaran, H (2007) "Variable importance in binary regression trees and forests." Electronic
Journal of Statistics 1: 519-537
Kitchen, C., R Weiss, G Liu and T Wrin (2007) "HIV-1 viral fitness estimation using
exchangeable on subset priors and prior model selection." Statistics in Medicine 26(5): 975-990
Kuo, L and B Mallick (1999) "Variable selection for regression models." Sankhya B 60: 65-81
Lee, J., J Lee, M Park and S Song (2005) "An extensive comparison of recent classification
tools applied to microarray data." Computational Statistics and Data Analysis 48: 869-885
Loh, W.-Y (2002) "Regression trees with unbiased variable selection and interaction
detection." Statistica Sinica 12: 361-386
Loh, W.-Y and Y.-S Shih (1997) "Split selection methods for classification trees." Statistica
Sinica 7: 815-840
Lunetta, K L., L B Hayward, J Segal and P V Eerdewegh (2004) "Screening large-scale
association study data: exploiting interactions using random forests." BMC Genetics 5: 32
Meng, Y., Q Yang, K Cuenco, L Cupples, A DeStefano and K L Lunetta (2007)
"Two-stage approach for identifying single-nucleotide polymorphisms associated with rheumatoid arthritis using random forests and Bayesian networks." BMC Proceedings 1(Suppl 1): S56
Meng, Y., Y Yu, L Adrienne Cupples, L Farrer and K Lunetta (2009) "Performance of
random forest when SNPs are in linkage disequilibrium." BMC Bioinformatics 10:
78
Nicodemus, K and J Malley (2009) "Predictor correlation impacts machine learning
algorithms: implications for genomic studies." Bioinformatics 25(15): 1884-90
Nicodemus, K., J Malley, C Strobl and A Ziegler (2010) "The behaviour of random forest
permutation-based variable importance measures under predictor correlation." BMC Bioinformatics 11: 110
Nicodemus, K., W Wang and Y Shugart (2007) "Stability of variable importance scores and
rankings using statistical learning tools on single nucleotide polymorphisms (SNPs) and risk factors involved in gene-gene and gene-environment interaction." BMC Proceedings 1(Suppl 1): S58
Nonyane, B and A S Foulkes (2008) "Application of two machine learning algorithms to
genetic association studies in the presence of covariates." BMC Genetics 9: 71
Pers, T., A Albrechtsen, C Holst, T Sorensen and T Gerds (2009) "The validation and
assessment of machine learning: a game of prediction from high-dimensional data." PLoS One 4(8): e6287
Qi, Y., Z Bar-Joseph and J Klein-Seetharaman (2006) "Evaluation of different biological
data and computational classification methods for use in protein interaction prediction." Proteins 63: 490-500
Rakotomamonjy, A (2003) "Variable selection using SVM-based criteria." Journal of
Machine Learning Research 3: 1357-1370
Sandri, M and P Zuccolotto (2008) "A bias correction algorithm for the Gini variable
importance measure in classification trees." Journal of Computational and Graphical Statistics 17(3): 611-628
Segal, M R., J D Barbour and R Grant (2004) "Relating HIV-1 sequence variation to
replication capacity via trees and forests." Statistical Applications in Genetics and Molecular Biology 3: 2
Strobl, C., A Boulesteix, T Kneib, T Augustin and A Zeileis (2008) "Conditional variable
importance for random forests." BMC Bioinformatics 9: 307
Strobl, C., A Boulesteix, A Zeileis and T Hothorn (2007) "Bias in random forest variable
importance measures: illustrations, sources and a solution." BMC Bioinformatics 8:
25
Tuv, E., A Borisov, G Runger and K Torkkola (2009) "Feature selection with ensembles,
artificial variables and redundancy elimination." Journal of Machine Learning Research 10: 1341-1366
van der Laan, M (2006) "Statistical inference for variable importance." International Journal
of Biostatistics 2: 1008
Vapnik, V (1998) Statistical learning theory, Wiley
White, A and W Z Liu (1994) "Bias in information-based measures in decision tree
induction." Machine Learning 15: 321-329
Xu, S., X Huang, H Xu and C Zhang (2007) "Improved prediction of coreceptor usage and
phenotype of HIV-1 based on combined features of V3 loop sequence using random forest." The Journal of Microbiology 45(5): 441-446
31
Biomedical Knowledge Engineering
Using a Computational Grid
Marcello Castellano and Raffaele Stifini
Politecnico di Bari, Bari
Italy
1 Introduction
Bioengineering is an applied engineering discipline that aims to develop specific methods and technologies for a better understanding of biological phenomena and for health solutions to face the problems regarding the sciences of life. It is based on fields such as biology, electronic engineering, information technology (I.T.), mechanics and chemistry (MIT, 1999). Methods of Bioengineering concern the modeling of physiological systems, the description of electric or magnetic phenomena, the processing of data, the design of medical equipment and of materials or tissues, the study of organisms and the analysis of the structure-property link typical of biomaterials or biomechanical structures. Technologies of Bioengineering include biomedical and biotechnological instruments (from elementary components to the most complex hospital systems), prostheses, robots for biomedical uses, artificial intelligence systems, sanitary management systems, information systems, medical informatics and telemedicine (J E Bekelman et al, 2003).
Biomedicine has recently received an innovative impulse through applications of computer science in the Bioengineering field. Medical Informatics, or better Bioinformatics technology, is characterized by the development of automatic applications in the biological sector whose central element is information. There are several reasons to apply computer science in many fields, such as the biomedical one. Advantages such as turn-around time and precision are among the basic improving factors for a job. For example, the identification of the functions of genes has benefited from the application of an automatic system for the analysis of databases containing the results of many microarray experiments, obtaining information on the human genes involved in pathologies (C Müller et al, 2009). With such an approach, regions with specific activities have been identified inside the DNA: different regions exist in the genome, some stretches are the actual genes, while others regulate the functions of the former ones. Other research has been carried out through computational techniques on Functional Genomics, Biopolymers and Proteomics, Biobanks and Cell Factories (M Liebman et al, 2008).
This chapter explores a particularly promising area of technological systems development based on the concept of knowledge. Knowledge is the useful learning result obtained by an information processing activity. Knowledge Engineering regards the integration of knowledge into computer systems in order to solve difficult problems which typically require a high level of human specialization (M C Linn, 1993).
Whereas standalone computer systems have had an important impact in Biomedicine, computer networks are nowadays a technology in which to investigate new opportunities for innovation. The capacity of networks to link so much information allows both improving the already existing applications and introducing new ones; the Internet and the Web are two well known examples. Information-based processes involved in research to discover new knowledge take advantage of the new paradigms of distributed computing systems.
This chapter focuses on the design aspects of knowledge-based computer systems applied to the biomedicine field. The mission is to support the specialist or researcher in solving problems with greater awareness and precision. For this purpose, a framework to specify a computational model will be presented. As an example, an application of the method to the diagnostic process will be discussed in order to specify a knowledge-based decision support system. The solution proposed here is not only to create a knowledge base from the human expert (or from a pool of experts) but to support it using an automatic knowledge discovery process and resources enhancing data, information and collaboration, in order to produce new expert knowledge over time.
Interoperability, resource sharing, security and collaborative computing will emerge, and a computational model based on grid computing will be taken into account in order to discuss an advanced biomedical application. In particular, in the next section a framework for Knowledge Engineering based on a problem solving strategy will be presented. In section 3 the biomedical diagnostic process will be analyzed using the knowledge framework; in particular, the problem, the solution and the knowledge resources will be worked out. In section 4 the design activity of the diagnostic process is presented. Results in terms of system specifications will be shown in terms of Decision Support System architecture, Knowledge Discovery and Grid Enabled Knowledge Application. A final discussion will be presented in the last section.
2 Method for the knowledge
Modeling is a building activity inspired by problem solving for real problems which do not have a unique exact solution. Knowledge Engineering (K-Engineering) deals with computer-system applications which are computational solutions to complex problems that usually ask for a high level of human skill. In this case the human knowledge must be encoded and embedded in knowledge-based applications of computer systems. The K-Engineer builds up a knowledge model useful for an algorithmic description by a structured approach or method. Three macro phases can be distinguished in the knowledge modeling process:
1. Knowledge Identification (K-Identification);
2. Knowledge Specification (K-Specification);
3. Knowledge Refinement (K-Refinement)
These phases can be cyclical, and at times feedback loops are necessary. For instance, the simulation in the third phase can cause changes in the knowledge model (A Th Schreiber, B J Weilinga, 1992). Each phase is composed of specific activities, and for each activity the K-Engineering literature proposes different techniques. Fig. 1 shows the modeling of knowledge based on the problem solving strategy. The proposed framework is applied at different levels of abstraction, from high- to low-level mechanisms (top-down method).
Fig. 1. The knowledge modeling framework composed of phases and activities
The knowledge model is only an approximation of the real world, and it can and must be modified over time.
2.1 Knowledge Identification
A Knowledge Based System (KBS) is a computational system specialized in applications based on knowledge, aiming at reaching a problem solving ability comparable to that of a human expert. The expert can describe many aspects typical of his own way of reasoning, but tends to neglect a significant part of personal abilities which cannot be easily explained. This knowledge, which is not directly accessible, must nevertheless be considered and then drawn out. To mine the tacit knowledge, an application of elicitation techniques can be useful, and the knowledge must be represented using a dynamic model.
The analysis must be inherent to the aims of the planner. On the other hand, representing the complete domain is not useful, so the effort is to identify the problem, in order to focus the domain analysis.
The approach proposed here is based on answering questions that must be taken into account to develop the basic characteristics of the K-Model, as shown in Table 1.
The phase of Knowledge Identification is subject to important considerations that help to better specify the system architecture. Most of the knowledge of a person or of a group is tacit and cannot be articulated, wholly or in part. Therefore, in a knowledge system, the human beings are not simple users but an integral part of the system. The representation is necessarily different from what is represented; it can capture only the most relevant aspects, chosen by the modeler. This difference can cause problems if one wants to use the model for purposes different from the ones its quality allows. Moreover, the difference between the real world and its representation can cause problems of uncertainty and approximation, which are addressed by representing the quality of the relevant knowledge in a realistic way.
What must be represented:
At the epistemological level, identify what should represent the aspects of knowledge that it is necessary to consider for the application to be addressed. In particular, what are its classes and patterns, what are the inferential processes involved, and what is the quality of the relevant knowledge.
Which is the problem:
Identifying the problem to be solved is important to direct the investigation of the relevant knowledge. It will be very important in the next modeling phases.
How the problem can be solved:
It indicates strategies for solving a given problem based on patterns bounded in the application domain.
How to represent:
Modeling derives from the subjective interpretation of the knowledge engineer with regard to the problem to be faced; a mistake is always possible and therefore the knowledge model must be made in a revisable way. Tools and processes for knowledge management have been consolidated; this management can be expressed in several ways: rules, procedures, laws, mathematical formulae, structural descriptions.
Table 1 Knowledge Identification guidelines
The interviewee must have some characteristics related to his life or his belonging to a certain social group; the number of people interviewed must, however, be consistent, so that it is possible to obtain all possible information on the phenomenon.
The conversation between the two parties is not comparable to a normal conversation because the roles are not balanced: the interviewer drives and controls the interview while respecting the freedom of the interviewee in expressing his opinions.
According to the different degrees of flexibility, it is possible to distinguish among:
1. structured interview
2. semi-structured interview
3. unstructured interview
Usually the structured interview is used to investigate a wide phenomenon; the interview is carried out with a questionnaire supplied to a large sample of people, and in this case the hypothesis must be well structured a priori.
The structured interview can be used in a standard way, but at the same time limited knowledge of the phenomenon does not allow the use of a multiple choice questionnaire.
As the number of interviewees decreases, a semi-structured or an unstructured interview can be taken into account.
Knowing the mental patterns and the implicit categorizations makes possible the organization of the information so that it is simpler to use, improving, in that way, the quality of the product.
Through the elicitation analysis it is possible to identify the classification criterion used by the users and to identify the content and the labels of the categories they used. Possible differences in categorization among various groups of interviewees can be seen and controlled.
2.1.3 A draft of the conceptual model
This activity establishes a first formal representation of the knowledge acquired up to now, composed of elements and their relationships. The representation is used to check correctness with the user. It is a formal scheme on which the K-Specification phase will run. The knowledge is represented using a high level description called the conceptual model. This model is called conceptual because it is the result of a survey carried out on the literature and with domain experts for the transfer of the concepts considered useful in the field of study concerned. Fundamental indications about "what it is" and "how to build" the conceptual model are shown in Table 2. Some formalisms are proposed in the literature: semantic networks are used to represent knowledge with a graph structure; frames are data structures which allow grouping, as inside a frame, the information about an entity; an object representation allows joining procedural aspects with declarative aspects in a single formalism; and so on.
What it is not:
• it is not a knowledge base on paper/computer
• it is not an intermediate representation
What it is:
• it is a complete, articulate and permanent representation of the structure of the knowledge of an application domain, both from a static point of view and a dynamic one
• it is a representation independent from implementation
• it is the main result of the activity of knowledge analysis and modeling
How the conceptual model is built (some criteria):
• expressive: the formalism for the conceptual representation allows one to express powerfully all the concepts, relationships and links typical of the application
• economic: synthetic, compact, readable, modifiable
• communicative: easily intelligible
• useful: a support for the analysis and modeling activities
Table 2 The Conceptual Model guidelines
2.2.2 Task analysis
The aim of the task analysis is to identify the "main task" through the analysis of the users' involvement, in order to understand how they execute their work, identifying types and levels:
• how the work is carried on when more people are involved (workflow analysis);
• how a single person works during a day, a week or a month (job analysis);
• which tasks are executed by all the people who could use the product (task list);
• the order in which each user executes the tasks (task sequences);
• which steps and decisions the user chooses to accomplish a task (procedural analysis);
• how to decompose a wide task into subtasks (task hierarchies).
Task analysis offers the possibility to view the needs, display the improvement areas and simplify the evaluation. It can be carried out according to (Mager, 1975; Gagnè, 1970):
• rational analysis - Within the theories of knowledge, it is a procedure which divides a task into simpler abilities, up to reaching the activities that can be executed by every process the task is assigned to. The result of this procedure is a hierarchy of activities with a corresponding hierarchy of execution aims.
• empirical analysis - Within Knowledge Engineering, it indicates a procedure which splits up the activity or task into the executive processes, strategies and meta-cognitive operations which the subject accomplishes during the execution of that task. The result is a sequence, not always ordered, of operations aiming at the realization of the task.
This is an activity of K-Specification. It works on the output of the K-Identification (see Table 1), which specifies the resolution strategy of the problem. For this purpose the task analysis is carried out by the following specific steps: Problem Specification, Activity Analysis, Task Modeling and Reaching of a Solution.
In the Problem Specification phase the problem must be identified, specifying at the conceptual level one or more activities for its realization; these activities will be analyzed in the following step. In Activity Analysis a task is identified by grouping the activities which must be executed to achieve the aim of the task. There are different task hierarchies where the activities can be divided into subtasks. This exercise on the task hierarchy means both specializing every task and studying the task execution on the basis of priorities and temporal lines. Task Modeling builds a model which precisely describes the relationships among
tasks. A model is a logical description of the activities which must be executed to achieve the users' goals. Model-based design aims to specify and analyze interactive software applications at a more semantic level rather than an implementative one. Methods for modeling the tasks are:
• standard: analysis of how tasks should be performed;
• descriptive: analysis of the activities and tasks just as they are really performed.
Task models can be considered according to the following points of view:
• System task model: it describes how the current system implementation states the tasks must be executed;
• Envisioned task model: it describes how the users must interact with the system according to the opinion of the designer;
• User task model: it describes how the tasks must be done in order to reach the goals according to the opinion of the users.
Usability problems can arise when a discrepancy occurs between the user task model and the system task model. The last step in the task analysis (Reaching of a Solution) is devoted to specifying the tasks identified, which are the conceptual building blocks of this analysis. Table 3 shows a formalism to specify the task aim, the technique used for its realization and the result produced by the task execution. Moreover, a procedural description of the task must be carried out using tools based on conceptual building blocks.
Table 3 A Task Description formalism
2.3 Knowledge refinement
The aim of Knowledge Refinement is to validate the knowledge model by using it in a simulation process as much as possible. Moreover, it aims to complete the knowledge base by inserting a more or less complete set of knowledge instances.
3 Analyzing the biomedical diagnostic process: building a model for a knowledge-based system
The case study presented here refers to the diagnostic process. This is a knowledge-rich process, prevalent in the biomedical field, whose aim is to diagnose pathologies starting from the symptoms.
3.1 Knowledge identification
The identification of knowledge in biomedicine has been applied here as described in Table 4, using the framework proposed in the previous section.
Key questions to drive the Knowledge, and the corresponding Methods:
<What to represent>: Elicitation Analysis
<Which is the problem>: Interview
<How it can be solved>: Interview, Elicitation Analysis
<How to represent>: Semantic Network, Elicitation Analysis, Interview
Table 4 The K-Identification for the diagnostic process
3.1.1 Elicitation analysis
In order to create the first reference model of the diagnostic process, the K-Identification starts with the elicitation study. As the final aim is the development of a conceptual model, it can be useful to consider a comparable process in which a living organism is like a perfectly working computer. If a computer problem arises, the operating system signals it to the user. To activate such a process, a warning is necessary, i.e. an error message or a signal of wrong working.
At this point a good computer technician puts in action a diagnostic process, based on the warning, to identify the problem, or in other words the error.
The diagnosis of an organism is similar to the described scenario: the occurrence of a pathology is pointed out to the organism through the signal of one or more symptoms (as already described for the computer errors). The medical diagnostic process is exercised by the specialist (in analogy with the computer technician), who will study the origin of the symptom and its cause, and hence the disease. The arising of a problem can be due to both endogenous and exogenous causes and provokes an alteration which would not normally happen (for instance an alteration in the insulin level produced by the pancreas); this mutation causes a change, a working different from the mechanisms associated with that element (for example the mechanism, thanks to which insulin makes glucose enter the cells for the production of vital energy, changes when an accumulation of glucose in the blood circulation is found: subjects affected by diabetes). What has been learned from the application of the elicitation analysis described above is shown in Table 5.
A stirring up cause…
  Computer: following a diminution of electric tension, the computer is turned off while the hard disk is …
…and a problem arises in the system
  Computer: the starting record is not readable
  Patient: aching lymph nodes
…and makes clear / reveals itself
  Computer: the computer shows the error
  Patient: the patient has throat ache
Table 5 Some elicitation analysis results on the Diagnostic Process