DOI 10.1007/s11030-015-9649-4
FULL-LENGTH PAPER
Exploring different strategies for imbalanced ADME data
problem: case study on Caco-2 permeability modeling
Hai Pham-The 1 · Gerardo Casañola-Martin 2,3,4 · Teresa Garrigues 5 ·
Marival Bermejo 6 · Isabel González-Álvarez 6 · Nam Nguyen-Hai 1 ·
Miguel Ángel Cabrera-Pérez 5,6,7 · Huong Le-Thi-Thu 8
Received: 23 May 2015 / Accepted: 13 November 2015
© Springer International Publishing Switzerland 2015
Abstract In many absorption, distribution, metabolism, and excretion (ADME) modeling problems, imbalanced data can negatively affect the classification performance of machine learning algorithms. Solutions for handling imbalanced datasets have been proposed, but their application to ADME modeling tasks remains underexplored. In this paper, various strategies, including cost-sensitive learning and resampling methods, were studied to tackle the moderate imbalance problem of a large Caco-2 cell permeability database. Simple physicochemical molecular descriptors were utilized for data modeling. Support vector machine classifiers were constructed and compared using multiple comparison tests. Results showed that the models developed on the basis of resampling strategies displayed better performance than the cost-sensitive classification models, especially in the case of oversampling, where misclassification rates for the minority class were 0.11 and 0.14 for the training and test sets, respectively. A consensus model with an enhanced applicability domain was subsequently constructed and showed improved performance. This model was used to predict a set of randomly selected high-permeability reference drugs according to the biopharmaceutics classification system. Overall, this study provides a comparison of numerous rebalancing strategies and demonstrates the effectiveness of oversampling methods for dealing with imbalanced permeability data problems.

Keywords ADME modeling · Caco-2 cell permeability · Biopharmaceutics classification system · Support vector machine · Cost-sensitive learning · Resampling technique

Electronic supplementary material The online version of this article (doi:10.1007/s11030-015-9649-4) contains supplementary material, which is available to authorized users.

Corresponding author: Huong Le-Thi-Thu, ltthuong1017@gmail.com

1 Hanoi University of Pharmacy, 13-15 Le Thanh Tong, Hanoi, Vietnam
2 Departament de Bioquímica i Biologia Molecular, Universitat de València, Burjassot, 46100 Valencia, Spain
3 Unidad de Investigación de Diseño de Fármacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia, Universitat de València, Valencia, Spain
4 Facultad de Ingeniería Ambiental, Universidad Estatal Amazónica, Paso lateral km 2 1/2 via Napo, Puyo, Ecuador
5 Department of Pharmacy and Pharmaceutical Technology, University of Valencia, Burjassot, 46100 Valencia, Spain
6 Department of Engineering, Area of Pharmacy and Pharmaceutical Technology, Miguel Hernández University, 03550 Sant Joan d'Alacant, Alicante, Spain
7 Unit of Modeling and Experimental Biopharmaceutics, Chemical Bioactive Center, Central University of Las Villas, 54830 Santa Clara, Villa Clara, Cuba
8 School of Medicine and Pharmacy, Vietnam National University, 144 Xuan Thuy, Hanoi, Vietnam
Abbreviations
AD Applicability domain
ADME Absorption, distribution, metabolism, and excretion
AURC Area under the ROC curve
BCS Biopharmaceutics classification system
BE Bioequivalence
C Penalty parameter
CD Critical distance
EMA European Medicines Agency
F Bioavailability
FN False negative
FDA US Food and Drug Administration
FP False positive
H class High-permeability class
HIA Human intestinal absorption
IVIVC In vitro–in vivo correlation
MD Molecular descriptor
M-P class Moderate-to-poor permeability class
Papp Apparent permeability coefficient
RBF Radial basis function
ROC Receiver operating characteristic (curve)
SMOTE Synthetic minority oversampling technique
SVM Support vector machine
SVs Support vectors
WHO World Health Organization
Introduction
Issues associated with imbalanced class distribution are frequently encountered in real-world applications of machine learning and data mining methods [1]. As reported in the literature, data imbalance-related issues normally originate from two distinct problems: (i) the interclass imbalance, where the distribution of class labels varies widely, and (ii) the within-class or intraclass imbalance, where the distribution of members within each class is unequal [2]. In most applications, regardless of the degree of imbalance, classifiers tend to learn from the prevalent classes while ignoring the small ones. Consequently, the overall predictions are often biased toward the majority class, and such "apparently" high overall accuracy is meaningless when the minority class is also considered [1].

Numerous strategies have been proposed to handle imbalanced datasets. In general, these can be grouped into two categories based on their rebalancing targets: (i) reconsidering the misclassification cost (cost-sensitive learning) and (ii) creating a more balanced class distribution in the training set (data sampling) [2]. Since cost-sensitive approaches require modification at the algorithm level, resampling methods, which only change the data distribution, are considered more applicable [3]. However, to date there is little evidence showing that one strategy is better than the other. Consequently, a comparative analysis is necessary to draw valid conclusions when applying a strategy.
On the other hand, for the classification of many ADME properties, the collected data often display an imbalanced class distribution. Evidence can be found in reported studies of blood-brain barrier (BBB) penetration [4], adenosine triphosphate (ATP)-binding cassette (ABC) transporters [5], protein binding [4], cytochrome P450 (CYP) enzyme-substrate/inhibitor specificity [6], human intestinal absorption (HIA) [7,8], bioavailability (F), and so on [4]. In most cases, researchers agree that the class imbalance problem is critical in data modeling. Therefore, the management of imbalanced datasets should receive special attention in order to improve and rebalance model performance. Solutions for imbalanced data exist, but their applications in ADME modeling are still limited. An exploratory analysis of the most common strategies reported to deal with the imbalanced ADME data problem is therefore needed.
Among ADME properties, permeability, the capability of a drug to penetrate across the human gastrointestinal tract (GIT), is a key factor governing human intestinal absorption (HIA) [9]. In vitro models, such as MDCK (Madin-Darby canine kidney) and Caco-2 (adenocarcinoma cells derived from colon), are widely used as high-throughput screening (HTS) methods for the permeability assessment of drug-like molecules in the early stages of drug discovery [10]. In particular, the Caco-2 monolayer, which exhibits morphological as well as functional similarities to human intestinal enterocytes, is considered a better surrogate marker for estimating in vivo drug absorption than other epithelial cell cultures [11,12]. Currently, Caco-2 monolayers are recommended by numerous regulatory agencies for the permeability classification of drugs according to the biopharmaceutics classification system (BCS) [11,13,14].

Prior to running in vitro Caco-2 assays, reliable permeability prediction through the use of computational (in silico) tools is very useful, e.g., to prioritize compounds to be tested and to guide structural optimization aimed at improving the absorption profile of lead compounds. However, accurate in silico permeability prediction is one of the most difficult tasks in ADME modeling [12,15,16]. Data reported in the literature show considerable inter- and intra-laboratory variability [17]. We therefore consider that permeability data analysis by classification is preferable to regression methods: by dividing the dataset into two groups (high vs. medium-low permeability), the within-group variability of the experimental data can be greatly reduced.

In our previous studies, we developed various classification models based on a large and heterogeneous Caco-2 permeability database compiled from different sources [12,15]. Unsurprisingly, the results supported our hypothesis that classification models could overcome the data variability problem and accurately estimate experimental in vitro measurements. These studies also demonstrated the potential of quantitative structure-property relationship (QSPR) models for the HIA prediction of compounds that undergo passive transport mechanisms. In this regard, a suitable permeability cut-off value that maximizes the in vitro–in vivo correlation (IVIVC) should be defined [18]. Given that a threshold value plays an arbitrary role that directly affects data skewness and model performance, it is of the utmost importance that imbalance problems be treated beforehand in order to develop accurate permeability classification models [11].
Based on the challenges mentioned above, the main goals of this work are as follows: (i) to investigate the potential of the most common rebalancing strategies (at the data and the algorithm levels) reported in the literature for the prediction of a large and imbalanced Caco-2 permeability database; (ii) to obtain reliable classification models that improve the prediction of high-permeability compounds, taking into account the permeability class definitions of the BCS; and (iii) to corroborate the in silico predictions with the in vitro Caco-2 permeability of 47 drugs belonging to different classes of the BCS. In particular, we briefly review the principles of all the rebalancing strategies, concentrating on the way they should be applied for handling imbalanced permeability data classification problems using a standard machine learning technique, the support vector machine (SVM).
Materials and methods
Data collection and permeability class definition
In this study, a large and heterogeneous database composed of 1116 organic compounds was carefully assembled from more than 320 published articles. The data collection was routinely performed taking into account those factors that could most contribute to the variability of experimental results, as described in the literature [11,12,15,17,18]. Since a number of compounds appeared in two or more in vitro assays, the mean values were employed, excluding those that lay outside the mean ± 2SD (standard deviation) range. In a preliminary inspection of the dataset, 12 compounds with very low or very high molecular weight (MW ≤ 60 or MW ≥ 1400 Da) were excluded from the modeling set because they are most likely to cross the cell membrane via carrier-mediated (active) transport [15]. Furthermore, after calculating molecular descriptors, a set of 14 quaternary amines and a sodium salt that appeared with missing descriptor values was also excluded from the modeling procedure (see Supplemental Table S3). The remaining 1089 molecules, having different physicochemical characteristics such as molecular size, polarizability, hydrophilicity, lipophilicity, and molecular charge, were used to develop global classification models for the prediction of Caco-2 cell permeability.
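The replicate-averaging rule described above (mean of repeated assays, excluding values outside the mean ± 2SD range) can be sketched as follows; the function name and the replicate Papp values are illustrative, not taken from the paper.

```python
import numpy as np

def average_replicates(papp_values):
    """Average replicate Papp measurements for one compound,
    excluding values outside mean +/- 2 SD (rule from the text)."""
    v = np.asarray(papp_values, dtype=float)
    if v.size < 2:
        return float(v.mean())
    m, s = v.mean(), v.std(ddof=1)
    kept = v[np.abs(v - m) <= 2 * s]
    return float(kept.mean())

# Hypothetical replicate measurements (1e-6 cm/s) for one compound
print(average_replicates([18.0, 21.0, 20.0, 19.0, 22.0, 95.0]))  # → 20.0 (95.0 excluded)
```

Note that with only a handful of replicates the 2SD band is wide, so this rule removes only gross outliers.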
Based on the guidance provided by the US Food and Drug Administration (FDA) for the application of in vitro permeability data in the context of the BCS, we defined a high-permeability class that maximizes the fundamental trade-off between in vitro permeability and human oral absorption [14]. In the literature, metoprolol (average apparent permeability Papp = 20 × 10−6 cm/s with an HIA rate of 96 %) is widely used as a reference compound to discriminate high- from low-permeability drugs. However, current definitions of the high-permeability class based on an HIA value of 90 % are arguably too constrained [19]. Therefore, a new threshold was adopted taking into account the lower confidence interval rule (0.8–1.25) commonly recommended by the FDA for assessing the bioequivalence (BE) of drug products (available at http://www.fda.gov/cder/orange/default.htm) [11]. This method was successfully developed by Kim et al. to differentiate between high and low in situ permeability compounds, using the 90 % confidence interval of the ratio of the permeability of the test compound to that of a reference compound such as metoprolol [20]. Finally, 352 compounds were assigned to the high-permeability class (H class), while the remaining 737 compounds belonged to the moderate-to-poor permeability class (M-P class). The latter class outnumbered the H class by more than 2 times. This is a typical case of a moderately imbalanced binary dataset with overlapping classes.
Computational method
Molecular descriptor calculation
Oral absorption is a complex process affected by the physicochemical properties of the drug, the drug formulation, and gastrointestinal physiology factors [9]. Therefore, physicochemical descriptors should be preferentially considered for the construction of classification models. In this study, 115 molecular descriptors (MDs) belonging to 6 different families (constitutional, ring, functional group counts, atom-centered fragments, charge descriptors, and molecular properties and drug-like) were calculated using the SMILES code of each compound as input to the DRAGON software v.6.0 [21]. For the calculation of charge descriptors, a preliminary semi-empirical PM3 structural optimization based on the Polak-Ribiere algorithm implemented in HyperChem v.8.0 [22] was performed for each compound.
Selection of training and test sets
The selection of the training and test sets was performed using k-means cluster analysis (k-MCA) implemented in STATISTICA v.8.0 [23]. First, the dataset was split into k different clusters (k < 20) of the highest possible dissimilarity. The Fisher ratio and the p-level of significance (p < 0.05) were considered to select the optimal number of variables included in the analysis and the number of clusters that represented the structural information of the dataset.

Subsequently, compounds of the training and test sets were randomly selected from the previous clusters. In this procedure, the linkage distance of the members in each cluster was taken into account in such a way that for each compound in the training set there is always a similar compound in the test set. Finally, the dataset was divided into two sets of 871 and 218 compounds, corresponding to the training and test sets, respectively. The test set was never used in the development of the classification models.
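A cluster-based split of this kind can be approximated with standard tools. The paper used STATISTICA's k-MCA, so the scikit-learn KMeans call, the cluster count, the synthetic descriptor matrix, and the 20 % test fraction below are all stand-in assumptions; only the idea of drawing the test set proportionally from each cluster comes from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical descriptor matrix standing in for the 1089-compound dataset
X = rng.normal(size=(200, 5))

# Cluster the dataset (the paper chose k < 20 via Fisher ratio / p-level;
# k = 8 here is an arbitrary stand-in)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

# Draw ~20 % of each cluster into the test set so that every test compound
# has structurally similar neighbours left in the training set
test_idx = []
for c in np.unique(labels):
    members = np.flatnonzero(labels == c)
    n_test = max(1, round(0.2 * len(members)))
    test_idx.extend(rng.choice(members, size=n_test, replace=False))
test_idx = np.array(sorted(test_idx))
train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
print(len(train_idx), len(test_idx))
```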
Additionally, a set of 47 compounds with experimental Caco-2 permeability was collected to corroborate the computational predictions. Many of them are well-studied drugs that have been commonly used as internal reference standards in Caco-2 cell permeability assays for testing new compounds. The prediction of this set is very useful for demonstrating the suitability and validity of the proposed threshold in the context of the BCS.
Support vector machine and effects of imbalanced data
on model construction
Support vector machines (SVMs) were used for model building in this work. Since the pioneering theoretical studies of Vapnik and Burges [24,25], the SVM technique has gained extensive popularity in many research fields, particularly for modeling ADME and toxicity data. SVM embodies the structural risk minimization (SRM) principle, which takes into account the capacity of the classifier (akin to model complexity) and the trade-off between minimizing the training error and reducing the model complexity [24]. More detailed descriptions of SVM theory can be found throughout the literature; herein, we only briefly describe basic SVM theory and the methods used to resolve the imbalance problem within the SVM algorithm.

Generally, each object in SVM is described by an input vector x_i of k real numbers (features or descriptors), which corresponds to a point in a k-dimensional space. If the training set contains n cases, each case is described as X_Train = (x_i, y_i), i = 1, …, n, where x_i = (x_{i1}, x_{i2}, …, x_{ik}) is the input vector. The response variable y_i = +1/−1 corresponds to the H and M-P classes in this work. Here we applied classification SVM type 1 to find an optimal separating hyperplane (maximum margin) by solving the following primal optimization problem [26]:

min_{w,b,ξ} (1/2) wᵀw + C ∑_{i=1}^{n} ξ_i
subject to y_i (wᵀφ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, n    (1)

where C is the capacity constant, w is the vector of coefficients, b is a constant, and the ξ_i are non-negative slack variables for handling non-separable input data. The term φ(x_i) is a feature function (the counterpart of the kernel function) that maps x_i into a higher-dimensional space. It should be noted that the parameter C, also called the penalty, represents the trade-off between the empirical error and the margin. In theory, when a large value of C is set, the optimization procedure will choose a small-margin optimal separating hyperplane (OSH), so that the final model is less tolerant of errors and keeps the number of misclassifications small. However, increasing C too much can make the model lose its generalization capability and overfit easily. Choosing an appropriate value of C is thus one of the main tasks in developing a good SVM classifier [24]. Equation (1) can be solved using Lagrange multipliers, and the final classification function is

sgn(wᵀφ(x) + b) = sgn(∑_{i=1}^{n} y_i α_i K(x_i, x) + b)    (2)
For linearly separable classes, it is not necessary to introduce slack variables or a feature function; in this case, the optimization problem of Eq. (1) reduces to a simple quadratic function under linear constraints. Conversely, for the classification of linearly non-separable data, a kernel function K(x_i, x) that maps the input variables into a higher-dimensional space is defined in order to determine the shape of the OSH. Several kernel functions have been used, such as the linear, polynomial, radial basis function (RBF), and sigmoid kernels. Among them, the RBF kernel is the most widely used and achieves, almost invariably, good performance [27]:

K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²)    (3)

In this work, the RBF kernel was considered due to its advantages over other kernels [27]. The value of gamma (γ) was set to 1 during model optimization. In order to determine the best value of the parameter C, a selection procedure using 10-fold cross-validation was performed. For the construction of SVM models, we applied the sequential minimal optimization algorithm of Platt (SMO) and libSVM [26,28] implemented in WEKA v.6.0 [29]. The principle of maximal parsimony (Occam's razor) was taken into account; consequently, only classifiers displaying good performance with fewer variables and fewer support vectors (SVs) were selected.
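The C-selection step can be sketched as a 10-fold cross-validated grid search with an RBF kernel and γ = 1, as stated above. The synthetic data and the candidate C grid are assumptions, and scikit-learn stands in for WEKA's SMO/libSVM.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the descriptor data, with a ~2:1 class imbalance
X, y = make_classification(n_samples=300, n_features=10, weights=[0.67, 0.33],
                           random_state=0)

# 10-fold CV over a grid of C values; RBF kernel with gamma fixed at 1
# as in the text (the C grid itself is an assumption)
search = GridSearchCV(
    SVC(kernel="rbf", gamma=1.0),
    param_grid={"C": [0.1, 1, 10, 100]},
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)
best_C = search.best_params_["C"]
print("best C:", best_C)

# Number of support vectors per class -- the paper inspects the SV count
# and ratio between classes as an imbalance indicator
n_sv = search.best_estimator_.n_support_
print("SVs per class:", n_sv)
```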
Various studies have analyzed the negative effect of data imbalance on SVM performance [30,31]. In summary, three main problems have been pointed out. The first is boundary skewness: the imbalanced ratio of the two classes in the training set leaves the minority examples residing away from the "ideal boundary," so the SVM tends to learn a boundary that is too close to these instances and therefore displays imbalanced performance. The second is related to the specification of the constant C, also known as the weakness of soft margins [30]. As can be seen from Eq. (1), the single trade-off C assumed for both unbalanced classes does not account for the different cumulative errors between classes. Veropoulos et al. proposed a strategy of increasing the C associated with the minority class [32]. The last effect is the imbalance of the SV ratio. It is believed that this imbalance can make the classification of a test instance close to the separation hyperplane skew toward the majority class [33]. However, according to the Karush-Kuhn-Tucker (KKT) conditions [24], the α_i terms in Eq. (2), which act as weights in the final decision function, must satisfy

0 ≤ α_i ≤ C and ∑_{i=1}^{n} α_i y_i = 0    (4)

As described in Eq. (4), the α_i values corresponding to the minority class must be larger than those of the majority class. Therefore, the SVs of the smaller class generally receive higher weights than those of the prevalent class, which can partly reduce the effect of imbalance in the SVs [30]. Herein, the classification imbalance was analyzed by comparing the number of, and the ratio between, the SVs corresponding to the H and M-P classes.
Solutions for imbalanced data problem
There are two main approaches for dealing with the imbalanced data problem: algorithm-level (cost-based) strategies and data-level (resampling) strategies.
Algorithmic level strategies
At the algorithmic level, choosing an appropriate inductive bias for the target class is a common strategy. As described above, the penalty constants associated with the classification error of each class should be modified. Consequently, the H class was assigned a higher penalty than the M-P class, and Eq. (1) becomes

min_{w,b,ξ} (1/2) wᵀw + C_H ∑_{y_i=+1} ξ_i + C_{M-P} ∑_{y_i=−1} ξ_i
subject to y_i (wᵀφ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0    (5)

where C_H and C_{M-P} are different penalty constants associated with the error terms of the H class and the M-P class, respectively. For this method, we used the C-SVM algorithm of libSVM integrated in the WEKA environment, where different weight parameters can be added for each class [26]. In this case, we considered the ratio of the number of compounds in each class to estimate suitable weights. As a corollary of this setting, the "weights" parameter was "0.5 − 1," i.e., the penalty parameters for the M-P and H classes were 0.5 × C and 1 × C, respectively (the C optimization procedure was described above).
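The per-class penalties of Eq. (5) map directly onto the class_weight argument of common SVM implementations, which multiplies C separately for each class. A minimal sketch on synthetic data; the 2:1 weight mirrors the class ratio, but the concrete numbers and data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Imbalanced toy data: class 1 plays the role of the minority H class
X, y = make_classification(n_samples=400, n_features=8, weights=[0.67, 0.33],
                           random_state=1)

# class_weight multiplies C per class, realizing C_H and C_M-P of Eq. (5):
# here the minority H class gets twice the penalty of the M-P class
plain = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
weighted = SVC(kernel="rbf", gamma=1.0, C=1.0,
               class_weight={0: 1.0, 1: 2.0}).fit(X, y)

# The weighted model typically recovers more minority-class instances
minority = y == 1
recall_plain = (plain.predict(X[minority]) == 1).mean()
recall_weighted = (weighted.predict(X[minority]) == 1).mean()
print(recall_plain, recall_weighted)
```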
Another algorithmic-level method is cost-sensitive learning. In theory, standard classification algorithms assume a balanced data distribution and an equal level of importance for each class. Therefore, standard data mining techniques can be inefficient when trying to predict a minority class in an imbalanced dataset, or when a false negative (FN) is considered more important than a false positive (FP) [34]. Regarding this problem, cost-sensitive classifiers that take a cost matrix into consideration during model building and generate models with the lowest expected cost were employed. The cost matrix can be seen as an overlay on the standard confusion matrix. In this case, the user can control the FN and FP rates by increasing the misclassification cost of minority (H) class compounds with respect to the cost of misclassifying the majority (M-P) class. MetaCost was selected to build models based on the criteria of a previous study [35]. This method relabels the training samples with their minimum-expected-cost classes, and then rebuilds the models on the modified training set. For WEKA-SVM, using the BuildLogisticModels procedure is recommended for better estimation of probabilities [34].

We gradually augmented the FN cost in 20 stages from 1.5 to 2.5, while maintaining a cost of 1 for FP, and subsequently compared the resulting FN rates with the original SVM performance without any cost modification. The differences increased up to 10 % for the cross-validation model, which coincided with an FN cost of 1.8. The misclassification cost of the H class was therefore set to 1.8, while that of the M-P class was kept at 1.
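The MetaCost relabeling idea described above can be sketched as follows. MetaCost proper estimates class probabilities by bagging; a single logistic model stands in here, the data are synthetic, and only the cost values (FN = 1.8, FP = 1) come from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# COST[i, j] = cost of predicting class j when the true class is i
# (class 1 = H); FN (actual H, predicted M-P) costs 1.8, FP costs 1
COST = np.array([[0.0, 1.0],
                 [1.8, 0.0]])

X, y = make_classification(n_samples=300, n_features=6, weights=[0.67, 0.33],
                           random_state=2)

# Step 1: estimate class probabilities (the paper recommends logistic
# probability estimates for WEKA-SVM; plain logistic regression stands in)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)

# Step 2: relabel every training sample with its minimum-expected-cost class
expected_cost = proba @ COST          # shape (n_samples, n_classes)
y_relabel = expected_cost.argmin(axis=1)

# Step 3: rebuild the final model on the relabelled training set
final = LogisticRegression(max_iter=1000).fit(X, y_relabel)
print("samples relabelled toward H:", int(((y_relabel == 1) & (y == 0)).sum()))
```

Because the FN cost exceeds the FP cost, the relabeling threshold for the H class drops below 0.5, pushing borderline samples toward the minority class.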
Data level strategies
At the data level, several approaches to resample (over- or undersample) the original data were used in this work. Resampling approaches are well known to be more flexible than cost-based methods [36]. However, resampling has some limitations that one should take into account before developing computational models. First, in medicinal chemistry, changing the number of instances can lead to different ADs in the predictive space, especially with undersampling, which can affect the robustness of in silico models. Second, since the optimal class distribution of the training data is usually unknown, learning from a forced sample distribution may not reflect the real-world population distribution, which is usually assumed to be randomly drawn. Nevertheless, resampling continues to be the most widely used strategy because it can approximate the target population (rebalanced sample) better than the original (biased sample), even though random effects can no longer be assumed [30].

For the undersampling approach, a random subsample of the dataset was obtained using the supervised SpreadSubsample filter in WEKA. This technique allows the user to specify the maximum "spread" between the rarest and the most common class, meaning that the selected subsample not only can achieve a desirable uniform distribution of the two classes, but also preserves the relationship among the classes in the original dataset (the maximum class distribution spread). Therefore, "in theory," this method may be useful for rebalancing the data distribution without throwing away valid instances. In this work, after choosing a distribution spread of 1, the training set was divided equally into the two classes. Instance weights were not considered, so this factor cannot influence the global error.
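A minimal stand-in for WEKA's SpreadSubsample with a distribution spread of 1, shown on the 737/352 class counts of the full dataset for illustration; the function is a sketch of the filter's behavior, not its actual implementation.

```python
import numpy as np

def spread_subsample(y, max_spread=1.0, seed=0):
    """Randomly drop majority-class instances until
    count(majority) <= max_spread * count(minority),
    mimicking WEKA's SpreadSubsample with distributionSpread = 1."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    n_keep = int(max_spread * counts.min())
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        if c != minority and len(idx) > n_keep:
            idx = rng.choice(idx, size=n_keep, replace=False)
        keep.extend(idx)
    return np.array(sorted(keep))

# 737 M-P (label 0) vs 352 H (label 1), as in the dataset class ratio
y = np.array([0] * 737 + [1] * 352)
idx = spread_subsample(y, max_spread=1.0)
print(np.bincount(y[idx]))  # → [352 352]
```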
Since undersampling may cause information loss, other approaches that keep the majority class, or at least remove fewer instances, are desirable. A common strategy is based on multiplication of the minority class, namely the oversampling method. However, the main drawback of this method is the possible overfitting problem [30]. Therefore, a sophisticated method proposed by Chawla et al. [37], namely SMOTE (Synthetic Minority Oversampling Technique), was used in this work to preprocess the data. This approach generates synthetic samples of the minority class on the basis of a similarity principle. In brief, each new minority sample is created by interpolating between a minority instance and one of its nearest minority neighbors, and therefore has a similar structure to them [37]. Generally, SMOTE presents some advantages over the random oversampling approach because of its informed properties and its capability to rebalance without causing overfitting. In WEKA, the user can specify the amount of SMOTE and the number of nearest neighbors. In the default configuration, the number of nearest neighbors was kept at 5, and 100 % of SMOTE instances were created.
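The SMOTE interpolation step (Chawla et al.) can be sketched as below. The paper used WEKA's implementation, so this numpy version is illustrative only, with random data standing in for the H-class descriptor vectors; k = 5 and 100 % oversampling match the settings in the text.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: each synthetic sample is a random point on the
    line segment between a minority instance and one of its k nearest
    minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(n)
        j = neighbours[i, rng.integers(min(k, n - 1))]
        lam = rng.random()                      # interpolation coefficient
        synthetic[t] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic

# 100 % SMOTE with 5 neighbours: double the minority class by generating
# as many synthetic H samples as there are real ones
X_H = np.random.default_rng(1).normal(size=(40, 3))
X_new = smote(X_H, n_new=len(X_H), k=5)
print(X_new.shape)  # → (40, 3)
```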
Lastly, we explored another sampling approach in the WEKA filter module, namely Resample. This method refers to a supervised filter that combines undersampling and oversampling, and it works effectively in cases of large and sparse data in which much noise must be eliminated. In the oversampling step, random sampling with replacement is used. This filter can be made to bias the class distribution toward a uniform distribution, for which we selected a bias value of 1. In this step, we resampled a new dataset with the same number of samples as the original training set.
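A rough sketch of what a supervised Resample-style filter with a bias of 1 toward the uniform distribution does: draw, with replacement, a sample of the original size whose class proportions are interpolated between the original and the uniform distribution. This is a simplification of WEKA's actual filter, and the function is illustrative.

```python
import numpy as np

def resample_uniform(y, bias=1.0, seed=0):
    """Sample-with-replacement sketch of WEKA's supervised Resample filter:
    bias = 0 keeps the original class distribution, bias = 1 targets a
    fully uniform one."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    classes, counts = np.unique(y, return_counts=True)
    original = counts / n
    uniform = np.full(len(classes), 1.0 / len(classes))
    target = (1 - bias) * original + bias * uniform
    out = []
    for c, frac in zip(classes, target):
        idx = np.flatnonzero(y == c)
        out.extend(rng.choice(idx, size=round(frac * n), replace=True))
    return np.array(out)

y = np.array([0] * 737 + [1] * 352)      # original class ratio
idx = resample_uniform(y, bias=1.0)
print(len(idx), np.bincount(y[idx]))     # roughly equal class counts
```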
Modeling procedure
In line with the initial purpose, various SVM models were developed to identify imbalance problems and to find optimal solutions. Figure 1 describes our model-building sequence.

As can be seen in Fig. 1, a support vector classifier was first developed on the original imbalanced dataset (Primal-SVM) to identify the problems. The performance of internal calibration (by 10-fold cross-validation) was the main criterion for model selection. The best variable subset was selected using the wrapper method [38]. The negative effect of the imbalanced data distribution was revealed by comparing misclassification rates. After identifying the problem, eight models corresponding to 5 strategies for combating the class imbalance problem were constructed. For the analysis of algorithmic-level methods, two models (Metacost and CSVM) were developed. As described above, Metacost is a classifier implemented in WEKA that reweights the training samples according to the overall cost assigned to each class in the cost matrix. The CSVM model was constructed by increasing the penalty constant associated with the misclassification rate of the H class with respect to that of the M-P class. At the data level, subsampling, oversampling, and a combination thereof were analyzed. For the subsampling method, the number of majority-class (M-P class) instances was halved by applying the SpreadSubsample filter (models SS1 and SS2). The oversampling method was carried out using the SMOTE algorithm, doubling the number of minority H-class instances (SMOTE1 and SMOTE2). The Resample filter of WEKA was applied to combine the previous approaches (RS1 and RS2). The influence of the feature selection procedure on the sampling techniques was analyzed by building two models for each preprocessing filter: the models numbered 1 (SS1, SMOTE1, and RS1) were built directly from the same variable set as Primal-SVM, while the rest (SS2, SMOTE2, and RS2) were developed with new variables selected from the rebalanced data distribution.

All models were subjected to a revalidation process with the original imbalanced training set (871 compounds) to check the AD. Finally, an external test set of 218 compounds was predicted by each model for the robustness analysis of the applied methods.

As an additional point of interest, a consensus system was constructed by voting over the predictions of the 8 obtained models. This simple ensemble method was analyzed according to the imbalance degree of the classification results on the test set. The BCS permeability prediction of the 47 reference drugs is also discussed.
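The consensus step can be sketched as a simple majority vote over the 8 models' class labels; the tie-breaking rule and the example predictions below are assumptions, since the text does not specify them.

```python
import numpy as np

def consensus_vote(predictions):
    """Majority vote over class labels (+1 = H, -1 = M-P) predicted by
    several models; ties are broken toward the H class (an assumption
    not stated in the text)."""
    votes = np.asarray(predictions).sum(axis=0)   # one column per compound
    return np.where(votes >= 0, 1, -1)

# Hypothetical predictions of 8 models for 5 compounds
preds = np.array([
    [ 1,  1, -1, -1,  1],
    [ 1, -1, -1, -1,  1],
    [ 1,  1, -1,  1,  1],
    [ 1, -1, -1, -1, -1],
    [ 1,  1, -1, -1,  1],
    [-1,  1, -1, -1,  1],
    [ 1,  1, -1,  1,  1],
    [ 1, -1, -1, -1,  1],
])
print(consensus_vote(preds))  # → [ 1  1 -1 -1  1]
```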
Wrapper feature selection
In this study, the wrapper approach was used for feature selection. Based on SVM algorithms, two search algorithms were used in sequence for the wrapper: hill-climbing (greedy) search and best-first search. In the first step, we performed a greedy backward search through the entire space of attributes. This method, so-called backward elimination, starts with the full set of features and greedily removes the one whose removal most improves performance or degrades it least (without backtracking) [39]. In theory, going backward from the full set of features may capture interactions among features more easily; however, the main drawback of this algorithm is its expensive computational cost. In order to improve the hill-climbing feature subset selection, a best-first search was subsequently executed on the GreedyStepwise results. Essentially, best-first search selects the best variable from the entire feature space and subsequently adds new variables to the model as long as the new subset still displays a significant improvement [38]. This procedure stops when no improvement is found, and the final variable subset is returned.

Fig. 1 Strategies explored for overcoming the imbalanced Caco-2 data problem. Models were developed with a variable subset selected [a] from the Primal-SVM model or [b] independently of the Primal-SVM model. Asterisk: all sampling techniques were performed only on the training set.

All these feature selection methods were performed using WEKA v.6.0 [29]. For the GreedyStepwise method, backward search was chosen rather than forward.
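The greedy backward half of the wrapper can be sketched as follows; the best-first refinement stage is omitted, the data are synthetic, and the wrapped learner is a scikit-learn RBF SVM with 5-fold CV standing in for WEKA's SMO with 10 folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=3)

def cv_score(features):
    # wrapper criterion: CV accuracy of the wrapped SVM on a feature subset
    return cross_val_score(SVC(kernel="rbf", gamma=1.0), X[:, features], y,
                           cv=5).mean()

# Greedy backward elimination: starting from the full set, repeatedly drop
# the feature whose removal hurts the CV score the least, as long as the
# score does not degrade (no backtracking)
selected = list(range(X.shape[1]))
best = cv_score(selected)
improved = True
while improved and len(selected) > 1:
    improved = False
    scores = [(cv_score([f for f in selected if f != g]), g) for g in selected]
    score, worst = max(scores)
    if score >= best:            # removal keeps or improves performance
        selected.remove(worst)
        best = score
        improved = True
print("kept features:", selected, "CV accuracy:", round(best, 3))
```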
Classifier evaluation: applicability domain and
performance measurements
The AD is an important aspect of the evaluation of all rebalancing approaches, especially when subsampling strategies have been applied. The need to define an AD for the developed models is associated with their ability to generate reliable predictions in terms of chemical structures, physicochemical properties, and mechanisms of action. In this regard, the AD of each model was determined based on three methods (ranges, Euclidean distance, and probability density) integrated in the AmbitDiscovery software [40].

As for performance assessment, global accuracy (Q) is not an appropriate criterion for evaluating classifiers developed from an imbalanced dataset [1]. The main concern is the prediction of high-permeability compounds. For this purpose, seven performance measures derived from the confusion matrix were used (Table 1).
Another criterion to assess classification performance is the Receiver Operating Characteristic (ROC) curve. ROC graphs are two-dimensional plots in which the true positive rate, TPrate = TP/(TP+FN), is plotted on the Y-axis and the false positive rate, FPrate = FP/(FP+TN), on the X-axis, as the decision threshold is varied. Here, we evaluated the
Table 1 Confusion matrix and common performance measures for the classifier evaluation

                     Predicted H class        Predicted M-P class
Actual H class       True positives (TP)      False negatives (FN)
Actual M-P class     False positives (FP)     True negatives (TN)

Sensitivity (Se) = Recall = TP/(TP+FN)
Specificity (Sp) = TN/(TN+FP)
Precision (Pr) = TP/(TP+FP)
Accuracy (Q) = (TN+TP)/(TN+TP+FN+FP)
G-mean = √(Se × Sp)
F-measure = 2 × Pr × Se/(Pr + Se)
Matthews correlation coefficient (MCC) = (TP × TN − FN × FP)/√[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
quality of the classifiers by means of the area under the ROC curve, abbreviated AURC. It has been shown that the AURC is closely related to the well-known Wilcoxon rank-sum statistic [41,42].
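The AURC-Wilcoxon relation cited above can be checked numerically: the area under the ROC curve equals the fraction of (positive, negative) score pairs in which the positive instance is ranked higher, which is the normalized Mann-Whitney (Wilcoxon rank-sum) U statistic. The scores below are arbitrary illustrative values, not data from this study:

```python
# AURC vs. Wilcoxon-Mann-Whitney equivalence: AUC == U / (n_pos * n_neg),
# where U counts (positive, negative) pairs with the positive ranked
# higher (ties counted as 1/2). Scores are made-up illustrative values.
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

auc = roc_auc_score(y_true, y_score)

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# Mann-Whitney U statistic for the positive sample, computed directly
u = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)

print(auc, u / (len(pos) * len(neg)))  # both 0.9166...
```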
Statistical comparison of classifiers
In this study, various strategies have been explored to address the imbalanced-data classification challenge. The obtained models have been analyzed in terms of classification accuracy and rebalancing ability. However, a general comparison between the classifiers is still needed for a better understanding of the improvements brought by each strategy [43]. To this end, several non-parametric statistical tests were performed [44]. The comparison procedure starts with assessing
multiple-comparison statistical tests: the Friedman test [45] and the Iman-Davenport test [46], with the null hypothesis that the classification models do not differ on average. If the null hypothesis is rejected by either test, post-hoc Bonferroni-Dunn tests are subsequently applied at α = 0.05 and 0.10 [47]. Essentially, this test measures an average rank R_j = (1/N) Σ_i r_i^j for each classifier, where r_i^j is the rank of the j-th classifier on the i-th case, and uses a critical distance (CD) to declare ranked classifiers significantly different from the best one [43]. An advantage of the Bonferroni-Dunn test is that it is easy to describe and visualize, because it uses the same CD for all comparisons. The value of CD is computed with the formula:

CD = q_α √[k(k + 1)/(6N)]

where k is the number of classifiers compared, N is the number of ranked cases, and q_α is the Bonferroni-Dunn critical value.
This procedure was performed using in-house software adapted from Demšar's study [43]. More details of the methodology can be found in our previous comparative study of non-linear machine learning techniques [44,48].
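The comparison procedure above can be sketched as follows: a Friedman test over per-case ranks, followed by the Bonferroni-Dunn critical distance. The score matrix and the critical value q_α below are illustrative assumptions (q_α = 2.394 is the two-tailed Bonferroni-Dunn value for k = 4 classifiers at α = 0.05 from Demšar's tables), not values from this study:

```python
# Friedman test over per-case ranks, then the Bonferroni-Dunn critical
# distance CD = q_alpha * sqrt(k(k+1)/(6N)). Scores are illustrative.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# rows = ranked cases (N), columns = classifiers (k); higher is better
scores = np.array([
    [0.78, 0.82, 0.85, 0.84],
    [0.62, 0.71, 0.80, 0.81],
    [0.83, 0.84, 0.85, 0.86],
    [0.73, 0.76, 0.81, 0.82],
    [0.50, 0.51, 0.60, 0.61],
])
N, k = scores.shape

stat, p = friedmanchisquare(*scores.T)

# Average rank per classifier (rank 1 = best, so rank the negated scores)
ranks = np.apply_along_axis(rankdata, 1, -scores)
avg_ranks = ranks.mean(axis=0)

# Assumed two-tailed Bonferroni-Dunn critical value for k = 4, alpha = 0.05
q_alpha = 2.394
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))
print(round(cd, 3))  # → 1.955
```

Classifiers whose average rank lies within CD of the best-ranked model are considered statistically equivalent to it.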
Results and discussion
Identification of the imbalanced data problem with the standard SVM algorithm
The Primal-SVM model was obtained with 10 variables from the original imbalanced training set of 871 compounds. Unsurprisingly, this model was significantly biased toward M-P compounds: 505 of the 589 M-P cases were correctly predicted (Sp of about 86 %), while only 176 instances were correctly classified as H class, representing a sensitivity (Se) of 62.41 % for this class. Although the overall accuracy of this model remained acceptable (Q > 78 %), the classifier cannot be used for screening high-permeable compounds. The relatively low G-mean (about 73 %) reflected this situation. Nevertheless, the selected variables still correlated well with permeability (MCC ∼ 0.5). Note that the model included only 10 variables and, according to the parameter optimization procedure, used a relatively low penalty (C = 320) for classification errors. The class ratio of our data was 589/282, and Primal-SVM showed a similar label bias in its predictions on the dataset (611/260).
The above results showed that the class distribution was the main reason for the low performance. In addition, we did not observe any small clusters in the k-MCA analysis; therefore, a within-class imbalance problem can be ruled out. Inspection of the support vectors (SVs) around the hyperplane revealed a balanced split (of 455 SVs, 215 belonged to the H class and 240 to the M-P class), suggesting that the skewed distribution and the overlap between classes are the basic reasons behind the imbalance problem.
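For illustration, the Primal-SVM figures quoted above can be reproduced from confusion-matrix counts implied by the reported totals (TP = 176, FN = 106, FP = 84, TN = 505, taking H as the positive class; the FP/TN split is our reconstruction from the 589/282 class ratio and 611/260 prediction ratio, not counts given explicitly in the paper):

```python
# Recompute the Primal-SVM performance measures from confusion-matrix
# counts reconstructed from the reported totals; H is the positive class.
from math import sqrt

TP, FN, FP, TN = 176, 106, 84, 505

se = TP / (TP + FN)                      # sensitivity (recall)
sp = TN / (TN + FP)                      # specificity
q = (TP + TN) / (TP + TN + FP + FN)      # overall accuracy
g = sqrt(se * sp)                        # G-mean
mcc = (TP * TN - FN * FP) / sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(f"Se={se:.2%} Sp={sp:.2%} Q={q:.2%} G={g:.2%} MCC={mcc:.2f}")
# → Se=62.41% Sp=85.74% Q=78.19% G=73.15% MCC=0.49
```

These values match the Se, Q, G-mean, and MCC figures quoted in the text.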
Solutions at algorithmic level
As described above, the Metacost classifier was developed by modifying the misclassification costs associated with each class. The performance of this model improved slightly with respect to the Primal-SVM model: the accuracy of H-class prediction was 71.28 %, an increase of nearly nine percentage points over the first model (see Table 2). Using the same variable subset, only 193 boundary SVs (20 % of the training set) were found, suggesting that Metacost is the simpler classifier. However, precision remained relatively weak, so the F-measure was still low.
On the other hand, CSVM appeared to be effective in rebalancing the prediction results: the TP rates were 77.30 and 77.08 % for the H and M-P classes, respectively. However, such a large gain in recall (Se) came at the expense of precision (Pr), which dropped to 61 %, and the overall accuracy did not improve. Compared with Metacost, CSVM performed slightly better. In general, at the algorithmic level, Metacost and CSVM were tolerable; however, the overall improvements were not significant, and there was a trade-off between the Se and Pr measurements.
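The cost-sensitive idea behind CSVM can be sketched with scikit-learn's class_weight option, which scales the SVM penalty C per class. The toy dataset and the 10:1 weight below are illustrative assumptions, not the settings used in this study:

```python
# Cost-sensitive SVM sketch: a larger penalty on minority-class errors
# shifts the decision boundary toward the majority class, trading
# precision for recall on the minority class.
from sklearn.metrics import recall_score
from sklearn.svm import SVC

# Tiny 1-D toy set: class 1 is the minority and overlaps class 0.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]

plain = SVC(kernel="linear", C=1.0).fit(X, y)
# Errors on class 1 cost 10x more than errors on class 0.
costly = SVC(kernel="linear", C=1.0, class_weight={1: 10.0}).fit(X, y)

rec_plain = recall_score(y, plain.predict(X))
rec_costly = recall_score(y, costly.predict(X))
```

As in the CSVM results above, the class-weighted model raises (or at worst preserves) minority-class recall, while precision typically drops.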
Solutions at data level
Subsampling method
First, 307 compounds belonging to the M-P class were excluded from the training set. The remaining 564 compounds have the two classes equally distributed. The same variables previously selected for Primal-SVM were used to develop SS1, while for SS2 a new variable selection was performed (Table 2).

In comparison with Primal-SVM, the performances of the SS1 and SS2 models were better rebalanced, since the distribution of the data had been changed. However, the difference between the variable subsets selected by the SS1 and SS2 models clearly affected model performance. The classification accuracy for the M-P class by the SS1 model was very low (72.70 %) compared with Primal-SVM (85.74 %); the overall accuracy was therefore lower than that of Primal-SVM. Meanwhile, SS2 performance improved significantly over Primal-SVM. A further analysis of the validation results of these two models on the overall training and test sets is necessary to confirm the robustness of the obtained models, because 307 "hidden" compounds were left out of the current training process.
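Random undersampling as applied here can be sketched in a few lines: drop randomly chosen majority-class samples until both classes are the same size. The class sizes match the training set described above (589 M-P / 282 H), but the descriptor matrix is a placeholder:

```python
# Random undersampling sketch: remove majority-class samples at random
# until both classes are equal in size (589 -> 282, dropping 307 cases).
import numpy as np

rng = np.random.default_rng(0)

y = np.array([0] * 589 + [1] * 282)      # 0 = M-P (majority), 1 = H
X = rng.normal(size=(len(y), 5))         # placeholder descriptors

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Keep only as many majority samples as there are minority samples.
kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
keep = np.sort(np.concatenate([kept_maj, min_idx]))

X_bal, y_bal = X[keep], y[keep]
print(len(y_bal), int((y_bal == 0).sum()), int((y_bal == 1).sum()))
# → 564 282 282
```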
Table 2 Performance of SVM models under 10-fold cross-validation following different strategies

Algorithmic level: Metacost (10)  193/871  85.43  71.28  63.81  75.82  67.34  0.51  77.61

a SVMs obtained by the 10-fold cross-validation method; the quantity in parentheses indicates the number of variables included in each classifier
b Ratio between the number of support vectors and the size of the training set (original or resampled). Results of the C optimization procedure by cross-validation: Primal-SVM (320), Metacost (510), CSVM (200 for the H class and 100 for the M-P class), SS1 (1100), SS2 (421.05), SMOTE1 (810), SMOTE2 (540), RS1 (490), and RS2 (572.73)
Oversampling method
After creating 282 new "artificial" H-class cases with the SMOTE method, the training set grew to 1153 compounds. The performance of SMOTE1 and, especially, SMOTE2 was significantly rebalanced and improved in comparison with Primal-SVM (Table 2). The SMOTE2 model exhibited a highly balanced performance, with F-measures >81 % and TPrate >83 %. From a simple visualization, it is clear that oversampling was one of the most promising strategies for treating imbalanced data. Above all, the problem of a reduced AD could be ruled out. However, since these models were trained on an altered distribution, and the FN numbers of SMOTE1 and SMOTE2 (110 and 95, respectively) were similar to that of Primal-SVM (106), revalidation on the original training set distribution is necessary to assess the real discriminative capacity of these models.
Combination of simple subsampling and oversampling strategies: resampling approach
To generate a new balanced distribution, a resampling approach that combines the advantages of subsampling and oversampling was applied. A set of 280 H-class cases was randomly chosen and duplicated, while 411 M-P cases were randomly removed, giving a final balanced dataset of the same size as the original training set (871 compounds). As can be appreciated in Table 2, the two models RS1 and RS2 display similarly high performance. Interestingly, the distributions of support vectors (H vs. M-P classes) and the numbers of selected variables in RS1 and RS2 were the same as in the Primal-SVM model. Additionally, given the high MCC and G-mean values, this strategy should be a promising solution for overcoming the imbalance problem. As with the other data-level techniques, validation on both the original training set and the test set is essential to establish the real effectiveness of this strategy.
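The combined strategy can be sketched as duplicating randomly chosen minority samples while removing randomly chosen majority samples so that the rebalanced set keeps the original size. Class sizes below follow the training set (589 M-P / 282 H); the exact duplication and removal counts used in the study may differ:

```python
# Combined resampling sketch: oversample the minority class (by
# duplication) and undersample the majority class so that the final set
# keeps the original training-set size of 871.
import numpy as np

rng = np.random.default_rng(0)
n_maj, n_min, n_total = 589, 282, 871

maj = np.arange(n_maj)                    # indices of M-P samples
minr = np.arange(n_maj, n_maj + n_min)    # indices of H samples

target = n_total // 2                     # ~435 samples per class
dup = rng.choice(minr, size=target - n_min, replace=False)        # duplicate H
keep_maj = rng.choice(maj, size=n_total - target, replace=False)  # subsample M-P

resampled = np.concatenate([keep_maj, minr, dup])
print(len(resampled))  # → 871
```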
Validation of classification models
Two validation processes were carried out: (i) on the original imbalanced dataset and (ii) on the external test set. In the revalidation on the training set, besides calculating the performance measures, the number of compounds outside the model AD was taken into account. Results of the test set validation are presented in Table 3. A consensus model was also developed. The models developed by each strategy were compared with Primal-SVM. Unsurprisingly, models based on algorithmic modifications showed only a slight improvement, while the models of the other category were all noticeably better. Within the first group, CSVM still performed better than Metacost on the test set. Conversely, subsampling, oversampling, and their combination proved appropriate for overcoming imbalanced data problems. Note that an acceptable number (≤10) of compounds fell outside the ADs determined by all models. Therefore, the possibility of discarding valuable information, which remained our principal concern when applying undersampling techniques, could be ruled out under the currently proposed workflow. An interesting finding was that the variable selection procedure significantly affected the predictive capability of the classification models, even though the cross-validation results of the same algorithm seemed similar. That is, when sampling methods were applied, variable sets selected from the new, balanced data distribution did not outperform the variables selected from the original imbalanced distribution. The 10 variables selected by Primal-SVM showed surprisingly good predictive ability on the test set, especially when an appropriate sampling strategy was applied.

Table 3 Validation performance of obtained models on the test set

a Number of compounds outside the applicability domain (training set/test set)

Note that the two classes of the test set display the same unbalanced distribution as the original training set. Analyzing the prevalence (the balance between TP and TN rates) of SMOTE1 and RS1, we observed a high degree of balance (80.00 %/81.41 % for SMOTE1 and 81.43 %/82.76 % for RS1), suggesting that these two strategies are suitable for the current unbalanced data.
Interestingly, the AURC analysis did not give a clear conclusion about the performance improvement over Primal-SVM (see Tables 2 and 3): there is little change from the first model, although the prevalence of the predictions has been rebalanced. A previous study showed that the proportion of positive to negative instances in the test set does not affect the ROC curve [49]. Here, given the unbalanced distribution of the test set, using the AURC to select an appropriate strategy might be misleading: the AURC may depend not on the class distribution but on the degree of overlap in the data.
Finally, a consensus model was constructed based on a voting mechanism. In general, this model slightly outperforms the best standalone models (SMOTE1 and RS1), with a G-mean of 80.81 % and an F1 of 73.33 %. As displayed in Table 3, this consensus model showed a great advantage over the other classifiers in AD coverage. Furthermore, using such a multiclassifier could eliminate the concern about choosing the wrong solution [44,48].
Statistical comparison of classifiers
In order to illustrate the differences between the obtained models, the performances of all nine SVM models in the test set validation were subjected to multiple-comparison procedures. As the first step, the average rank of each classifier was calculated. Accordingly, the classifiers were ranked as follows: RS1 – RS2 – SMOTE1 – CSVM – SS1 – Metacost – SS2 – Primal-SVM (see Fig. 2). The null hypothesis of the Friedman test was rejected (p = 0.00), so there is a significant difference among the obtained classifiers. The same result was observed with the Iman-Davenport test (p < 0.0005). Subsequently, the post-hoc Bonferroni-Dunn test (at α = 0.05 and 0.10) was applied to reveal which classification models performed equivalently to the best-ranked model (RS1). The resulting CD values were 3.145 for α = 0.05 and 2.884 for α = 0.10.

Fig. 2 Rankings obtained through the Friedman test and graphical representation of the Bonferroni-Dunn procedure, considering RS1 as the control model. Significance levels α = 0.05 and 0.10 are shown as continuous and dotted lines, respectively
Compared with the lowest bar, which corresponds to the best model (RS1), CSVM, SS1, SMOTE1, and RS2 can be considered to have similar performance, since none of them exceeds the critical difference (CD) of the Bonferroni-Dunn test. Additionally, it is possible to identify the models that are significantly worse than the others, namely Primal-SVM and SS2.