DOI 10.1007/s11030-015-9649-4
FULL-LENGTH PAPER
Exploring different strategies for imbalanced ADME data
problem: case study on Caco-2 permeability modeling
Hai Pham-The 1 · Gerardo Casañola-Martin 2,3,4 · Teresa Garrigues 5 ·
Marival Bermejo 6 · Isabel González-Álvarez 6 · Nam Nguyen-Hai 1 ·
Miguel Ángel Cabrera-Pérez 5,6,7 · Huong Le-Thi-Thu 8
Received: 23 May 2015 / Accepted: 13 November 2015
© Springer International Publishing Switzerland 2015
Abstract In many absorption, distribution, metabolism, and excretion (ADME) modeling problems, imbalanced data can negatively affect the classification performance of machine learning algorithms. Solutions for handling imbalanced datasets have been proposed, but their application to ADME modeling tasks remains underexplored. In this paper, various strategies, including cost-sensitive learning and resampling methods, were studied to tackle the moderate imbalance problem of a large Caco-2 cell permeability database. Simple physicochemical molecular descriptors were utilized for data modeling. Support vector machine classifiers were constructed and compared using multiple comparison tests. Results showed that the models developed on the basis of resampling strategies displayed better performance than the cost-sensitive classification models, especially in the case of oversampling, where misclassification rates for the minority class were 0.11 and 0.14 for the training and test sets, respectively. A consensus model with an enhanced applicability domain was subsequently constructed and showed improved performance. This model was used to predict a set of randomly selected high-permeability reference drugs according to the biopharmaceutics classification system. Overall, this study provides a comparison of numerous rebalancing strategies and demonstrates the effectiveness of oversampling methods for dealing with imbalanced permeability data problems.

Keywords ADME modeling · Caco-2 cell permeability · Biopharmaceutics classification system · Support vector machine · Cost-sensitive learning · Resampling technique

Electronic supplementary material The online version of this article (doi:10.1007/s11030-015-9649-4) contains supplementary material, which is available to authorized users.

Corresponding author: Huong Le-Thi-Thu, ltthuong1017@gmail.com

1 Hanoi University of Pharmacy, 13-15 Le Thanh Tong, Hanoi, Vietnam
2 Departament de Bioquímica i Biologia Molecular, Universitat de València, Burjassot, 46100 Valencia, Spain
3 Unidad de Investigación de Diseño de Fármacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia, Universitat de València, Valencia, Spain
4 Facultad de Ingeniería Ambiental, Universidad Estatal Amazónica, Paso lateral km 2 1/2 via Napo, Puyo, Ecuador
5 Department of Pharmacy and Pharmaceutical Technology, University of Valencia, Burjassot, 46100 Valencia, Spain
6 Department of Engineering, Area of Pharmacy and Pharmaceutical Technology, Miguel Hernández University, 03550 Sant Joan d'Alacant, Alicante, Spain
7 Unit of Modeling and Experimental Biopharmaceutics, Chemical Bioactive Center, Central University of Las Villas, 54830 Santa Clara, Villa Clara, Cuba
8 School of Medicine and Pharmacy, Vietnam National University, 144 Xuan Thuy, Hanoi, Vietnam
Abbreviations
AD Applicability domain
ADME Absorption, distribution, metabolism, and excretion
AURC Area under the ROC curve
BCS Biopharmaceutics classification system
BE Bioequivalence
C Penalty parameter
CD Critical distance
EMA European Medicines Agency
F Bioavailability
FN False negative
FDA US Food and Drug Administration
FP False positive
H class High-permeability class
HIA Human intestinal absorption
IVIVC In vitro–in vivo correlation
MD Molecular descriptor
M-P class Moderate-to-poor permeability class
Papp Apparent permeability coefficient
RBF Radial basis function
ROC Receiver operating characteristic (curve)
SMOTE Synthetic minority oversampling technique
SVM Support vector machine
SVs Support vectors
WHO World Health Organization
Introduction
Issues associated with imbalanced class distribution are frequently encountered in real-world applications of machine learning and data mining methods [1]. As reported in the literature, data imbalance-related issues normally originate from two distinct problems: (i) the interclass imbalance, where the distribution of class labels varies widely, and (ii) the within-class or intraclass imbalance, where the distribution of members within each class is unequal [2]. In most applications, regardless of the degree of imbalance, classifiers tend to learn from the prevalent classes while ignoring the small ones. Consequently, the overall predictions are often biased toward the majority class, and such "apparently" high overall accuracy is meaningless when the minority class is also considered [1].

Numerous strategies have been proposed to handle imbalanced datasets. In general, these can be grouped into two categories based on their rebalancing targets: (i) reconsidering the misclassification cost (cost-sensitive learning) and (ii) creating a more balanced class distribution in the training set (data sampling) [2]. Since cost-sensitive approaches require modification at the algorithm level, resampling methods, which only change the data distribution, are considered more applicable [3]. However, to date there is little evidence showing that one strategy is better than the other. Consequently, a comparative analysis is necessary to draw valid conclusions when applying a strategy.
On the other hand, for the classification of many ADME properties, the collected data often display an imbalanced class distribution. Evidence can be found in reported studies of blood-brain barrier (BBB) penetration [4], adenosine triphosphate (ATP)-binding cassette (ABC) transporters [5], protein binding [4], cytochrome P450 (CYP) enzyme-substrate/inhibitor specificity [6], human intestinal absorption (HIA) [7,8], bioavailability (F), and so on [4]. In most cases, researchers agree that the class imbalance problem is critical in data modeling. Therefore, the management of imbalanced datasets should receive special attention in order to improve and rebalance model performance. Solutions for imbalanced data exist, but their applications in ADME modeling are still limited. An exploratory analysis of the most common strategies reported to deal with the imbalanced ADME data problem is therefore needed.
Among ADME properties, permeability, the capability of a drug to penetrate across the human gastrointestinal tract (GIT), is a key factor governing human intestinal absorption (HIA) [9]. In vitro models, such as MDCK (Madin-Darby canine kidney) and Caco-2 (adenocarcinoma cells derived from colon), are widely used as high-throughput screening (HTS) methods for the permeability assessment of drug-like molecules in the early stages of drug discovery [10]. In particular, the Caco-2 monolayer, which exhibits morphological as well as functional similarities to human intestinal enterocytes, is considered a better surrogate marker for estimating in vivo drug absorption than other epithelial cell cultures [11,12]. Currently, Caco-2 monolayers are recommended by numerous regulatory agencies for the permeability classification of drugs according to the biopharmaceutics classification system (BCS) [11,13,14].

Prior to running in vitro Caco-2 assays, reliable permeability prediction through the use of computational (in silico) tools is very useful, e.g., to prioritize compounds to be tested and to guide structural optimization aimed at improving the absorption profile of lead compounds. However, accurate in silico permeability prediction is one of the most difficult tasks in ADME modeling [12,15,16]. Data reported in the literature show considerable inter- and intra-laboratory variability [17]. We therefore consider that permeability data analysis by classification is preferable to regression methods: by dividing the dataset into two groups (high vs. medium-low permeability), the within-group variability of the experimental data can be greatly reduced.

In our previous studies, we developed various classification models based on a large and heterogeneous Caco-2 permeability database compiled from different sources [12,15]. Unsurprisingly, the results supported our hypothesis that classification models could overcome the data variability problem and accurately estimate experimental in vitro measurements. These studies also demonstrated the potential of quantitative structure-property relationship (QSPR) models for the HIA prediction of compounds that undergo passive transport mechanisms. In this regard, a suitable permeability cut-off value that maximizes the in vitro–in vivo correlation (IVIVC) should be defined [18]. Given that a threshold value plays an arbitrary role that directly affects data skewness and model performance, it is of the utmost importance that imbalance problems be treated beforehand in order to develop accurate permeability classification models [11].
Based on the challenges mentioned above, the main goals of this work are as follows: (i) to investigate the potential of the most common rebalancing strategies (at the data and the algorithm levels) reported in the literature for the prediction of a large and imbalanced Caco-2 permeability database; (ii) to obtain reliable classification models that improve the prediction of high-permeability compounds, taking into account the permeability class definitions of the BCS; and (iii) to corroborate the in silico predictions with the in vitro Caco-2 permeability of 47 drugs belonging to different classes of the BCS. In particular, we briefly review the principles of all the rebalancing strategies, concentrating on the way they should be applied for handling imbalanced permeability data classification problems using a standard machine learning technique, the support vector machine (SVM).
Materials and methods
Data collection and permeability class definition
In this study, a large and heterogeneous database composed of 1116 organic compounds was carefully assembled from more than 320 published articles. The data collection was routinely performed taking into account those factors that could most contribute to the variability of experimental results, as described in the literature [11,12,15,17,18]. Since a number of compounds appeared in two or more in vitro assays, the mean values were employed, excluding those that lay outside the mean ± 2SD (standard deviation) range. In a preliminary inspection of the dataset, 12 compounds with very low or very high molecular weight (MW ≤ 60 or MW ≥ 1400 Da) were excluded from the modeling set because they are most likely to cross the cell membrane via carrier-mediated (active) transport [15]. Furthermore, after calculating molecular descriptors, a set of 14 quaternary amines and a sodium salt that appeared with missing descriptor values was also excluded from the modeling procedure (see Supplemental Table S3). The remaining 1089 molecules, having different physicochemical characteristics such as molecular size, polarizability, hydrophilicity, lipophilicity, and molecular charge, were used to develop global classification models for the prediction of Caco-2 cell permeability.
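The replicate-averaging rule described above (mean of repeated assays, excluding values outside the mean ± 2SD range) can be sketched as follows; the function name and the replicate Papp values are illustrative, not taken from the paper.

```python
import numpy as np

def average_replicates(papp_values):
    """Average replicate Papp measurements for one compound,
    excluding values outside mean +/- 2 SD (rule from the text)."""
    v = np.asarray(papp_values, dtype=float)
    if v.size < 2:
        return float(v.mean())
    m, s = v.mean(), v.std(ddof=1)
    kept = v[np.abs(v - m) <= 2 * s]
    return float(kept.mean())

# Hypothetical replicate measurements (1e-6 cm/s) for one compound
print(average_replicates([18.0, 21.0, 20.0, 19.0, 22.0, 95.0]))  # → 20.0 (95.0 excluded)
```

Note that with only a handful of replicates the 2SD band is wide, so this rule removes only gross outliers.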
Based on the guidance provided by the US Food and Drug Administration (FDA) for the application of in vitro permeability data in the context of the BCS, we defined a high-permeability class that maximizes the fundamental trade-off between in vitro permeability and human oral absorption [14]. In the literature, metoprolol (average apparent permeability Papp = 20 × 10−6 cm/s with an HIA rate of 96 %) is widely used as a reference compound to discriminate high- from low-permeability drugs. However, current definitions of the high-permeability class based on an HIA value of 90 % are arguably too constrained [19]. Therefore, a new threshold was adopted taking into account the lower confidence interval rule (0.8–1.25) commonly recommended by the FDA for assessing the bioequivalence (BE) of drug products (available at http://www.fda.gov/cder/orange/default.htm) [11]. This method was successfully developed by Kim et al. to differentiate between high and low in situ permeability compounds, using the 90 % confidence interval of the ratio of the permeability of the test compound to that of a reference compound such as metoprolol [20]. Finally, 352 compounds were assigned to the high-permeability class (H class), while the remaining 737 compounds belonged to the moderate-to-poor permeability class (M-P class). The latter class outnumbered the H class by more than 2 times. This is a typical case of a moderately imbalanced binary dataset with overlapping classes.
Computational method
Molecular descriptor calculation
Oral absorption is a complex process affected by the physicochemical properties of the drug, the drug formulation, and gastrointestinal physiology factors [9]. Therefore, physicochemical descriptors should be preferentially considered for the construction of classification models. In this study, 115 molecular descriptors (MDs) belonging to 6 different families (constitutional, ring, functional group counts, atom-centered fragments, charge descriptors, and molecular properties and drug-like) were calculated using the SMILES code of each compound as input to the DRAGON software v.6.0 [21]. For the calculation of charge descriptors, a preliminary semi-empirical PM3 structural optimization based on the Polak-Ribiere algorithm implemented in HyperChem v.8.0 [22] was performed for each compound.
Selection of training and test sets
The selection of the training and test sets was performed using k-means cluster analysis (k-MCA) implemented in STATISTICA v.8.0 [23]. First, the dataset was split into k different clusters (k < 20) of the highest possible dissimilarity. The Fisher ratio and the p-level of significance (p < 0.05) were considered to select the optimal number of variables included in the analysis and the number of clusters that represented the structural information of the dataset.

Subsequently, compounds of the training and test sets were randomly selected from the previous clusters. In this procedure, the linkage distance of the members in each cluster was taken into account in such a way that for each compound in the training set there is always a similar compound in the test set. Finally, the dataset was divided into two sets of 871 and 218 compounds, corresponding to the training and test sets, respectively. The test set was never used in the development of the classification models.
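A cluster-based split of this kind can be approximated with standard tools. The paper used STATISTICA's k-MCA, so the scikit-learn KMeans call, the cluster count, the synthetic descriptor matrix, and the 20 % test fraction below are all stand-in assumptions; only the idea of drawing the test set proportionally from each cluster comes from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical descriptor matrix standing in for the 1089-compound dataset
X = rng.normal(size=(200, 5))

# Cluster the dataset (the paper chose k < 20 via Fisher ratio / p-level;
# k = 8 here is an arbitrary stand-in)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

# Draw ~20 % of each cluster into the test set so that every test compound
# has structurally similar neighbours left in the training set
test_idx = []
for c in np.unique(labels):
    members = np.flatnonzero(labels == c)
    n_test = max(1, round(0.2 * len(members)))
    test_idx.extend(rng.choice(members, size=n_test, replace=False))
test_idx = np.array(sorted(test_idx))
train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
print(len(train_idx), len(test_idx))
```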
Additionally, a set of 47 compounds with experimental Caco-2 permeability was collected to corroborate the computational predictions. Many of them are well-studied drugs that have been commonly used as internal reference standards in Caco-2 cell permeability assays for testing new compounds. The prediction of this set is very useful for demonstrating the suitability and validity of the proposed threshold in the context of the BCS.
Support vector machine and effects of imbalanced data
on model construction
Support vector machines (SVMs) were used for model building in this work. Since the pioneering theoretical studies of Vapnik and Burges [24,25], the SVM technique has gained extensive popularity in many research fields, particularly for modeling ADME and toxicity data. SVM embodies the structural risk minimization (SRM) principle, which takes into account the capacity of the classifier (akin to model complexity) and the trade-off between minimizing the training error and reducing the model complexity [24]. More detailed descriptions of SVM theory can be found throughout the literature; herein, we only briefly describe basic SVM theory and the methods used to resolve the imbalance problem within the SVM algorithm.

Generally, each object in SVM is described by an input vector x_i of k real numbers (features or descriptors), which corresponds to a point in a k-dimensional space. If the training set contains n cases, each case is described as X_Train = (x_i, y_i), i = 1, …, n, where x_i = (x_{i1}, x_{i2}, …, x_{ik}) is the input vector. The response variable y_i = +1/−1 corresponds to the H and M-P classes in this work. Here we applied classification SVM type 1 to find an optimal separating hyperplane (maximum margin) by solving the following primal optimization problem [26]:

min_{w,b,ξ} (1/2) wᵀw + C ∑_{i=1}^{n} ξ_i
subject to y_i (wᵀφ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, n    (1)

where C is the capacity constant, w is the vector of coefficients, b is a constant, and the ξ_i are non-negative slack variables for handling non-separable input data. The term φ(x_i) is a feature function (the counterpart of the kernel function) that maps x_i into a higher-dimensional space. It should be noted that the parameter C, also called the penalty, represents the trade-off between the empirical error and the margin. In theory, when a large value of C is set, the optimization procedure will choose a small-margin optimal separating hyperplane (OSH), so that the final model is less tolerant of errors and keeps the number of misclassifications small. However, increasing C too much can make the model lose its generalization capability and overfit easily. Choosing an appropriate value of C is thus one of the main tasks in developing a good SVM classifier [24]. Equation (1) can be solved using Lagrange multipliers, and the final classification function is

sgn(wᵀφ(x) + b) = sgn(∑_{i=1}^{n} y_i α_i K(x_i, x) + b)    (2)
For linearly separable classes, it is not necessary to introduce slack variables or a feature function; in this case, the optimization problem of Eq. (1) reduces to a simple quadratic function under linear constraints. Conversely, for the classification of linearly non-separable data, a kernel function K(x_i, x) that maps the input variables into a higher-dimensional space is defined in order to determine the shape of the OSH. Several kernel functions have been used, such as the linear, polynomial, radial basis function (RBF), and sigmoid kernels. Among them, the RBF kernel is the most widely used and achieves, almost invariably, good performance [27]:

K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²)    (3)

In this work, the RBF kernel was considered due to its advantages over other kernels [27]. The value of gamma (γ) was set to 1 during model optimization. In order to determine the best value of the parameter C, a selection procedure using 10-fold cross-validation was performed. For the construction of SVM models, we applied the sequential minimal optimization algorithm of Platt (SMO) and libSVM [26,28] implemented in WEKA v.6.0 [29]. The principle of maximal parsimony (Occam's razor) was taken into account; consequently, only classifiers displaying good performance with fewer variables and fewer support vectors (SVs) were selected.
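The C-selection step can be sketched as a 10-fold cross-validated grid search with an RBF kernel and γ = 1, as stated above. The synthetic data and the candidate C grid are assumptions, and scikit-learn stands in for WEKA's SMO/libSVM.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the descriptor data, with a ~2:1 class imbalance
X, y = make_classification(n_samples=300, n_features=10, weights=[0.67, 0.33],
                           random_state=0)

# 10-fold CV over a grid of C values; RBF kernel with gamma fixed at 1
# as in the text (the C grid itself is an assumption)
search = GridSearchCV(
    SVC(kernel="rbf", gamma=1.0),
    param_grid={"C": [0.1, 1, 10, 100]},
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)
best_C = search.best_params_["C"]
print("best C:", best_C)

# Number of support vectors per class -- the paper inspects the SV count
# and ratio between classes as an imbalance indicator
n_sv = search.best_estimator_.n_support_
print("SVs per class:", n_sv)
```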
Various studies have analyzed the negative effect of data imbalance on SVM performance [30,31]. In summary, three main problems have been pointed out. The first is boundary skewness: the imbalanced ratio of the two classes in the training set leaves the minority examples residing away from the "ideal boundary," so the SVM tends to learn a boundary that is too close to these instances and therefore displays imbalanced performance. The second is related to the specification of the constant C, also known as the weakness of soft margins [30]. As can be seen from Eq. (1), the single trade-off C assumed for both unbalanced classes does not account for the different cumulative errors between classes. Veropoulos et al. proposed a strategy of increasing the C associated with the minority class [32]. The last effect is the imbalance of the SV ratio. It is believed that this imbalance can make the classification of a test instance close to the separation hyperplane skew toward the majority class [33]. However, according to the Karush-Kuhn-Tucker (KKT) conditions [24], the α_i terms in Eq. (2), which act as weights in the final decision function, must satisfy

0 ≤ α_i ≤ C and ∑_{i=1}^{n} α_i y_i = 0    (4)

As described in Eq. (4), the α_i values corresponding to the minority class must be larger than those of the majority class. Therefore, the SVs of the smaller class generally receive higher weights than those of the prevalent class, which can partly reduce the effect of imbalance in the SVs [30]. Herein, the classification imbalance was analyzed by comparing the number of, and the ratio between, the SVs corresponding to the H and M-P classes.
Solutions for imbalanced data problem
There are two main approaches for dealing with the imbalanced data problem: algorithm-level (cost-based) strategies and data-level (resampling) strategies.
Algorithmic level strategies
At the algorithmic level, choosing an appropriate inductive bias for the target class is a common strategy. As described above, the penalty constants associated with the classification error of each class should be modified. Consequently, the H class was assigned a higher penalty than the M-P class, and Eq. (1) becomes

min_{w,b,ξ} (1/2) wᵀw + C_H ∑_{y_i=+1} ξ_i + C_{M-P} ∑_{y_i=−1} ξ_i
subject to y_i (wᵀφ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0    (5)

where C_H and C_{M-P} are different penalty constants associated with the error terms of the H class and the M-P class, respectively. For this method, we used the C-SVM algorithm of libSVM integrated in the WEKA environment, where different weight parameters can be added for each class [26]. In this case, we considered the ratio of the number of compounds in each class to estimate suitable weights. As a corollary of this setting, the "weights" parameter was "0.5 − 1," i.e., the penalty parameters for the M-P and H classes were 0.5 × C and 1 × C, respectively (the C optimization procedure was described above).
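The per-class penalties of Eq. (5) map directly onto the class_weight argument of common SVM implementations, which multiplies C separately for each class. A minimal sketch on synthetic data; the 2:1 weight mirrors the class ratio, but the concrete numbers and data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Imbalanced toy data: class 1 plays the role of the minority H class
X, y = make_classification(n_samples=400, n_features=8, weights=[0.67, 0.33],
                           random_state=1)

# class_weight multiplies C per class, realizing C_H and C_M-P of Eq. (5):
# here the minority H class gets twice the penalty of the M-P class
plain = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
weighted = SVC(kernel="rbf", gamma=1.0, C=1.0,
               class_weight={0: 1.0, 1: 2.0}).fit(X, y)

# The weighted model typically recovers more minority-class instances
minority = y == 1
recall_plain = (plain.predict(X[minority]) == 1).mean()
recall_weighted = (weighted.predict(X[minority]) == 1).mean()
print(recall_plain, recall_weighted)
```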
Another algorithmic-level method is cost-sensitive learning. In theory, standard classification algorithms assume a balanced data distribution and an equal level of importance for each class. Therefore, standard data mining techniques can be inefficient when trying to predict a minority class in an imbalanced dataset, or when a false negative (FN) is considered more important than a false positive (FP) [34]. Regarding this problem, cost-sensitive classifiers that take a cost matrix into consideration during model building and generate models with the lowest expected cost were employed. The cost matrix can be seen as an overlay on the standard confusion matrix. In this case, the user can control the FN and FP rates by increasing the misclassification cost of minority (H) class compounds with respect to the cost of misclassifying the majority (M-P) class. MetaCost was selected to build models based on the criteria of a previous study [35]. This method relabels the training samples with their minimum-expected-cost classes, and then rebuilds the models on the modified training set. For WEKA-SVM, using the BuildLogisticModels procedure is recommended for better estimation of probabilities [34].

We gradually augmented the FN cost in 20 stages from 1.5 to 2.5, while maintaining a cost of 1 for FP, and subsequently compared the resulting FN rates with the original SVM performance without any cost modification. The differences increased up to 10 % for the cross-validation model, which coincided with an FN cost of 1.8. The misclassification cost of the H class was therefore set to 1.8, while that of the M-P class was kept at 1.
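The MetaCost relabeling idea described above can be sketched as follows. MetaCost proper estimates class probabilities by bagging; a single logistic model stands in here, the data are synthetic, and only the cost values (FN = 1.8, FP = 1) come from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# COST[i, j] = cost of predicting class j when the true class is i
# (class 1 = H); FN (actual H, predicted M-P) costs 1.8, FP costs 1
COST = np.array([[0.0, 1.0],
                 [1.8, 0.0]])

X, y = make_classification(n_samples=300, n_features=6, weights=[0.67, 0.33],
                           random_state=2)

# Step 1: estimate class probabilities (the paper recommends logistic
# probability estimates for WEKA-SVM; plain logistic regression stands in)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)

# Step 2: relabel every training sample with its minimum-expected-cost class
expected_cost = proba @ COST          # shape (n_samples, n_classes)
y_relabel = expected_cost.argmin(axis=1)

# Step 3: rebuild the final model on the relabelled training set
final = LogisticRegression(max_iter=1000).fit(X, y_relabel)
print("samples relabelled toward H:", int(((y_relabel == 1) & (y == 0)).sum()))
```

Because the FN cost exceeds the FP cost, the relabeling threshold for the H class drops below 0.5, pushing borderline samples toward the minority class.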
Data level strategies
At the data level, several approaches to resample (over- or undersample) the original data were used in this work. Resampling approaches are well known to be more flexible than cost-based methods [36]. However, resampling has some limitations that one should take into account before developing computational models. First, in medicinal chemistry, changing the number of instances can lead to different ADs in the predictive space, especially with undersampling, which can affect the robustness of in silico models. Second, since the optimal class distribution of the training data is usually unknown, learning from a forced sample distribution may not reflect the real-world population distribution, which is usually assumed to be randomly drawn. Nevertheless, resampling continues to be the most widely used strategy because it can approximate the target population (rebalanced sample) better than the original (biased sample), even though random effects can no longer be assumed [30].

For the undersampling approach, a random subsample of the dataset was obtained using the supervised SpreadSubsample filter in WEKA. This technique allows the user to specify the maximum "spread" between the rarest and the most common class, meaning that the selected subsample not only can achieve a desirable uniform distribution of the two classes, but also preserves the relationship among the classes in the original dataset (the maximum class distribution spread). Therefore, "in theory," this method may be useful for rebalancing the data distribution without throwing away valid instances. In this work, after choosing a distribution spread of 1, the training set was divided equally into the two classes. Instance weights were not considered, so this factor cannot influence the global error.
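A minimal stand-in for WEKA's SpreadSubsample with a distribution spread of 1, shown on the 737/352 class counts of the full dataset for illustration; the function is a sketch of the filter's behavior, not its actual implementation.

```python
import numpy as np

def spread_subsample(y, max_spread=1.0, seed=0):
    """Randomly drop majority-class instances until
    count(majority) <= max_spread * count(minority),
    mimicking WEKA's SpreadSubsample with distributionSpread = 1."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    n_keep = int(max_spread * counts.min())
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        if c != minority and len(idx) > n_keep:
            idx = rng.choice(idx, size=n_keep, replace=False)
        keep.extend(idx)
    return np.array(sorted(keep))

# 737 M-P (label 0) vs 352 H (label 1), as in the dataset class ratio
y = np.array([0] * 737 + [1] * 352)
idx = spread_subsample(y, max_spread=1.0)
print(np.bincount(y[idx]))  # → [352 352]
```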
Since undersampling may cause information loss, other approaches that keep the majority class, or at least remove fewer instances, are desirable. A common strategy is based on multiplication of the minority class, namely the oversampling method. However, the main drawback of this method is the possible overfitting problem [30]. Therefore, a sophisticated method proposed by Chawla et al. [37], namely SMOTE (Synthetic Minority Oversampling Technique), was used in this work to preprocess the data. This approach generates synthetic samples of the minority class on the basis of a similarity principle. In brief, each new minority sample is created by interpolating between a minority instance and one of its nearest minority neighbors, and therefore has a similar structure to them [37]. Generally, SMOTE presents some advantages over the random oversampling approach because of its informed properties and its capability to rebalance without causing overfitting. In WEKA, the user can specify the amount of SMOTE and the number of nearest neighbors. In the default configuration, the number of nearest neighbors was kept at 5, and 100 % of SMOTE instances were created.
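The SMOTE interpolation step (Chawla et al.) can be sketched as below. The paper used WEKA's implementation, so this numpy version is illustrative only, with random data standing in for the H-class descriptor vectors; k = 5 and 100 % oversampling match the settings in the text.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: each synthetic sample is a random point on the
    line segment between a minority instance and one of its k nearest
    minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(n)
        j = neighbours[i, rng.integers(min(k, n - 1))]
        lam = rng.random()                      # interpolation coefficient
        synthetic[t] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic

# 100 % SMOTE with 5 neighbours: double the minority class by generating
# as many synthetic H samples as there are real ones
X_H = np.random.default_rng(1).normal(size=(40, 3))
X_new = smote(X_H, n_new=len(X_H), k=5)
print(X_new.shape)  # → (40, 3)
```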
Lastly, we explored another sampling approach in the WEKA filter module, namely Resample. This method refers to a supervised filter that combines undersampling and oversampling, and it works effectively in cases of large and sparse data in which much noise must be eliminated. In the oversampling step, random sampling with replacement is used. This filter can be made to bias the class distribution toward a uniform distribution, for which we selected a bias value of 1. In this step, we resampled a new dataset with the same number of samples as the original training set.
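A rough sketch of what a supervised Resample-style filter with a bias of 1 toward the uniform distribution does: draw, with replacement, a sample of the original size whose class proportions are interpolated between the original and the uniform distribution. This is a simplification of WEKA's actual filter, and the function is illustrative.

```python
import numpy as np

def resample_uniform(y, bias=1.0, seed=0):
    """Sample-with-replacement sketch of WEKA's supervised Resample filter:
    bias = 0 keeps the original class distribution, bias = 1 targets a
    fully uniform one."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    classes, counts = np.unique(y, return_counts=True)
    original = counts / n
    uniform = np.full(len(classes), 1.0 / len(classes))
    target = (1 - bias) * original + bias * uniform
    out = []
    for c, frac in zip(classes, target):
        idx = np.flatnonzero(y == c)
        out.extend(rng.choice(idx, size=round(frac * n), replace=True))
    return np.array(out)

y = np.array([0] * 737 + [1] * 352)      # original class ratio
idx = resample_uniform(y, bias=1.0)
print(len(idx), np.bincount(y[idx]))     # roughly equal class counts
```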
Modeling procedure
In line with the initial purpose, various SVM models were developed to identify imbalance problems and to find optimal solutions. Figure 1 describes our model-building sequence.

As can be seen in Fig. 1, a support vector classifier was first developed on the original imbalanced dataset (Primal-SVM) to identify the problems. The performance of internal calibration (by 10-fold cross-validation) was the main criterion for model selection. The best variable subset was selected using the wrapper method [38]. The negative effect of the imbalanced data distribution was revealed by comparing misclassification rates. After identifying the problem, eight models corresponding to 5 strategies for combating the class imbalance problem were constructed. For the analysis of algorithmic-level methods, two models (Metacost and CSVM) were developed. As described above, Metacost is a classifier implemented in WEKA that reweights the training samples according to the overall cost assigned to each class in the cost matrix. The CSVM model was constructed by increasing the penalty constant associated with the misclassification rate of the H class with respect to that of the M-P class. At the data level, subsampling, oversampling, and a combination thereof were analyzed. For the subsampling method, the number of majority-class (M-P class) instances was halved by applying the SpreadSubsample filter (models SS1 and SS2). The oversampling method was carried out using the SMOTE algorithm, doubling the number of minority H-class instances (SMOTE1 and SMOTE2). The Resample filter of WEKA was applied to combine the previous approaches (RS1 and RS2). The influence of the feature selection procedure on the sampling techniques was analyzed by building two models for each preprocessing filter: the models numbered 1 (SS1, SMOTE1, and RS1) were built directly from the same variable set as Primal-SVM, while the rest (SS2, SMOTE2, and RS2) were developed with new variables selected from the rebalanced data distribution.

All models were subjected to a revalidation process with the original imbalanced training set (871 compounds) to check the AD. Finally, an external test set of 218 compounds was predicted by each model for the robustness analysis of the applied methods.

As an additional point of interest, a consensus system was constructed by voting over the predictions of the 8 obtained models. This simple ensemble method was analyzed according to the imbalance degree of the classification results on the test set. The BCS permeability prediction of the 47 reference drugs is also discussed.
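The consensus step can be sketched as a simple majority vote over the 8 models' class labels; the tie-breaking rule and the example predictions below are assumptions, since the text does not specify them.

```python
import numpy as np

def consensus_vote(predictions):
    """Majority vote over class labels (+1 = H, -1 = M-P) predicted by
    several models; ties are broken toward the H class (an assumption
    not stated in the text)."""
    votes = np.asarray(predictions).sum(axis=0)   # one column per compound
    return np.where(votes >= 0, 1, -1)

# Hypothetical predictions of 8 models for 5 compounds
preds = np.array([
    [ 1,  1, -1, -1,  1],
    [ 1, -1, -1, -1,  1],
    [ 1,  1, -1,  1,  1],
    [ 1, -1, -1, -1, -1],
    [ 1,  1, -1, -1,  1],
    [-1,  1, -1, -1,  1],
    [ 1,  1, -1,  1,  1],
    [ 1, -1, -1, -1,  1],
])
print(consensus_vote(preds))  # → [ 1  1 -1 -1  1]
```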
Wrapper feature selection
In this study, the wrapper approach was used for feature selection. Based on SVM algorithms, two search algorithms were used in sequence for the wrapper: hill-climbing (greedy) search and best-first search. In the first step, we performed a greedy backward search through the entire space of attributes. This method, so-called backward elimination, starts with the full set of features and greedily removes the one whose removal most improves performance or degrades it least (without backtracking) [39]. In theory, going backward from the full set of features may capture interactions among features more easily; however, the main drawback of this algorithm is its expensive computational cost. In order to improve the hill-climbing feature subset selection, a best-first search was subsequently executed on the GreedyStepwise results. Essentially, best-first search selects the best variable from the entire feature space and subsequently adds new variables to the model as long as the new subset still displays a significant improvement [38]. This procedure stops when no improvement is found, and the final variable subset is returned.

Fig. 1 Strategies explored for overcoming the imbalanced Caco-2 data problem. Models were developed with a variable subset selected [a] from the Primal-SVM model or [b] independently of the Primal-SVM model. Asterisk: all sampling techniques were performed only on the training set.

All these feature selection methods were performed using WEKA v.6.0 [29]. For the GreedyStepwise method, backward search was chosen rather than forward.
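The greedy backward half of the wrapper can be sketched as follows; the best-first refinement stage is omitted, the data are synthetic, and the wrapped learner is a scikit-learn RBF SVM with 5-fold CV standing in for WEKA's SMO with 10 folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=3)

def cv_score(features):
    # wrapper criterion: CV accuracy of the wrapped SVM on a feature subset
    return cross_val_score(SVC(kernel="rbf", gamma=1.0), X[:, features], y,
                           cv=5).mean()

# Greedy backward elimination: starting from the full set, repeatedly drop
# the feature whose removal hurts the CV score the least, as long as the
# score does not degrade (no backtracking)
selected = list(range(X.shape[1]))
best = cv_score(selected)
improved = True
while improved and len(selected) > 1:
    improved = False
    scores = [(cv_score([f for f in selected if f != g]), g) for g in selected]
    score, worst = max(scores)
    if score >= best:            # removal keeps or improves performance
        selected.remove(worst)
        best = score
        improved = True
print("kept features:", selected, "CV accuracy:", round(best, 3))
```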
Classifier evaluation: applicability domain and
performance measurements
The AD is an important aspect of the evaluation of all rebalancing approaches, especially when subsampling strategies have been applied. The need to define an AD for the developed models is associated with their ability to generate reliable predictions in terms of chemical structures, physicochemical properties, and mechanisms of action. In this regard, the AD of each model was determined based on three methods (ranges, Euclidean distance, and probability density) integrated in the AmbitDiscovery software [40].

As for performance assessment, global accuracy (Q) is not an appropriate criterion for evaluating classifiers developed from an imbalanced dataset [1]. The main concern is the prediction of high-permeability compounds. For this purpose, seven performance measures derived from the confusion matrix were used (Table 1).
Another criterion to assess classification performance is the Receiver Operating Characteristic (ROC) curve. ROC graphs are two-dimensional plots in which the true positive rate, TPrate = TP/(TP+FN), is plotted on the Y-axis and the false positive rate, FPrate = FP/(FP+TN), on the X-axis, as the decision threshold is varied. Here, we evaluated the
Table 1 Confusion matrix and common performance measures for the classifier evaluation

                     Predicted H class        Predicted M-P class
Actual H class       True positives (TP)      False negatives (FN)
Actual M-P class     False positives (FP)     True negatives (TN)

Sensitivity (Se) = Recall = TP/(TP+FN)
Specificity (Sp) = TN/(TN+FP)
Precision (Pr) = TP/(TP+FP)
Accuracy (Q) = (TN+TP)/(TN+TP+FN+FP)
G-mean = √(Se × Sp)
F-measure = 2 × Pr × Se/(Pr + Se)
Matthews correlation coefficient (MCC) = (TP × TN − FN × FP)/√[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
quality of the classifiers by means of the area under the ROC curve, abbreviated AURC. It has been shown that the AURC is closely related to the well-known Wilcoxon rank-sum statistic [41,42].
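The AURC-Wilcoxon relation cited above can be checked numerically: the area under the ROC curve equals the fraction of (positive, negative) score pairs in which the positive instance is ranked higher, which is the normalized Mann-Whitney (Wilcoxon rank-sum) U statistic. The scores below are arbitrary illustrative values, not data from this study:

```python
# AURC vs. Wilcoxon-Mann-Whitney equivalence: AUC == U / (n_pos * n_neg),
# where U counts (positive, negative) pairs with the positive ranked
# higher (ties counted as 1/2). Scores are made-up illustrative values.
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

auc = roc_auc_score(y_true, y_score)

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# Mann-Whitney U statistic for the positive sample, computed directly
u = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)

print(auc, u / (len(pos) * len(neg)))  # both 0.9166...
```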
Statistical comparison of classifiers
In this study, various strategies have been explored to address the imbalanced-data classification challenge. The obtained models have been analyzed in terms of classification accuracy and rebalancing ability. However, a general comparison between the classifiers is still needed for a better understanding of the improvements brought by each strategy [43]. To this end, several non-parametric statistical tests were performed [44]. The comparison procedure starts with assessing
multiple-comparison statistical tests: the Friedman test [45] and the Iman-Davenport test [46], with the null hypothesis that the classification models do not differ on average. If the null hypothesis is rejected by either test, post-hoc Bonferroni-Dunn tests are subsequently applied at α = 0.05 and 0.10 [47]. Essentially, this test measures an average rank R_j = (1/N) Σ_i r_i^j for each classifier, where r_i^j is the rank of the j-th classifier on the i-th case, and uses a critical distance (CD) to declare ranked classifiers significantly different from the best one [43]. An advantage of the Bonferroni-Dunn test is that it is easy to describe and visualize, because it uses the same CD for all comparisons. The value of CD is computed with the formula:

CD = q_α √[k(k + 1)/(6N)]

where k is the number of classifiers compared, N is the number of ranked cases, and q_α is the Bonferroni-Dunn critical value.
This procedure was performed using in-house software adapted from Demšar's study [43]. More details of the methodology can be found in our previous comparative study of non-linear machine learning techniques [44,48].
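The comparison procedure above can be sketched as follows: a Friedman test over per-case ranks, followed by the Bonferroni-Dunn critical distance. The score matrix and the critical value q_α below are illustrative assumptions (q_α = 2.394 is the two-tailed Bonferroni-Dunn value for k = 4 classifiers at α = 0.05 from Demšar's tables), not values from this study:

```python
# Friedman test over per-case ranks, then the Bonferroni-Dunn critical
# distance CD = q_alpha * sqrt(k(k+1)/(6N)). Scores are illustrative.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# rows = ranked cases (N), columns = classifiers (k); higher is better
scores = np.array([
    [0.78, 0.82, 0.85, 0.84],
    [0.62, 0.71, 0.80, 0.81],
    [0.83, 0.84, 0.85, 0.86],
    [0.73, 0.76, 0.81, 0.82],
    [0.50, 0.51, 0.60, 0.61],
])
N, k = scores.shape

stat, p = friedmanchisquare(*scores.T)

# Average rank per classifier (rank 1 = best, so rank the negated scores)
ranks = np.apply_along_axis(rankdata, 1, -scores)
avg_ranks = ranks.mean(axis=0)

# Assumed two-tailed Bonferroni-Dunn critical value for k = 4, alpha = 0.05
q_alpha = 2.394
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))
print(round(cd, 3))  # → 1.955
```

Classifiers whose average rank lies within CD of the best-ranked model are considered statistically equivalent to it.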
Results and discussion
Identification of the imbalanced data problem with the standard SVM algorithm
The Primal-SVM model was obtained with 10 variables from the original imbalanced training set of 871 compounds. Unsurprisingly, this model was significantly biased toward M-P compounds: 505 of the 589 M-P cases were correctly predicted (Sp of about 86 %), while only 176 instances were correctly classified as H class, representing a sensitivity (Se) of 62.41 % for this class. Although the overall accuracy of this model remained acceptable (Q > 78 %), the classifier cannot be used for screening high-permeable compounds. The relatively low G-mean (about 73 %) reflected this situation. Nevertheless, the selected variables still correlated well with permeability (MCC ∼ 0.5). Note that the model included only 10 variables and, according to the parameter optimization procedure, used a relatively low penalty (C = 320) for classification errors. The class ratio of our data was 589/282, and Primal-SVM showed a similar label bias in its predictions on the dataset (611/260).
The above results showed that the class distribution was the main reason for the low performance. In addition, we did not observe any small clusters in the k-MCA analysis; therefore, a within-class imbalance problem can be ruled out. Inspection of the support vectors (SVs) around the hyperplane revealed a balanced split (of 455 SVs, 215 belonged to the H class and 240 to the M-P class), suggesting that the skewed distribution and the overlap between classes are the basic reasons behind the imbalance problem.
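For illustration, the Primal-SVM figures quoted above can be reproduced from confusion-matrix counts implied by the reported totals (TP = 176, FN = 106, FP = 84, TN = 505, taking H as the positive class; the FP/TN split is our reconstruction from the 589/282 class ratio and 611/260 prediction ratio, not counts given explicitly in the paper):

```python
# Recompute the Primal-SVM performance measures from confusion-matrix
# counts reconstructed from the reported totals; H is the positive class.
from math import sqrt

TP, FN, FP, TN = 176, 106, 84, 505

se = TP / (TP + FN)                      # sensitivity (recall)
sp = TN / (TN + FP)                      # specificity
q = (TP + TN) / (TP + TN + FP + FN)      # overall accuracy
g = sqrt(se * sp)                        # G-mean
mcc = (TP * TN - FN * FP) / sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(f"Se={se:.2%} Sp={sp:.2%} Q={q:.2%} G={g:.2%} MCC={mcc:.2f}")
# → Se=62.41% Sp=85.74% Q=78.19% G=73.15% MCC=0.49
```

These values match the Se, Q, G-mean, and MCC figures quoted in the text.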
Solutions at algorithmic level
As described above, the Metacost classifier was developed by modifying the misclassification costs associated with each class. The performance of this model improved slightly with respect to the Primal-SVM model: the accuracy of H-class prediction was 71.28 %, an increase of nearly nine percentage points over the first model (see Table 2). Using the same variable subset, only 193 boundary SVs (20 % of the training set) were found, suggesting that Metacost is the simpler classifier. However, precision remained relatively weak, so the F-measure was still low.
On the other hand, CSVM appeared to be effective in rebalancing the prediction results: the TP rates were 77.30 and 77.08 % for the H and M-P classes, respectively. However, such a large gain in recall (Se) came at the expense of precision (Pr), which dropped to 61 %, and the overall accuracy did not improve. Compared with Metacost, CSVM performed slightly better. In general, at the algorithmic level, Metacost and CSVM were tolerable; however, the overall improvements were not significant, and there was a trade-off between the Se and Pr measurements.
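The cost-sensitive idea behind CSVM can be sketched with scikit-learn's class_weight option, which scales the SVM penalty C per class. The toy dataset and the 10:1 weight below are illustrative assumptions, not the settings used in this study:

```python
# Cost-sensitive SVM sketch: a larger penalty on minority-class errors
# shifts the decision boundary toward the majority class, trading
# precision for recall on the minority class.
from sklearn.metrics import recall_score
from sklearn.svm import SVC

# Tiny 1-D toy set: class 1 is the minority and overlaps class 0.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]

plain = SVC(kernel="linear", C=1.0).fit(X, y)
# Errors on class 1 cost 10x more than errors on class 0.
costly = SVC(kernel="linear", C=1.0, class_weight={1: 10.0}).fit(X, y)

rec_plain = recall_score(y, plain.predict(X))
rec_costly = recall_score(y, costly.predict(X))
```

As in the CSVM results above, the class-weighted model raises (or at worst preserves) minority-class recall, while precision typically drops.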
Solutions at data level
Subsampling method
First, 307 compounds belonging to the M-P class were excluded from the training set. The remaining 564 compounds have the two classes equally distributed. The same variables previously selected for Primal-SVM were used to develop SS1, while for SS2 a new variable selection was performed (Table 2).

In comparison with Primal-SVM, the performances of the SS1 and SS2 models were better rebalanced, since the distribution of the data had been changed. However, the difference between the variable subsets selected by the SS1 and SS2 models clearly affected model performance. The classification accuracy for the M-P class by the SS1 model was very low (72.70 %) compared with Primal-SVM (85.74 %); the overall accuracy was therefore lower than that of Primal-SVM. Meanwhile, SS2 performance improved significantly over Primal-SVM. A further analysis of the validation results of these two models on the overall training and test sets is necessary to confirm the robustness of the obtained models, because 307 "hidden" compounds were left out of the current training process.
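Random undersampling as applied here can be sketched in a few lines: drop randomly chosen majority-class samples until both classes are the same size. The class sizes match the training set described above (589 M-P / 282 H), but the descriptor matrix is a placeholder:

```python
# Random undersampling sketch: remove majority-class samples at random
# until both classes are equal in size (589 -> 282, dropping 307 cases).
import numpy as np

rng = np.random.default_rng(0)

y = np.array([0] * 589 + [1] * 282)      # 0 = M-P (majority), 1 = H
X = rng.normal(size=(len(y), 5))         # placeholder descriptors

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Keep only as many majority samples as there are minority samples.
kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
keep = np.sort(np.concatenate([kept_maj, min_idx]))

X_bal, y_bal = X[keep], y[keep]
print(len(y_bal), int((y_bal == 0).sum()), int((y_bal == 1).sum()))
# → 564 282 282
```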
Table 2 Performance of SVM models under 10-fold cross-validation following different strategies

Algorithmic level: Metacost (10)  193/871  85.43  71.28  63.81  75.82  67.34  0.51  77.61

a SVMs obtained by the 10-fold cross-validation method; the quantity in parentheses indicates the number of variables included in each classifier
b Ratio between the number of support vectors and the size of the training set (original or resampled). Results of the C optimization procedure by cross-validation: Primal-SVM (320), Metacost (510), CSVM (200 for the H class and 100 for the M-P class), SS1 (1100), SS2 (421.05), SMOTE1 (810), SMOTE2 (540), RS1 (490), and RS2 (572.73)
Oversampling method
After creating 282 new "artificial" H-class cases with the SMOTE method, the training set grew to 1153 compounds. The performance of SMOTE1 and, especially, SMOTE2 was significantly rebalanced and improved in comparison with Primal-SVM (Table 2). The SMOTE2 model exhibited a highly balanced performance, with F-measures >81 % and TPrate >83 %. From a simple visualization, it is clear that oversampling was one of the most promising strategies for treating imbalanced data. Above all, the problem of a reduced AD could be ruled out. However, since these models were trained on an altered distribution, and the FN numbers of SMOTE1 and SMOTE2 (110 and 95, respectively) were similar to that of Primal-SVM (106), revalidation on the original training set distribution is necessary to assess the real discriminative capacity of these models.
Combination of simple subsampling and oversampling strategies: resampling approach
To generate a new balanced distribution, a resampling approach that combines the advantages of subsampling and oversampling was applied. A set of 280 H-class cases was randomly chosen and duplicated, while 411 M-P cases were randomly removed, giving a final balanced dataset of the same size as the original training set (871 compounds). As can be appreciated in Table 2, the two models RS1 and RS2 display similarly high performance. Interestingly, the distributions of support vectors (H vs. M-P classes) and the numbers of selected variables in RS1 and RS2 were the same as in the Primal-SVM model. Additionally, given the high MCC and G-mean values, this strategy should be a promising solution for overcoming the imbalance problem. As with the other data-level techniques, validation on both the original training set and the test set is essential to establish the real effectiveness of this strategy.
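The combined strategy can be sketched as duplicating randomly chosen minority samples while removing randomly chosen majority samples so that the rebalanced set keeps the original size. Class sizes below follow the training set (589 M-P / 282 H); the exact duplication and removal counts used in the study may differ:

```python
# Combined resampling sketch: oversample the minority class (by
# duplication) and undersample the majority class so that the final set
# keeps the original training-set size of 871.
import numpy as np

rng = np.random.default_rng(0)
n_maj, n_min, n_total = 589, 282, 871

maj = np.arange(n_maj)                    # indices of M-P samples
minr = np.arange(n_maj, n_maj + n_min)    # indices of H samples

target = n_total // 2                     # ~435 samples per class
dup = rng.choice(minr, size=target - n_min, replace=False)        # duplicate H
keep_maj = rng.choice(maj, size=n_total - target, replace=False)  # subsample M-P

resampled = np.concatenate([keep_maj, minr, dup])
print(len(resampled))  # → 871
```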
Validation of classification models
Two validation processes were carried out: (i) on the original imbalanced dataset and (ii) on the external test set. In the revalidation on the training set, besides calculating the performance measures, the number of compounds outside the model AD was taken into account. Results of the test set validation are presented in Table 3. A consensus model was also developed. The models developed by each strategy were compared with Primal-SVM. Unsurprisingly, models based on algorithmic modifications showed only a slight improvement, while the models of the other category were all noticeably better. Within the first group, CSVM still performed better than Metacost on the test set. Conversely, subsampling, oversampling, and their combination proved appropriate for overcoming imbalanced data problems. Note that an acceptable number (≤10) of compounds fell outside the ADs determined by all models. Therefore, the possibility of discarding valuable information, which remained our principal concern when applying undersampling techniques, could be ruled out under the currently proposed workflow. An interesting finding was that the variable selection procedure significantly affected the predictive capability of the classification models, even though the cross-validation results of the same algorithm seemed similar. That is, when sampling methods were applied, variable sets selected from the new, balanced data distribution did not outperform the variables selected from the original imbalanced distribution. The 10 variables selected by Primal-SVM showed surprisingly good predictive ability on the test set, especially when an appropriate sampling strategy was applied.

Table 3 Validation performance of obtained models on the test set

a Number of compounds outside the applicability domain (training set/test set)

Note that the two classes of the test set display the same unbalanced distribution as the original training set. Analyzing the prevalence (the balance between TP and TN rates) of SMOTE1 and RS1, we observed a high degree of balance (80.00 %/81.41 % for SMOTE1 and 81.43 %/82.76 % for RS1), suggesting that these two strategies are suitable for the current unbalanced data.
Interestingly, the AURC analysis did not give a clear conclusion about the performance improvement over Primal-SVM (see Tables 2 and 3): there is little change from the first model, although the prevalence of the predictions has been rebalanced. A previous study showed that the proportion of positive to negative instances in the test set does not affect the ROC curve [49]. Here, given the unbalanced distribution of the test set, using the AURC to select an appropriate strategy might be misleading: the AURC may depend not on the class distribution but on the degree of overlap in the data.
Finally, a consensus model was constructed based on a voting mechanism. In general, this model slightly outperforms the best standalone models (SMOTE1 and RS1), with a G-mean of 80.81 % and an F1 of 73.33 %. As displayed in Table 3, this consensus model showed a great advantage over the other classifiers in AD coverage. Furthermore, using such a multiclassifier could eliminate the concern about choosing the wrong solution [44,48].
Statistical comparison of classifiers
In order to illustrate the differences between the obtained models, the performances of all nine SVM models in the test set validation were subjected to multiple-comparison procedures. As the first step, the average rank of each classifier was calculated. Accordingly, the classifiers were ranked as follows: RS1 – RS2 – SMOTE1 – CSVM – SS1 – Metacost – SS2 – Primal-SVM (see Fig. 2). The null hypothesis of the Friedman test was rejected (p = 0.00), so there is a significant difference among the obtained classifiers. The same result was observed with the Iman-Davenport test (p < 0.0005). Subsequently, the post-hoc Bonferroni-Dunn test (at α = 0.05 and 0.10) was applied to reveal which classification models performed equivalently to the best-ranked model (RS1). The resulting CD values were 3.145 for α = 0.05 and 2.884 for α = 0.10.

Fig. 2 Rankings obtained through the Friedman test and graphical representation of the Bonferroni-Dunn procedure, considering RS1 as the control model. Significance levels α = 0.05 and 0.10 are shown as continuous and dotted lines, respectively
Compared with the lowest bar, which corresponds to the best model (RS1), CSVM, SS1, SMOTE1, and RS2 can be considered to have similar performance, since none of them exceeds the critical difference (CD) of the Bonferroni-Dunn test. Additionally, it is possible to identify the models that are significantly worse than the others, namely Primal-SVM and SS2.