Experimental Analysis of New Algorithms for
Learning Ternary Classifiers
Jean-Daniel Zucker, IRD, UMI 209, UMMISCO
IRD France Nord, F-93143, Bondy, France,
INSERM, UMR-S 872, les Cordeliers,
Nutriomique (Eq 7), Paris, F-75006 France;
Sorbonne Universités, Univ Paris 06, UMI 209
UMMISCO, F-75005, Paris, France;
Equipe MSI, IFI, Vietnam National University,
144 Xuan Thuy, Hanoi, Vietnam;
jean-daniel.zucker@ird.fr
Yann Chevaleyre, LIPN, CNRS UMR 7030, Université Paris Nord
93430 Villetaneuse, France chevaleyre@lipn.univ-paris13.fr
Dao Van Sang, IFI, Equipe MSI, IRD, UMI 209 UMMISCO, Vietnam National University, Hanoi, Vietnam, clairsang@gmail.com
Abstract—Discrete linear classifiers form a very sparse class of decision models that has proved useful to reduce overfitting in very high-dimensional learning problems. However, learning a discrete linear classifier is known to be a difficult problem: it requires finding a discrete linear model minimizing the classification error over a given sample. A ternary classifier is a classifier defined by a pair (w, r), where w is a vector in {-1, 0, +1}^n and r is a nonnegative real capturing the threshold or offset. The goal of the learning algorithm is to find a vector of weights in {-1, 0, +1}^n that minimizes the hinge loss of the linear model on the training data. This problem is NP-hard, and one approach consists in exactly solving the relaxed continuous problem and then heuristically deriving discrete solutions. A recent paper by the authors introduced a randomized rounding algorithm [1]; in this paper we propose more sophisticated algorithms that improve the generalization error. These algorithms are presented and their performance is experimentally analyzed. Our results show that this kind of compact model can address the complex problem of learning predictors from bioinformatics data, such as metagenomics data, where the sample size is much smaller than the number of attributes. The new algorithms presented improve on the state-of-the-art algorithm for learning ternary classifiers; this improvement comes at the expense of time complexity.
Index Terms—Ternary Classifier, Randomized Rounding,
Metagenomics data
I. INTRODUCTION AND MOTIVATION
Learning classifiers from very high-dimensional data has received more and more attention in the past ten years, both theoretically [2], [3], [4], [5], [6], [7] and in practice through data mining applications [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]. In biology in particular, there is a recent paradigmatic shift towards predictive tools [18]. In the meantime, Omics data (genomics, transcriptomics, proteomics, metabolomics, etc.) [19], [20], [21] have increased exponentially, and in the future medicine will rely more and more on such data to provide personalized medicine [20]. When the number of dimensions p is greater than the number of examples N (p >> N), the problem of overfitting
Fig. 1. A linear classifier vs. a ternary classifier. The latter corresponds to adding, subtracting, or ignoring features to build a decision function.
becomes more and more acute. In Omics data in particular, the number of dimensions can reach a few million. This is the case, for example, in metagenomics of the gut microflora, where a recently published catalog counts about ten million entries [22], [23]. In such a setting, the ratio N/p is so small that both feature selection and sparse learning are required to diminish the risk of overfitting [24]. The goal of this research is to find sparse models that support learning classifiers that scale to high-dimensional data such as metagenomics. Following a recent paper by the authors [1], we explore models called ternary classifiers. They are extremely sparse models, simpler than linear combinations of real weights. Such models are represented as a weighted Boolean combination (weights are in {-1, 0, +1} instead of being in R) compared to a given threshold. We explore algorithms that minimize the hinge loss of such models induced from the training data and improve the generalization error of the first rounding algorithm proposed in [1].
II. TERNARY CLASSIFIER

Let us first introduce more formally the concept of ternary classifier. Let us consider an example as a pair (x, y) where x
is an instance in R^n and y is a label in {-1, +1}. A training set is a collection (x_t, y_t)_{t=1}^m of examples. A ternary-weighted linear threshold concept c is a pair (w, r) where r is a real number capturing the threshold or offset and w is a vector in {-1, 0, +1}^n.
A loss function is a map Θ : {-1, 0, +1}^n × R × R^n × {-1, +1} → R_+ such that Θ(c; x, y) measures the discrepancy between the value predicted by c on x and the true label y. For a concept c and a training set S = (x_t, y_t)_{t=1}^m, the cumulative loss of c with respect to S is given by L(c; S) = Σ_{t=1}^m Θ(c; x_t, y_t). Based on this performance measure, the combinatorial optimization problem investigated in this study is described as follows.
Loss Minimization over {-1, 0, +1}^n
Given:
1) a target concept class C ⊆ {-1, 0, +1}^n,
2) a training set S = {(x_t, y_t)}_{t=1}^m,
3) a loss function Θ.
Find: a concept c ∈ C that minimizes L(c; S).
Recall that the zero-one loss is the loss function defined by Θ_01(c; x, y) = 1 if sgn(⟨w, x⟩ − r) ≠ y and Θ_01(c; x, y) = 0 otherwise. Because this loss function is known to be hard to optimize, we shall concentrate on the hinge loss, a well-known surrogate of the zero-one loss, defined as follows: Θ_γ(c; x, y) = (1/γ) max[0, γ − y(⟨w, x⟩ − r)], where γ ∈ R_+.
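To make the loss concrete, here is a minimal R sketch of the cumulative hinge loss of a ternary classifier (w, r) on a sample; the function name and the default γ = 1 are our own choices, not part of the authors' terDA package.

```r
# Cumulative hinge loss L(c; S) of a ternary classifier c = (w, r).
# X: m x n matrix of instances, y: labels in {-1, +1},
# w: weights in {-1, 0, +1}^n, r: real offset, gamma: margin parameter.
hinge_loss <- function(X, y, w, r, gamma = 1) {
  margins <- y * (X %*% w - r)            # y_t (<w, x_t> - r) for each example
  sum(pmax(0, gamma - margins)) / gamma   # sum_t Theta_gamma(c; x_t, y_t)
}
```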
Finally, let us represent this optimization problem as a linear classification problem.

Linear formulation: in addition to the variables w_1, ..., w_n and b, we need m slack variables ξ_1, ..., ξ_m. The formulation is close to the standard linear SVM formulation, except that the w_i's are bounded:

  minimize    Σ_{i∈[m]} ξ_i
  subject to  −1 ≤ w_j ≤ 1                  for all j ∈ [n]
              y_i(w · x_i + b) ≥ 1 − ξ_i    for all i ∈ [m]
              ξ_i ≥ 0                       for all i ∈ [m]
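For illustration, the relaxed problem can be fed to an off-the-shelf LP solver. The sketch below uses the lpSolveAPI interface to lp_solve; the interface choice, the column layout and the function name are assumptions of ours, not the authors' terDA implementation (which relies on lpSolve [26]).

```r
# Sketch of the relaxed LP above with the lpSolveAPI interface to lp_solve.
# Column layout (an assumption of this sketch): w_1..w_n, then b, then xi_1..xi_m.
library(lpSolveAPI)

solve_relaxed <- function(X, y) {
  n <- ncol(X); m <- nrow(X)
  nvar <- n + 1 + m
  lp <- make.lp(0, nvar)                          # empty model with nvar columns (minimization)
  set.objfn(lp, c(rep(0, n + 1), rep(1, m)))      # minimize the sum of the slacks xi_i
  for (i in seq_len(m)) {
    slack <- rep(0, m); slack[i] <- 1
    # y_i (w . x_i + b) + xi_i >= 1
    add.constraint(lp, c(y[i] * X[i, ], y[i], slack), ">=", 1)
  }
  set.bounds(lp,
             lower = c(rep(-1, n), -Inf, rep(0, m)),
             upper = c(rep( 1, n),  Inf, rep(Inf, m)),
             columns = seq_len(nvar))             # -1 <= w_j <= 1, b free, xi_i >= 0
  solve(lp)
  sol <- get.variables(lp)
  list(w = sol[seq_len(n)], b = sol[n + 1])
}
```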
Randomized Rounding Method (RR method)
A heuristic approach to solve the optimization problem described above consists in solving the relaxed problem, where the coefficients are real (in [−1, 1]), and then performing a rounding on randomly selected real coefficients. Finally, it is possible to compute the error rate of the obtained linear model. The basic idea of using the probabilistic method to convert an optimal solution of a relaxation of a problem into an approximately optimal solution to the original problem is standard in computer science and operations research [25]. Let us first recall the simple algorithm, called Randomized Round, used to round a real number belonging to [−1, 1] (see Fig. 2). By repeating the rounding algorithm (see Fig. 2) a number of times corresponding to the number of dimensions, a model with all its coefficients in the set {-1, 0, +1} is obtained.
Algorithm Randomized Rounding (Round)
INPUT: a real number α ∈ [−1, 1]
OUTPUT: an integer in {−1, 0, 1}
if α ≥ 0 then
  Draw randomly u ∈ {0, 1} such that Pr(u = 1) = α
  return u
else
  return −Round(−α)
end if

Fig. 2. Randomized Rounding method (RR algorithm).

Algorithm Round-k-weights
INPUT: a vector w ∈ [−1, 1]^n, an integer k
OPTIONAL INPUT: a permutation σ over {1..n} (by default, σ is the identity)
OUTPUT: a vector in {−1, 0, 1}^n
for j = 1..k do
  w_σ(j) ← Round(w_σ(j))
end for
return w

Fig. 3. Randomized Rounding of k weights (Round-k-weights).
This algorithm, which randomly rounds k weights, is called "Round-k-weights" (see Fig. 3).
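A minimal R sketch of these two routines, mirroring the pseudocode of Fig. 2 and Fig. 3 (the function names are ours):

```r
# Randomized rounding of a single real weight alpha in [-1, 1]:
# returns 1 with probability alpha when alpha >= 0, 0 otherwise; symmetric for alpha < 0.
round_weight <- function(alpha) {
  if (alpha >= 0) rbinom(1, size = 1, prob = alpha) else -round_weight(-alpha)
}

# Round the first k weights of w, taken in the order given by the permutation sigma.
round_k_weights <- function(w, k, sigma = seq_along(w)) {
  for (j in seq_len(k)) {
    w[sigma[j]] <- round_weight(w[sigma[j]])
  }
  w
}
```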
III. NEW ALGORITHMS TO LEARN TERNARY CLASSIFIERS

In this section, we present three rounding algorithms more elaborate than the initial one provided by [1] and called Round-k-weights:
1) RR-all-weights: this algorithm explores different randomized roundings of all weights and returns the best one (see Fig. 4).
2) RRS-k-rand-weights: this algorithm selects k random weights and then, after rounding, calls the solver again on the problem where the k rounded weights are forced to their integer value (see Fig. 5).
3) RRS-k-best-weights: this algorithm selects the k "best" weights and then, after rounding, calls the solver again on the problem where the k rounded weights are forced to their integer value (see Fig. 6).
A. Rounding over all weights (RR-all-weights)

As described above, we solve the problem in the case of real coefficients in the interval [−1, 1]. We thus obtain a vector of real weights w. Next, we run M times (this number M is a meta-parameter of all the algorithms presented; we used M = 100 in our experiments) the Round-k-weights algorithm with k = n (i.e., for all weights). The output of the Randomized Rounding algorithm on this vector w is a vector of discrete coefficients in the set {-1, 0, +1}. Next, we compute the error rate of each model based on the respective values of the vector. Finally, we select the vector of coefficients w (out of the M vectors) which gives the best error rate.
1: INPUT: a real vector of coefficients w ∈ [−1, +1]^n, a number of iterations M
2: OUTPUT: an integer vector w̄ ∈ {−1, 0, +1}^n minimizing the hinge loss
3: List ← {}
4: for t ← 1..M do
5:   w̄ ← Round-k-weights(w, k = n)
6:   add w̄ to List
7: end for
8: Compute the hinge loss of all vectors in List
9: return the best vector w̄

Fig. 4. Rounding over all weights (RR-all-weights).
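A compact R sketch of RR-all-weights, reusing the hinge_loss and round_k_weights helpers sketched above (again, an illustration rather than the authors' code):

```r
# RR-all-weights: draw M full roundings of the relaxed weight vector w
# and keep the candidate with the smallest hinge loss on the training set.
rr_all_weights <- function(X, y, w, r, M = 100) {
  best <- NULL
  best_loss <- Inf
  for (t in seq_len(M)) {
    cand <- round_k_weights(w, k = length(w))    # round all n weights at once
    loss <- hinge_loss(X, y, cand, r)
    if (loss < best_loss) { best <- cand; best_loss <- loss }
  }
  best
}
```

On top of a relaxed solution, this function would typically be called once per dataset; M = 100 matches the value used in the experiments below.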
1: INPUT: a real vector of coefficients w ∈ [−1, +1]^n, an integer k, a number of iterations M
2: OUTPUT: an integer vector in {−1, 0, 1}^n minimizing the hinge loss
3: Call Solver to initialize the vector w
4: S ← {1..n}
5: for t ← 1..M do
6:   while S ≠ ∅ do
7:     Let T be a set of k integers drawn from S without replacement
8:     S ← S \ T
9:     for j ∈ T do
10:      w_j ← Round(w_j) and force w_j to this integer value
11:    end for
12:    Call Solver again to recompute the real coefficients of the remaining attributes (the ones that have not yet been forced)
13:  end while
14: end for
15: return w, the best among the M vectors

Fig. 5. Repeat Round and Solve over k random weights (RRS-k-rand-weights).
B. Combine Rounding and Solving over k random weights (RRS-k-rand-weights)

In this algorithm, the linear model solver is called many times instead of only once as in the previous algorithm. Let us first consider the real coefficient vector w obtained from the linear model. First of all, the real coefficients that happen to be either 0, -1 or 1 are stored in the object wInt and will not be changed in any subsequent step of the algorithm. This set of integer weights wInt may be empty, depending on the problem. The coefficients that are not yet integer are stored in a set wNInt, which is processed in the following way: consider a permutation σ over the set wNInt and a parameter k; we run a loop with i going through the set {σ(j), j ∈ wNInt}.

In this loop, for each i taken from 1 to ‖wNInt‖, the value wNInt[σ(i)] of w is rounded by Randomized Rounding (RR). Then the index of the element wNInt[σ(i)] and the value RR(wNInt[σ(i)]) are added to the object wInt. Then,
1: INPUT: a real vector of coefficients w ∈ [−1, +1]^n, an integer k, a number of iterations M
2: OUTPUT: an integer vector in {−1, 0, +1}^n minimizing the hinge loss
3: Call Solver to obtain the initial vector w
4: Find in w all the coefficients that are not yet integer and save them in the set wNInt
5: for t ← 1..M do
6:   Take a random permutation σ on wNInt
7:   for i ← σ(1)..σ(‖wNInt‖) do
8:     Find the k "best" attributes for rounding
9:     Force rounding on these k best attributes
10:    Call Solver again to recalculate the real coefficients of the remaining attributes (not yet forced)
11:  end for
12: end for
13: return w, the best vector among the M

Fig. 6. Repeat Round and Solve over the k best weights (RRS-k-best-weights).
depending on whether i is a multiple of k or not, the solver is called. If the index i of the loop is a multiple of k, the solver is called to find real solutions to the initial linear model, with the added conditions that all coefficients indexed in wInt are assigned to their integer values in wInt.

At the end of the loop, all the weights in w have been rounded by RR. We can thus compute the hinge loss. This sequence is then repeated with M different permutations σ. The output of the algorithm is the vector of coefficients that best minimizes the hinge loss (see Fig. 5).
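The round-then-resolve loop can be sketched in R as follows. It assumes a hypothetical helper solve_relaxed_fixed(X, y, fixed_idx, fixed_val) that re-solves the linear program of Section II with the listed coordinates clamped to their values, and reuses round_weight and hinge_loss from the earlier sketches; the sign convention r = -b is also an assumption of ours.

```r
# RRS-k-rand-weights (sketch): alternately round k randomly chosen coordinates
# and re-solve the relaxed LP on the coordinates that are not fixed yet.
rrs_k_rand_weights <- function(X, y, k, M = 100) {
  init <- solve_relaxed(X, y)                     # initial relaxed solution (w, b)
  best <- NULL; best_loss <- Inf
  for (t in seq_len(M)) {
    w <- init$w; b <- init$b
    remaining <- which(!(w %in% c(-1, 0, 1)))     # wNInt: indices not yet integer
    while (length(remaining) > 0) {
      drawn <- remaining[sample.int(length(remaining), min(k, length(remaining)))]
      w[drawn] <- sapply(w[drawn], round_weight)  # round and freeze k random weights
      remaining <- setdiff(remaining, drawn)
      fixed <- setdiff(seq_along(w), remaining)   # wInt: all frozen integer weights
      refit <- solve_relaxed_fixed(X, y, fixed_idx = fixed, fixed_val = w[fixed])
      w[remaining] <- refit$w[remaining]
      b <- refit$b
    }
    loss <- hinge_loss(X, y, w, r = -b)           # offset convention r = -b (assumption)
    if (loss < best_loss) { best <- list(w = w, b = b); best_loss <- loss }
  }
  best
}
```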
C. Repeat Round and Solve over the Best k Weights (RRS-k-best-weights)

In this third algorithm, instead of choosing sets of k random weights to be rounded before solving the problem again, we search at each step for the k best coefficients on which to apply rounding. Here, the best coefficients are those whose distance to the nearest value among -1, 0 and 1 is the smallest. The intuition behind this algorithm is that the weights closest to integer values ought to be chosen first, as they have a priori (this is just a heuristic) a greater chance of belonging to the optimal solution.
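The selection step can be written in a few lines of R (our own sketch); the rest of the algorithm is the same as RRS-k-rand-weights, with this selection replacing the random draw.

```r
# Indices of the k coefficients closest to an integer value in {-1, 0, +1}.
k_best_indices <- function(w, k, candidates = seq_along(w)) {
  dist_to_int <- sapply(w[candidates],
                        function(v) min(abs(v - c(-1, 0, 1))))  # distance to nearest ternary value
  candidates[order(dist_to_int)[seq_len(min(k, length(candidates)))]]
}
```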
IV. EXPERIMENTS

We tested the empirical performance of our new algorithms against RR-all-weights by conducting experiments on several metagenomics databases used as benchmarks. All algorithms were written in R, and the package used to solve linear problems is lpSolve [26]. The performance of lpSolve cannot compete with the CPLEX solver from IBM, but we chose lpSolve because it is open source and integrates smoothly with R. All the algorithms in this paper have been integrated in an R package called terDA, which will be made available to the community in the near future. All tests were executed on a PC with an Intel Core i5 processor and 8 GB of RAM. The goal of these experiments is to evaluate whether the different ideas for improving the RR method actually improve the generalization error when learning ternary classifiers, and at what cost in terms of CPU time.
A. Metagenomics Datasets

In this paper, we used public metagenomics data from the European project METAHIT. These data correspond to next-generation sequencing of the gut microbiota. The microbiota is a community of microorganisms (viruses, bacteria, fungi, archaea) living in our gut. Two problems are considered: the first one is to distinguish the status of patients (obese vs. lean) based on their microbiota, and the second one is to identify which microbiota signatures support predicting whether an individual will have a high vs. low gene count. In both cases there is an epidemiological interest in being able to devise microbiota signatures to predict patient phenotypes. The data are described in Table I.

In both cases we consider 172 patients and thousands of dimensions corresponding to counts of particular metaspecies of the microbiota that have been identified. The complete data include 3.3 million counts per patient, but we have not used such raw data.
B. Results

We performed our tests following 1-fold and 10-fold cross-validation procedures repeated ten times. Here 1-fold corresponds to the case where the data are used both to learn and to test the model; the error measured is thus the empirical error.
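For reference, a minimal way to build such repeated 10-fold partitions in R (our own sketch; not necessarily the exact protocol used by the authors):

```r
# Fold labels for k-fold cross-validation repeated several times.
repeated_folds <- function(n_examples, n_folds = 10, n_repeats = 10) {
  lapply(seq_len(n_repeats),
         function(r) sample(rep_len(seq_len(n_folds), n_examples)))
}
```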
Table II shows all the results in terms of empirical error obtained for the three algorithms. In the case of learning ternary classifiers, it is interesting to look at the empirical error because the model is so sparse that it is likely not to have a 0% empirical error.

The errors in generalization for all algorithms and the four datasets (10-fold cross-validation) are given in Table III.
To further compare the algorithms with the RR method, we also explored the impact on the error rate when learning from DatasetMeta1 while varying two parameters: the number of roundings M and the parameter k. The results are shown in Figure 7 and Figure 8, respectively.

Figure 9 shows the evolution, as a function of k, of the generalization error rate estimated using 10-fold cross-validation.
V. DISCUSSION AND CONCLUSIONS

The results of Table II show that RRS-k-rand-weights and RRS-k-best-weights significantly decrease the empirical error. Nevertheless, this could simply mean that RR-all-weights overfits less, which would explain the result. Table III shows that, on the four datasets, RRS-k-rand-weights and RRS-k-best-weights also improve the generalization error. The improvement is very significant on DatasetMeta1, DatasetMeta2 and DatasetMeta4, and less significant on DatasetMeta3. It is also clear that RRS-k-best-weights outperforms RRS-k-rand-weights when N/p is smaller than 1,
Fig. 7. Empirical error rate vs. number of roundings. Here k is set to 5 for RRS-k-rand-weights and k = 2 for RRS-k-best-weights; the number of roundings takes values from 1 to M = 100. The dataset is DatasetMeta1.
Fig. 8. Empirical error rate vs. parameter k.
Fig. 9. Generalization error rate vs. parameter k in 10-fold cross-validation.
which is the case for DatasetMeta3 and DatasetMeta4. The improvements come at the expense of CPU time, as the solver is called several times; the time complexity is thus an order of magnitude higher than that of the RR-all-weights algorithm. The figures showing the influence of the number of roundings are not surprising: they show that the empirical error rate stabilizes as M, the number of iterations, increases. The parameter k also has an impact on the empirical error rate: as it grows, RRS-k-rand-weights becomes worse whereas RRS-k-best-weights is less influenced.

TABLE I
Metagenomics datasets used, respectively with N/p greater and lower than 1. The numbers in parentheses give the number of examples for each class.
TABLE II
Table of the empirical error of each algorithm for each dataset, averaged over ten executions.
TABLE III
Table of the error in generalization (10-fold) of each algorithm for each dataset, averaged over ten executions.
Overall, these experiments suggest that there is room for improving the original RR-all-weights algorithm that supports learning ternary classifiers. This paper is experimental, and its main result is first to propose two original algorithms to learn ternary classifiers, and second to suggest that RRS-k-best-weights is the most promising improvement made so far to the original RR-all-weights algorithm, as it gives better empirical and generalization errors and is robust with respect to various values of k and M. Future work includes a theoretical analysis of the algorithms and a thorough analysis of RRS-k-best-weights on other benchmark data.
A. Acknowledgments

The authors would like to express their thanks to Dr. Edi Prifti (INRA, France) for his help in preparing the benchmark datasets. We would also like to thank the anonymous reviewers. The first author is partially supported by the METACARDIS project, funded by the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement HEALTH-F4-2012-305312.
REFERENCES
[1] Y. Chevaleyre, F. Koriche, and J.-D. Zucker, "Rounding methods for discrete linear classification," JMLR WCP, vol. 28, no. 1, pp. 651–659, 2013.
[2] J. Fan and R. Samworth, "Ultrahigh dimensional feature selection: beyond the linear model," Journal of Machine Learning Research, 2009.
[3] T. Hastie and R. Tibshirani, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2009.
[4] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, with Applications in R. Springer Science & Business, Jun. 2013.
[5] J. Kogan, "Introduction to Clustering Large and High-Dimensional Data," pp. 1–222, Dec. 2007.
[6] A. Kalousis, J. Prados, and M. Hilario, "Stability of feature selection algorithms: a study on high-dimensional spaces," Knowledge and Information Systems, vol. 12, no. 1, pp. 95–116, 2007.
[7] L. Parsons, E. Haque, and H. Liu, "Subspace clustering for high dimensional data: a review," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 90–105, 2004.
[8] J. Oh and J. Gao, "A kernel-based approach for detecting outliers of high-dimensional biological data," BMC Bioinformatics, vol. 10, no. Suppl 4, p. S7, 2009.
[9] B. Hanczar, J. Hua, and E. R. Dougherty, "Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings," EURASIP Journal on Bioinformatics and Systems Biology, vol. 2007, pp. 1–13, 2007.
[10] S. Rojas-Galeano, E. Hsieh, D. Agranoff, S. Krishna, and D. Fernandez-Reyes, "Estimation of Relevant Variables on High-Dimensional Biological Patterns Using Iterated Weighted Kernel Functions," PLoS ONE, vol. 3, no. 3, p. e1806, Mar. 2008.
[11] J. Quackenbush, "Extracting biology from high-dimensional biological data," Journal of Experimental Biology, vol. 210, no. 9, pp. 1507–1517, May 2007.
[12] L. Yu and H. Liu, "Feature selection for high-dimensional data: A fast correlation-based filter solution," Machine Learning - International Workshop then Conference, vol. 20, no. 2, p. 856, 2003.
[13] S. Lee, B. Schowe, and V. Sivakumar, "Feature Selection for High-Dimensional Data with RapidMiner," 2012.
[14] N. Bouguila and D. Ziou, "High-Dimensional Unsupervised Selection and Estimation of a Finite Generalized Dirichlet Mixture Model Based on Minimum Message Length," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1716–1731.
[15] F. Petitjean, G. I. Webb, and A. E. Nicholson, "Scaling Log-Linear Analysis to High-Dimensional Data," in 2013 IEEE International Conference on Data Mining (ICDM). IEEE, pp. 597–606.
[16] A.-C. Haury, P. Gestraud, and J.-P. Vert, "The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures," PLoS ONE, vol. 6, no. 12, p. e28210, Dec. 2011.
[17] R. M. Simon, J. Subramanian, M. C. Li, and S. Menezes, "Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data," Briefings in Bioinformatics, vol. 12, no. 3, pp. 203–214, May 2011.
[18] L. Kelley and M. Scott, "The evolution of biology. A shift towards the engineering of prediction-generating tools and away from traditional research practice," EMBO Reports, vol. 9, no. 12, pp. 1163–1167, 2008.
[19] J. Wooley, A. Godzik, and I. Friedberg, "A primer on metagenomics," PLoS Computational Biology, vol. 6, no. 2, p. e1000667, 2010.
[20] H. W. Virgin and J. A. Todd, "Metagenomics and Personalized Medicine," Cell, vol. 147, no. 1, pp. 44–56, Sep. 2011.
[21] K. E. Nelson, Metagenomics of the Human Body. Springer Science & Business Media, Nov. 2010.
[22] J. Li, H. Jia, X. Cai, H. Zhong et al., P. Bork, and J. Wang, "An integrated catalog of reference genes in the human gut microbiome," Nature Biotechnology, vol. 32, no. 8, pp. 834–841, 2014.
[23] J. Qin, R. Li, J. Raes, M. Arumugam, K. Burgdorf, C. Manichanh, T. Nielsen, N. Pons, F. Levenez, and T. Yamada, "A human gut microbial gene catalogue established by metagenomic sequencing," Nature, vol. 464, no. 7285, pp. 59–65, 2010.
[24] A. L. Tarca, V. J. Carey, X.-W. Chen, R. Romero, and S. Drăghici, "Machine learning and its applications to biology," PLoS Computational Biology, vol. 3, no. 6, p. e116, 2007.
[25] P. Raghavan and C. D. Thompson, "Randomized rounding: A technique for provably good algorithms and algorithmic proofs," Combinatorica, vol. 7, no. 4, pp. 365–374, 1987.
[26] M. Berkelaar, K. Eikland, P. Notebaert et al., "lpsolve: Open source (mixed-integer) linear programming system," Eindhoven University of Technology, 2004.