Experimental Analysis of New Algorithms for
Learning Ternary Classifiers
Jean-Daniel Zucker, IRD, UMI 209, UMMISCO
IRD France Nord, F-93143, Bondy, France,
INSERM, UMR-S 872, les Cordeliers,
Nutriomique (Eq 7), Paris, F-75006 France;
Sorbonne Universités, Univ Paris 06, UMI 209
UMMISCO, F-75005, Paris, France;
Equipe MSI, IFI, Vietnam National University,
144 Xuan Thuy, Hanoi, Vietnam;
jean-daniel.zucker@ird.fr
Yann Chevaleyre, LIPN, CNRS UMR 7030, Université Paris Nord
93430 Villetaneuse, France chevaleyre@lipn.univ-paris13.fr
Dao Van Sang, IFI, Equipe MSI, IRD, UMI 209 UMMISCO, Vietnam National University, Hanoi, Vietnam, clairsang@gmail.com
Abstract—Discrete linear classifiers form a very sparse class of decision models that has proved useful to reduce overfitting in very high-dimensional learning problems. However, learning a discrete linear classifier is known to be a difficult problem: it requires finding a discrete linear model minimizing the classification error over a given sample. A ternary classifier is a classifier defined by a pair (w, r), where w is a vector in {-1, 0, +1}^n and r is a nonnegative real capturing the threshold or offset. The goal of the learning algorithm is to find a vector of weights in {-1, 0, +1}^n that minimizes the hinge loss of the linear model on the training data. This problem is NP-hard, and one approach consists in exactly solving the relaxed continuous problem and then heuristically deriving discrete solutions. A recent paper by the authors introduced a randomized rounding algorithm [1]; in this paper we propose more sophisticated algorithms that improve the generalization error. These algorithms are presented and their performance is experimentally analyzed. Our results show that this kind of compact model can address the complex problem of learning predictors from bioinformatics data, such as metagenomics data, where the sample size is much smaller than the number of attributes. The new algorithms presented improve on the state-of-the-art algorithm for learning ternary classifiers; this improvement comes at the expense of time complexity.
Index Terms—Ternary Classifier, Randomized Rounding,
Metagenomics data
I. INTRODUCTION AND MOTIVATION
Learning classifiers from very high-dimensional data has received more and more attention in the past ten years, both theoretically [2], [3], [4], [5], [6], [7] and in practice through data mining applications [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]. In biology in particular, there is a recent paradigmatic shift towards predictive tools [18]. In the meantime, Omics data (genomics, transcriptomics, proteomics, metabolomics, etc.) [19], [20], [21] have increased exponentially, and in the future medicine will rely more and more on such data to provide personalized medicine [20]. When the number of dimensions p is greater than the number of examples N (p >> N), the problem of overfitting
Fig. 1. A linear classifier vs. a ternary classifier. The latter corresponds to adding, subtracting, or ignoring features to build a decision function.
becomes more and more acute. In Omics data in particular, the number of dimensions can reach a few million. This is the case, for example, in metagenomics of the gut microflora, where a recently published catalog counts about ten million entries [22], [23]. In such a setting, the ratio N/p is so small that both feature selection and sparse learning are required to diminish the risk of overfitting [24]. The goal of this research is to find sparse models that support learning classifiers that scale to high-dimensional data such as metagenomics. Following a recent paper by the authors [1], we explore models called ternary classifiers. They are extremely sparse models, simpler than linear combinations of real weights. Such models are represented as a weighted Boolean combination (weights are in {-1, 0, +1} instead of being in R) compared to a given threshold. We explore algorithms that minimize the hinge loss of such models induced from the training data and improve the generalization error of the first rounding algorithm proposed in [1].
II. TERNARY CLASSIFIER

Let us first introduce more formally the concept of ternary classifier. Let us consider an example as a pair (x, y) where x
is an instance in R^n and y is a label in {-1, +1}. A training set is a collection (x_t, y_t)_{t=1}^m of examples. A ternary-weighted linear threshold concept c is a pair (w, r) where r is a real number capturing the threshold or offset and w is a vector in {-1, 0, +1}^n.
A loss function is a map Θ : {-1, 0, +1}^n × R × R^n × {-1, +1} → R_+ such that Θ(c; x, y) measures the discrepancy between the value predicted by c on x and the true label y. For a concept c and a training set S = (x_t, y_t)_{t=1}^m, the cumulative loss of c with respect to S is given by L(c; S) = Σ_{t=1}^m Θ(c; x_t, y_t). Based on this performance measure, the combinatorial optimization problem investigated in this study is described as follows.
Loss Minimization over {-1, 0, +1}^n
Given:
1) a target concept class C ⊆ {-1, 0, +1}^n,
2) a training set S = {(x_t, y_t)}_{t=1}^m,
3) a loss function Θ.
Find: a concept c ∈ C that minimizes L(c; S).
Recall that the zero-one loss is the loss function defined by Θ_01(c; x, y) = 1 if sgn(⟨w, x⟩ − r) ≠ y and Θ_01(c; x, y) = 0 otherwise. Because this loss function is known to be hard to optimize, we shall concentrate on the hinge loss, a well-known surrogate of the zero-one loss, defined as follows: Θ_γ(c; x, y) = (1/γ) max[0, γ − y(⟨w, x⟩ − r)], where γ ∈ R_+.
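To make the loss concrete, here is a minimal R sketch of the cumulative hinge loss of a ternary classifier (w, r) on a sample; the function name and the default γ = 1 are our own choices, not part of the authors' terDA package.

```r
# Cumulative hinge loss L(c; S) of a ternary classifier c = (w, r).
# X: m x n matrix of instances, y: labels in {-1, +1},
# w: weights in {-1, 0, +1}^n, r: real offset, gamma: margin parameter.
hinge_loss <- function(X, y, w, r, gamma = 1) {
  margins <- y * (X %*% w - r)            # y_t (<w, x_t> - r) for each example
  sum(pmax(0, gamma - margins)) / gamma   # sum_t Theta_gamma(c; x_t, y_t)
}
```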
Finally, let us represent this optimization problem as a linear classification problem.

Linear formulation: in addition to the variables w_1, ..., w_n and b, we need m slack variables ξ_1, ..., ξ_m. The formulation is close to the standard linear SVM formulation, except that the w_i's are bounded:

  minimize    Σ_{i∈[m]} ξ_i
  subject to  −1 ≤ w_j ≤ 1                  for all j ∈ [n]
              y_i(w · x_i + b) ≥ 1 − ξ_i    for all i ∈ [m]
              ξ_i ≥ 0                       for all i ∈ [m]
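For illustration, the relaxed problem can be fed to an off-the-shelf LP solver. The sketch below uses the lpSolveAPI interface to lp_solve; the interface choice, the column layout and the function name are assumptions of ours, not the authors' terDA implementation (which relies on lpSolve [26]).

```r
# Sketch of the relaxed LP above with the lpSolveAPI interface to lp_solve.
# Column layout (an assumption of this sketch): w_1..w_n, then b, then xi_1..xi_m.
library(lpSolveAPI)

solve_relaxed <- function(X, y) {
  n <- ncol(X); m <- nrow(X)
  nvar <- n + 1 + m
  lp <- make.lp(0, nvar)                          # empty model with nvar columns (minimization)
  set.objfn(lp, c(rep(0, n + 1), rep(1, m)))      # minimize the sum of the slacks xi_i
  for (i in seq_len(m)) {
    slack <- rep(0, m); slack[i] <- 1
    # y_i (w . x_i + b) + xi_i >= 1
    add.constraint(lp, c(y[i] * X[i, ], y[i], slack), ">=", 1)
  }
  set.bounds(lp,
             lower = c(rep(-1, n), -Inf, rep(0, m)),
             upper = c(rep( 1, n),  Inf, rep(Inf, m)),
             columns = seq_len(nvar))             # -1 <= w_j <= 1, b free, xi_i >= 0
  solve(lp)
  sol <- get.variables(lp)
  list(w = sol[seq_len(n)], b = sol[n + 1])
}
```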
Randomized Rounding Method (RR method)
A heuristic approach to solve the optimization problem described above consists in solving the relaxed problem, where the coefficients are real (in [−1, 1]), and then performing a rounding on randomly selected real coefficients. Finally, it is possible to compute the error rate of the obtained linear model. The basic idea of using the probabilistic method to convert an optimal solution of a relaxation of a problem into an approximately optimal solution to the original problem is standard in computer science and operations research [25]. Let us first recall the simple algorithm, called Randomized Round, used to round a real number belonging to [−1, 1] (see Fig. 2). By repeating the rounding algorithm (see Fig. 2) a number of times corresponding to the number of dimensions, a model with all its coefficients in the set {-1, 0, +1} is obtained.
Algorithm Randomized Rounding (Round)
INPUT: a real number α ∈ [−1, 1]
OUTPUT: an integer in {−1, 0, 1}
if α ≥ 0 then
  Draw randomly u ∈ {0, 1} such that Pr(u = 1) = α
  return u
else
  return −Round(−α)
end if

Fig. 2. Randomized Rounding method (RR algorithm).

Algorithm Round-k-weights
INPUT: a vector w ∈ [−1, 1]^n, an integer k
OPTIONAL INPUT: a permutation σ over {1..n} (by default, σ is the identity)
OUTPUT: a vector in {−1, 0, 1}^n
for j = 1..k do
  w_σ(j) ← Round(w_σ(j))
end for
return w

Fig. 3. Randomized Rounding of k weights (Round-k-weights).
This algorithm, which randomly rounds k weights, is called "Round-k-weights" (see Fig. 3).
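A minimal R sketch of these two routines, mirroring the pseudocode of Fig. 2 and Fig. 3 (the function names are ours):

```r
# Randomized rounding of a single real weight alpha in [-1, 1]:
# returns 1 with probability alpha when alpha >= 0, 0 otherwise; symmetric for alpha < 0.
round_weight <- function(alpha) {
  if (alpha >= 0) rbinom(1, size = 1, prob = alpha) else -round_weight(-alpha)
}

# Round the first k weights of w, taken in the order given by the permutation sigma.
round_k_weights <- function(w, k, sigma = seq_along(w)) {
  for (j in seq_len(k)) {
    w[sigma[j]] <- round_weight(w[sigma[j]])
  }
  w
}
```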
III. NEW ALGORITHMS TO LEARN TERNARY CLASSIFIERS

In this section, we present three rounding algorithms more elaborate than the initial one provided by [1] and called Round-k-weights:
1) RR-all-weights: this algorithm explores different randomized roundings of all weights and returns the best one (see Fig. 4).
2) RRS-k-rand-weights: this algorithm selects k random weights and then, after rounding, calls the solver again on the problem where the k rounded weights are forced to their integer value (see Fig. 5).
3) RRS-k-best-weights: this algorithm selects the k "best" weights and then, after rounding, calls the solver again on the problem where the k rounded weights are forced to their integer value (see Fig. 6).
A. Rounding over all weights (RR-all-weights)

As described above, we solve the problem in the case of real coefficients in the interval [−1, 1]. We thus obtain a vector of real weights w. Next, we run M times (this number M is a meta-parameter of all the algorithms presented; we used M = 100 in our experiments) the Round-k-weights algorithm with k = n (i.e., for all weights). The output of the Randomized Rounding algorithm on this vector w is a vector of discrete coefficients in the set {-1, 0, +1}. Next, we compute the error rate of each model based on the respective values of the vector. Finally, we select the vector of coefficients w (out of the M vectors) which gives the best error rate.
1: INPUT: a real vector of coefficients w ∈ [−1, +1]^n, a number of iterations M
2: OUTPUT: an integer vector w̄ ∈ {−1, 0, +1}^n minimizing the hinge loss
3: List ← {}
4: for t ← 1..M do
5:   w̄ ← Round-k-weights(w, k = n)
6:   add w̄ to List
7: end for
8: Compute the hinge loss of all vectors in List
9: return the best vector w̄

Fig. 4. Rounding over all weights (RR-all-weights).
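A compact R sketch of RR-all-weights, reusing the hinge_loss and round_k_weights helpers sketched above (again, an illustration rather than the authors' code):

```r
# RR-all-weights: draw M full roundings of the relaxed weight vector w
# and keep the candidate with the smallest hinge loss on the training set.
rr_all_weights <- function(X, y, w, r, M = 100) {
  best <- NULL
  best_loss <- Inf
  for (t in seq_len(M)) {
    cand <- round_k_weights(w, k = length(w))    # round all n weights at once
    loss <- hinge_loss(X, y, cand, r)
    if (loss < best_loss) { best <- cand; best_loss <- loss }
  }
  best
}
```

On top of a relaxed solution, this function would typically be called once per dataset; M = 100 matches the value used in the experiments below.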
1: INPUT: a real vector of coefficients w ∈ [−1, +1]^n, an integer k, a number of iterations M
2: OUTPUT: an integer vector in {−1, 0, 1}^n minimizing the hinge loss
3: Call Solver to initialize the vector w
4: S ← {1..n}
5: for t ← 1..M do
6:   while S ≠ ∅ do
7:     Let T be a set of k integers drawn from S without replacement
8:     S ← S \ T
9:     for j ∈ T do
10:      w_j ← Round(w_j) and force w_j to this integer value
11:    end for
12:    Call Solver again to recompute the real coefficients of the remaining attributes (the ones that have not yet been forced)
13:  end while
14: end for
15: return w, the best among the M vectors

Fig. 5. Repeat Round and Solve over k random weights (RRS-k-rand-weights).
B. Combine Rounding and Solving over k random weights (RRS-k-rand-weights)

In this algorithm, the linear model solver is called many times instead of only once as in the previous algorithm. Let us first consider the real coefficient vector w obtained from the linear model. First of all, the real coefficients that happen to be either 0, -1 or 1 are stored in the object wInt and will not be changed in any subsequent step of the algorithm. This set of integer weights wInt may be empty, depending on the problem. The coefficients that are not yet integer are stored in a set wNInt, which is processed in the following way: consider a permutation σ over the set wNInt and a parameter k; we run a loop with i going through the set {σ(j), j ∈ wNInt}.

In this loop, for each i taken from 1 to ‖wNInt‖, the value wNInt[σ(i)] of w is rounded by Randomized Rounding (RR). Then the index of the element wNInt[σ(i)] and the value RR(wNInt[σ(i)]) are added to the object wInt. Then,
1: INPUT: a real vector of coefficients w ∈ [−1, +1]^n, an integer k, a number of iterations M
2: OUTPUT: an integer vector in {−1, 0, +1}^n minimizing the hinge loss
3: Call Solver to obtain the initial vector w
4: Find in w all the coefficients that are not yet integer and save them in the set wNInt
5: for t ← 1..M do
6:   Take a random permutation σ on wNInt
7:   for i ← σ(1)..σ(‖wNInt‖) do
8:     Find the k "best" attributes for rounding
9:     Force rounding on these k best attributes
10:    Call Solver again to recalculate the real coefficients of the remaining attributes (not yet forced)
11:  end for
12: end for
13: return w, the best vector among the M

Fig. 6. Repeat Round and Solve over the k best weights (RRS-k-best-weights).
depending on whether i is a multiple of k or not, the solver is called. If the index i of the loop is a multiple of k, the solver is called to find real solutions to the initial linear model, with the added conditions that all coefficients indexed in wInt are assigned to their integer values in wInt.

At the end of the loop, all the weights in w have been rounded by RR. We can thus compute the hinge loss. This sequence is then repeated with M different permutations σ. The output of the algorithm is the vector of coefficients that best minimizes the hinge loss (see Fig. 5).
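The round-then-resolve loop can be sketched in R as follows. It assumes a hypothetical helper solve_relaxed_fixed(X, y, fixed_idx, fixed_val) that re-solves the linear program of Section II with the listed coordinates clamped to their values, and reuses round_weight and hinge_loss from the earlier sketches; the sign convention r = -b is also an assumption of ours.

```r
# RRS-k-rand-weights (sketch): alternately round k randomly chosen coordinates
# and re-solve the relaxed LP on the coordinates that are not fixed yet.
rrs_k_rand_weights <- function(X, y, k, M = 100) {
  init <- solve_relaxed(X, y)                     # initial relaxed solution (w, b)
  best <- NULL; best_loss <- Inf
  for (t in seq_len(M)) {
    w <- init$w; b <- init$b
    remaining <- which(!(w %in% c(-1, 0, 1)))     # wNInt: indices not yet integer
    while (length(remaining) > 0) {
      drawn <- remaining[sample.int(length(remaining), min(k, length(remaining)))]
      w[drawn] <- sapply(w[drawn], round_weight)  # round and freeze k random weights
      remaining <- setdiff(remaining, drawn)
      fixed <- setdiff(seq_along(w), remaining)   # wInt: all frozen integer weights
      refit <- solve_relaxed_fixed(X, y, fixed_idx = fixed, fixed_val = w[fixed])
      w[remaining] <- refit$w[remaining]
      b <- refit$b
    }
    loss <- hinge_loss(X, y, w, r = -b)           # offset convention r = -b (assumption)
    if (loss < best_loss) { best <- list(w = w, b = b); best_loss <- loss }
  }
  best
}
```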
C. Repeat Round and Solve over the Best k Weights (RRS-k-best-weights)

In this third algorithm, instead of choosing sets of k random weights to be rounded before solving the problem again, we search at each step for the k best coefficients on which to apply rounding. Here, the best coefficients are those whose distance to the nearest value among -1, 0 and 1 is the smallest. The intuition behind this algorithm is that the weights closest to integer values ought to be chosen first, as they have a priori (this is just a heuristic) a greater chance of belonging to the optimal solution.
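The selection step can be written in a few lines of R (our own sketch); the rest of the algorithm is the same as RRS-k-rand-weights, with this selection replacing the random draw.

```r
# Indices of the k coefficients closest to an integer value in {-1, 0, +1}.
k_best_indices <- function(w, k, candidates = seq_along(w)) {
  dist_to_int <- sapply(w[candidates],
                        function(v) min(abs(v - c(-1, 0, 1))))  # distance to nearest ternary value
  candidates[order(dist_to_int)[seq_len(min(k, length(candidates)))]]
}
```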
IV. EXPERIMENTS

We tested the empirical performance of our new algorithms against RR-all-weights by conducting experiments on several metagenomics databases used as benchmarks. All algorithms were written in R, and the package used to solve linear problems is lpSolve [26]. The performance of lpSolve cannot compete with the CPLEX solver from IBM, but we chose lpSolve because it is open source and integrates smoothly with R. All the algorithms in this paper have been integrated in an R package called terDA, which will be made available to the community in the near future. All tests were executed on a PC with an Intel Core i5 processor and 8 GB of RAM. The goal of these experiments is to evaluate whether the different ideas for improving the RR method actually improve the generalization error when learning ternary classifiers, and at what cost in terms of CPU time.
A. Metagenomics Datasets

In this paper, we used public metagenomics data from the European project METAHIT. These data correspond to next-generation sequencing of the gut microbiota. The microbiota is a community of microorganisms (viruses, bacteria, fungi, archaea) living in our gut. Two problems are considered: the first one is to distinguish the status of patients (obese vs. lean) based on their microbiota, and the second one is to identify which microbiota signatures support predicting whether an individual will have a high vs. low gene count. In both cases there is an epidemiological interest in being able to devise microbiota signatures to predict patient phenotypes. The data are described in Table I.

In both cases we consider 172 patients and thousands of dimensions corresponding to counts of particular metaspecies of the microbiota that have been identified. The complete data include 3.3 million counts per patient, but we have not used such raw data.
B. Results

We performed our tests following 1-fold and 10-fold cross-validation procedures repeated ten times. Here 1-fold corresponds to the case where the data are used both to learn and to test the model; the error measured is thus the empirical error.
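For reference, a minimal way to build such repeated 10-fold partitions in R (our own sketch; not necessarily the exact protocol used by the authors):

```r
# Fold labels for k-fold cross-validation repeated several times.
repeated_folds <- function(n_examples, n_folds = 10, n_repeats = 10) {
  lapply(seq_len(n_repeats),
         function(r) sample(rep_len(seq_len(n_folds), n_examples)))
}
```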
Table II shows all the results in terms of empirical error obtained for the three algorithms. In the case of learning ternary classifiers, it is interesting to look at the empirical error because the model is so sparse that it is likely not to have a 0% empirical error.

The errors in generalization for all algorithms and the four datasets (10-fold cross-validation) are given in Table III.
To further compare the algorithms with the RR method, we also explored the impact on the error rate when learning from DatasetMeta1 while varying two parameters: the number of roundings M and the parameter k. The results are shown in Figure 7 and Figure 8, respectively.

Figure 9 shows the evolution, as a function of k, of the generalization error rate estimated using 10-fold cross-validation.
V. DISCUSSION AND CONCLUSIONS

The results of Table II show that RRS-k-rand-weights and RRS-k-best-weights significantly decrease the empirical error. Nevertheless, this could simply mean that RR-all-weights overfits less, which would explain the result. Table III shows that, on the four datasets, RRS-k-rand-weights and RRS-k-best-weights also improve the generalization error. The improvement is very significant on DatasetMeta1, DatasetMeta2 and DatasetMeta4, and less significant on DatasetMeta3. It is also clear that RRS-k-best-weights outperforms RRS-k-rand-weights when N/p is smaller than 1,
Fig. 7. Empirical error rate vs. number of roundings. Here k is set to 5 for RRS-k-rand-weights and k = 2 for RRS-k-best-weights; the number of roundings takes values from 1 to M = 100. The dataset is DatasetMeta1.
Fig. 8. Empirical error rate vs. parameter k.
Fig. 9. Generalization error rate vs. parameter k in 10-fold cross-validation.
which is the case for DatasetMeta3 and DatasetMeta4. The improvements come at the expense of CPU time, as the solver is called several times; the time complexity is thus an order of magnitude higher than that of the RR-all-weights algorithm. The figures showing the influence of the number of roundings are not surprising: they show that the empirical error rate stabilizes as M, the number of iterations, increases. The parameter k also has an impact on the empirical error rate: as it grows, RRS-k-rand-weights becomes worse whereas RRS-k-best-weights is less influenced.

TABLE I
Metagenomics datasets used, respectively with N/p greater and lower than 1. The numbers in parentheses give the number of examples for each class.
TABLE II
Table of the empirical error of each algorithm for each dataset, averaged over ten executions.
TABLE III
Table of the error in generalization (10-fold) of each algorithm for each dataset, averaged over ten executions.
Overall, these experiments suggest that there is room for improving the original RR-all-weights algorithm that supports learning ternary classifiers. This paper is experimental, and its main result is first to propose two original algorithms to learn ternary classifiers, and second to suggest that RRS-k-best-weights is the most promising improvement made so far to the original RR-all-weights algorithm, as it gives better empirical and generalization errors and is robust with respect to various values of k and M. Future work includes a theoretical analysis of the algorithms and a thorough analysis of RRS-k-best-weights on other benchmark data.
A. Acknowledgments

The authors would like to express their thanks to Dr. Edi Prifti (INRA, France) for his help in preparing the benchmark datasets. We would also like to thank the anonymous reviewers. The first author is partially supported by the METACARDIS project, funded by the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement HEALTH-F4-2012-305312.
REFERENCES
[1] Y. Chevaleyre, F. Koriche, and J.-D. Zucker, "Rounding methods for discrete linear classification," JMLR WCP, vol. 28, no. 1, pp. 651–659, 2013.
[2] J. Fan and R. Samworth, "Ultrahigh dimensional feature selection: beyond the linear model," Journal of Machine Learning Research, 2009.
[3] T. Hastie and R. Tibshirani, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2009.
[4] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, with Applications in R. Springer Science & Business, Jun. 2013.
[5] J. Kogan, "Introduction to Clustering Large and High-Dimensional Data," pp. 1–222, Dec. 2007.
[6] A. Kalousis, J. Prados, and M. Hilario, "Stability of feature selection algorithms: a study on high-dimensional spaces," Knowledge and Information Systems, vol. 12, no. 1, pp. 95–116, 2007.
[7] L. Parsons, E. Haque, and H. Liu, "Subspace clustering for high dimensional data: a review," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 90–105, 2004.
[8] J. Oh and J. Gao, "A kernel-based approach for detecting outliers of high-dimensional biological data," BMC Bioinformatics, vol. 10, no. Suppl 4, p. S7, 2009.
[9] B. Hanczar, J. Hua, and E. R. Dougherty, "Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings," EURASIP Journal on Bioinformatics and Systems Biology, vol. 2007, pp. 1–13, 2007.
[10] S. Rojas-Galeano, E. Hsieh, D. Agranoff, S. Krishna, and D. Fernandez-Reyes, "Estimation of Relevant Variables on High-Dimensional Biological Patterns Using Iterated Weighted Kernel Functions," PLoS ONE, vol. 3, no. 3, p. e1806, Mar. 2008.
[11] J. Quackenbush, "Extracting biology from high-dimensional biological data," Journal of Experimental Biology, vol. 210, no. 9, pp. 1507–1517, May 2007.
[12] L. Yu and H. Liu, "Feature selection for high-dimensional data: A fast correlation-based filter solution," Machine Learning - International Workshop then Conference, vol. 20, no. 2, p. 856, 2003.
[13] S. Lee, B. Schowe, and V. Sivakumar, "Feature Selection for High-Dimensional Data with RapidMiner," 2012.
[14] N. Bouguila and D. Ziou, "High-Dimensional Unsupervised Selection and Estimation of a Finite Generalized Dirichlet Mixture Model Based on Minimum Message Length," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1716–1731.
[15] F. Petitjean, G. I. Webb, and A. E. Nicholson, "Scaling Log-Linear Analysis to High-Dimensional Data," in 2013 IEEE International Conference on Data Mining (ICDM). IEEE, pp. 597–606.
[16] A.-C. Haury, P. Gestraud, and J.-P. Vert, "The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures," PLoS ONE, vol. 6, no. 12, p. e28210, Dec. 2011.
[17] R. M. Simon, J. Subramanian, M. C. Li, and S. Menezes, "Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data," Briefings in Bioinformatics, vol. 12, no. 3, pp. 203–214, May 2011.
[18] L. Kelley and M. Scott, "The evolution of biology. A shift towards the engineering of prediction-generating tools and away from traditional research practice," EMBO Reports, vol. 9, no. 12, pp. 1163–1167, 2008.
[19] J. Wooley, A. Godzik, and I. Friedberg, "A primer on metagenomics," PLoS Computational Biology, vol. 6, no. 2, p. e1000667, 2010.
[20] H. W. Virgin and J. A. Todd, "Metagenomics and Personalized Medicine," Cell, vol. 147, no. 1, pp. 44–56, Sep. 2011.
[21] K. E. Nelson, Metagenomics of the Human Body. Springer Science & Business Media, Nov. 2010.
[22] J. Li, H. Jia, X. Cai, H. Zhong et al., P. Bork, and J. Wang, "An integrated catalog of reference genes in the human gut microbiome," Nature Biotechnology, vol. 32, no. 8, pp. 834–841, 2014.
[23] J. Qin, R. Li, J. Raes, M. Arumugam, K. Burgdorf, C. Manichanh, T. Nielsen, N. Pons, F. Levenez, and T. Yamada, "A human gut microbial gene catalogue established by metagenomic sequencing," Nature, vol. 464, no. 7285, pp. 59–65, 2010.
[24] A. L. Tarca, V. J. Carey, X.-W. Chen, R. Romero, and S. Drăghici, "Machine learning and its applications to biology," PLoS Computational Biology, vol. 3, no. 6, p. e116, 2007.
[25] P. Raghavan and C. D. Thompson, "Randomized rounding: A technique for provably good algorithms and algorithmic proofs," Combinatorica, vol. 7, no. 4, pp. 365–374, 1987.
[26] M. Berkelaar, K. Eikland, P. Notebaert et al., "lpsolve: Open source (mixed-integer) linear programming system," Eindhoven University of Technology, 2004.