DSpace at VNU: GA-SVM: A genetic algorithm for improving gene regulatory activity prediction


GA-SVM: A genetic algorithm for improving gene regulatory activity prediction

Dong Do Duc∗, Tri-Thanh Le†, Trung-Nghia Vu‡, Huy Q Dinh§, Hoang Xuan Huan¶


∗Institute of Information Technology, Vietnam National University, Hanoi, 144 Xuan Thuy, Hanoi, Vietnam

†Department of Information Technology, Vietnam Maritime University, Hai Phong, Vietnam

‡Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium

§Center for Integrative Bioinformatics, Max F Perutz Laboratories, Vienna, Dr Bohrgasse 9, 1030 Vienna, Austria

and Gregor Mendel Institute of Molecular Plant Biology, Austrian Academy of Sciences, Dr Bohrgasse 3, 1030 Vienna, Austria

¶University of Engineering and Technology (UET), Vietnam National University, Hanoi, 144 Xuan Thuy, Hanoi, Vietnam

Email:{dongdoduc, huanhx}@vnu.edu.vn, thanhletri@vimaru.edu.vn, TrungNghia.Vu@ua.ac.be, huy.dinh@univie.ac.at

Abstract—Gene regulatory activity prediction is an important step toward understanding the significant factors for gene regulation in biology. The advent of recent sequencing technologies allows us to deal with this task efficiently. Among the methods applied to it, Support Vector Machine (SVM) has been used successfully, reaching more than 80% accuracy in predicting gene regulatory activity in Drosophila embryonic development. In this paper, we introduce a metaheuristic based on a genetic algorithm (GA) to select the best parameters for regulatory prediction from transcription factor binding profiles. Our approach improves accuracy by more than 10% compared to the traditional grid search. The improvements are also significantly supported by biological experimental data. Thus, the proposed method helps to boost not only the prediction performance but also the potential biological insights.

I. INTRODUCTION & RELATED WORKS

Since its double-helix structure was discovered in 1953, the DNA (deoxyribonucleic acid) sequence, which simply consists of four letters (Adenine, Cytosine, Guanine, Thymine), has been considered the natural blueprint of organism development. The genome itself contains a variety of information encoded in a long sequence of DNA letters. One interesting example is the gene-regulatory information that shapes the different gene expression patterns. An enhancer, or cis-regulatory module (CRM), is a DNA fragment that carries the information needed to regulate its associated genes. It contains the binding sites for the specific transcription factor (TF) proteins corresponding to a certain regulatory activity. Understanding CRM activity and its requirements is therefore a fundamental problem in biology [1]. The authors of [2] proposed a simple model in which CRM activity depends on the respective TF bindings, i.e., either the elimination of a TF or the disruption of its binding changes the CRM function. This model has been supported by several pieces of small-scale evidence from ChIP (Chromatin Immunoprecipitation) experiments combined with Polymerase Chain Reaction amplification.

Recently, one of the first genome-wide experiments [3] was successfully carried out using microarray technology in the model organism Drosophila melanogaster. This work used ChIP on tiling microarrays to obtain the first high-resolution atlas of mesodermal cis-regulatory modules. The data provided strong experimental support for the model mentioned above. In addition, the authors used transcription factor binding profiles measured by ChIP signals [4] to predict the expression patterns of the genes regulated by the respective enhancers. Interestingly, the prediction performance was quite good; more importantly, they predicted some novel enhancers with highly accurate expression categories. Thus, learning the regulatory code that drives different expression patterns by computational methods is a very attractive branch of computational biology [1].

To predict the expression patterns of genes, the authors of [3] applied a traditional grid search to optimize the parameters of a radial-kernel Support Vector Machine (SVM, [5]) and achieved up to 82% accuracy under the leave-one-out cross-validation (LOOCV) framework. The cost C and γ are the two parameters of the radial-kernel SVM. The former determines the trade-off between minimizing the fitting error and maximizing the classification margin, whereas the latter affects the efficiency of the kernel function, especially for high-dimensional data. Parameter optimization plays an important role in the prediction performance of SVM, especially when using the radial kernel [6]. Metaheuristic approaches (e.g., Genetic Algorithm and Ant Colony Optimization) have been successfully applied to optimize SVM parameters ([6], [7]) in different problem contexts. The grid search used by the authors of [3] was a quick method that helps to approximate efficient parameters for SVM prediction. However, this method explored only a sparse part of the parameter space. As a consequence, three out of five test cases achieved only 70% accuracy on average, and just one case reached more than 80%. Notably, those three cases were the ones in which the expression pattern corresponds uniquely to one enhancer activity. Thus, more intensive methods are needed to search further for the best parameters, particularly for very strict datasets in which the available information might not be enough for standard prediction.


We introduce a genetic algorithm approach to improve the performance of enhancer activity prediction. Using GA, the method searches the parameter space more intensively than the traditional grid search in order to find better parameters for the prediction. Consequently, the proposed approach outperforms the previous method [3] and obtains more than 80% LOOCV accuracy on average across all cases. More importantly, our results are significantly better when predicting the regulatory activity of novel enhancers with in vivo validated data. Our study demonstrates the need for parameter selection and optimization in SVM prediction on this specific biological dataset.

II. BIOLOGICAL DATA AND PREDICTION PROBLEM

A. Transcriptional binding landscapes in embryonic Drosophila development

Drosophila is a model organism for embryonic development research in biology because of the well-established time-course experiments for several important transcription factors such as Twist or Tinman [3]. It is also well known that, at the very early time points of cell development, only DNA information might exist. This allows us to investigate the importance of DNA information (e.g., DNA motifs) with respect to the developmental regulation of the cell. ChIP is a method to selectively enrich for DNA sequences bound by a particular protein. Recently, this technology has been used to identify active CRMs systematically at whole-genome scale, using either tiling microarrays (ChIP-chip) or deep sequencing (ChIP-Seq). Using ChIP-chip with a tiling array, the authors of [3] obtained transcription factor binding data for five important mesodermal factors (Twist, Tinman, Mef2, Bagpipe, and Biniou) at 5 crucial time points during embryogenesis (Fig. 1).

Subsequently, each CRM is assigned one expression category (mesoderm, somatic muscle, or visceral muscle (Fig. 1), referred to as Meso, SM, and VM from here on) by using a well-known database (e.g., the REDfly database [8]) consisting of 310 CRMs. In this dataset, a number of CRMs belong to ambiguous expression categories, i.e., their patterns are observed in both Meso and SM (called Meso+SM), or in both VM and SM (called VM+SM). In addition, the authors of [3] identified in vivo the expression category of 35 de novo CRMs that are not present in the REDfly database, determining the expression pattern of those novel CRMs with transgenic reporter assays. This is very important because it allows one to test a prediction approach by predicting the activities of those novel CRMs using the known REDfly CRMs for training.

B. Spatio-temporal cis-regulatory activity prediction in machine learning context

The researchers in [3] applied a Support Vector Machine to establish a framework for predicting transcriptional regulatory activity, i.e., expression category, from the binding profiles of the corresponding transcription factors. The prediction helped to indicate the potential of determining the specific transcription factors, and the degree to which they influence the expression patterns they regulate. In the machine learning context, each CRM was represented by a data object of at most 15 features, which were combinations of transcription factors and time points. The SVM method was applied to predict the expression pattern of each CRM. In detail, a binary SVM was used to predict the group of an enhancer from the binding of 5 transcription factors at 5 embryonic development time points. The groups were mesoderm, somatic muscle, and visceral muscle (Meso, SM, VM). The combinations Meso+SM and VM+SM were also considered because they are naturally observed in the expression data.

Fig. 1. Regulatory activity prediction based on transcription factor binding measured by ChIP-chip peak heights. The peak height indicates the ChIP binding of the respective TF at a specific time point. In this figure, Twist (Twi) and Tin are at early time points (5-7h, 8-9h, 10-11h), Bin is at late time points (10-11h, 12-13h, 13-15h), whereas Bap is only at 10-11h and Mef2 is at all time points. The binding profile is then used to predict the group of the enhancer activity. The three groups, shown on the right side, are mesoderm, somatic muscle, and visceral muscle. A part of the figure is from [1].

III. METHODS

A. SVM prediction of regulatory activity based on transcriptional factor binding profiles

An SVM constructs a hyperplane in an N-dimensional space that optimally separates the data into two categories. Given an individual enhancer and its corresponding binding profiles from ChIP-chip data, a binary SVM is used to predict its transcriptional category. An SVM model is built to learn how to classify the enhancer x into two classes, e.g., mesodermal or not mesodermal, from a training set of m enhancers with known activities. The SVM classifier is based on the following decision function: $f(x) = \sum_{i=1}^{m} \lambda_i K(x_i, x)$, where K is a kernel function and the $\lambda_i$ are coefficients learned during the training process. Usually, the linear kernel function is used for simple data and the radial kernel function for more complex cases.
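The decision function above can be illustrated with a minimal Python sketch. It assumes the standard radial kernel $K(u, v) = \exp(-\gamma \lVert u - v \rVert^2)$, which the text names but does not write out, and it assumes that the sign of each coefficient carries the class label. The function names, the toy feature vectors, and the coefficient values are hypothetical and are not taken from [3].

```python
import math

def rbf_kernel(u, v, gamma):
    """Radial kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def decision_value(x, train_points, lambdas, gamma):
    """f(x) = sum_i lambda_i * K(x_i, x), as in the text (no bias term)."""
    return sum(lam * rbf_kernel(x_i, x, gamma)
               for lam, x_i in zip(lambdas, train_points))

def predict(x, train_points, lambdas, gamma):
    """Return +1 (e.g. mesodermal) if f(x) >= 0, otherwise -1."""
    return 1 if decision_value(x, train_points, lambdas, gamma) >= 0 else -1

# Toy usage: two training enhancers with hypothetical 2-feature binding profiles.
train_points = [(0.0, 0.0), (1.0, 1.0)]
lambdas = [1.0, -1.0]  # hypothetical learned coefficients
print(predict((0.1, 0.0), train_points, lambdas, gamma=1.0))
```

A larger γ makes each training enhancer influence only its close neighbours in feature space, which is one reason the radial kernel is sensitive to this parameter.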

SVM is a parameter-sensitive machine learning classification method, particularly with the radial kernel function. The researchers in [3] used fine-grained grid searching to achieve the optimal result, in which C and γ were set as integer values ranging from $10^{-2}$ to $10^{5}$ and from $10^{-6}$ to $10^{2}$, respectively. This resulted in an average SVM accuracy of 78% with LOOCV. In this paper, we investigate the optimization of the two important parameters, C and γ, by using a Genetic Algorithm. The GA searches the parameter space more finely, and so better results are expected.
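To make the contrast with grid search concrete, the following Python sketch enumerates a grid over the two ranges given above. It assumes that the grid points are integer powers of ten, which is one reading of the description and is not confirmed by the text. The scoring function is a hypothetical stand-in for LOOCV accuracy whose optimum is placed between grid points on purpose.

```python
import itertools
import math

def log_grid(low_exp, high_exp):
    """Powers of ten from 10**low_exp to 10**high_exp (integer exponents)."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]

def grid_search(score, c_values, gamma_values):
    """Return (best_score, best_C, best_gamma) over the Cartesian grid."""
    best = None
    for c, g in itertools.product(c_values, gamma_values):
        s = score(c, g)
        if best is None or s > best[0]:
            best = (s, c, g)
    return best

c_grid = log_grid(-2, 5)      # assumed grid for C
gamma_grid = log_grid(-6, 2)  # assumed grid for gamma

def toy_score(c, gamma):
    """Hypothetical score, peaked at C = 10**1.3 and gamma = 10**-2.6."""
    return -((math.log10(c) - 1.3) ** 2 + (math.log10(gamma) + 2.6) ** 2)

best_score, best_c, best_gamma = grid_search(toy_score, c_grid, gamma_grid)
print(best_c, best_gamma)
```

Under these assumptions the grid holds only 72 candidate pairs, and the search can return nothing better than the grid point nearest to the optimum. That gap is what a finer search is meant to close.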


B. Genetic Algorithm

The GA works as follows (see the pseudocode in Algorithm 1). The t-th generation, called P(t), consists of N solutions, i.e., N sets of parameters (C, γ). Each solution is evaluated with a fitness function, here an AUC value. The next, (t+1)-th, generation is created by selecting the best individuals via a lottery cycle procedure and by applying GA operators, including mutation and crossover. More details about GA can be found in [9]. The construction of the chromosome and the fitness function of the GA for our problem are discussed in the next section.

Algorithm 1: GA algorithm to improve the prediction
Data: An enhancer set with known regulation activity
Output: The best solution
begin
    t ← 0 (generation index);
    Initialize the generation P(t);
    Evaluate P(t);
    while termination condition is not met do
        t ← t + 1 (next generation);
        Select new generation Q(t) from P(t − 1);
        Create P(t) from Q(t) by GA operators;
        Evaluate P(t) and select the best individuals;
    Output the best solution;
end
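The loop of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the genalg implementation: fitness-proportional (roulette-wheel) selection, one-point crossover, bit-flip mutation, a fixed number of generations as the termination condition, and tracking of the best individual seen so far are all choices made here for the example. The toy fitness simply counts one bits, whereas the paper uses an AUC value.

```python
import random

def roulette_select(population, scores, k, rng):
    """Draw k parents with probability proportional to fitness (scores >= 0)."""
    if sum(scores) <= 0:
        return [rng.choice(population) for _ in range(k)]
    return rng.choices(population, weights=scores, k=k)

def one_point_crossover(a, b, rng):
    """Cut both parents at the same position and swap their tails."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def run_ga(fitness_fn, n_bits, pop_size=20, generations=30,
           mutation_rate=0.02, seed=0):
    """Follow the loop of Algorithm 1; pop_size is assumed to be even."""
    rng = random.Random(seed)
    # Initialize and evaluate P(0).
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    scores = [fitness_fn(ind) for ind in population]
    best_score, best = max(zip(scores, population))
    for _ in range(generations):  # assumed termination: fixed budget
        # Select Q(t) from P(t - 1).
        parents = roulette_select(population, scores, pop_size, rng)
        # Create P(t) from Q(t) by crossover and mutation.
        population = []
        for i in range(0, pop_size, 2):
            for child in one_point_crossover(parents[i], parents[i + 1], rng):
                population.append(mutate(child, mutation_rate, rng))
        # Evaluate P(t) and remember the best individual seen so far.
        scores = [fitness_fn(ind) for ind in population]
        gen_score, gen_best = max(zip(scores, population))
        if gen_score > best_score:
            best_score, best = gen_score, gen_best
    return best, best_score

# Toy usage: maximize the number of one bits in a 16-bit vector.
best, best_score = run_ga(sum, n_bits=16)
print(best, best_score)
```

In the setting of this paper, `fitness_fn` would decode a bit vector into (C, γ), train the SVM, and return the AUC.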

The standard implementation of the GA with default parameters is taken from the R package genalg¹.

C. Fitness function and representation of parameters in GA

The main issue in GA is how to represent the problem as a chromosome. In our method, the two parameters C and γ are encoded in a chromosome as a binary vector. In detail, each chromosome consists of a 51-bit binary vector that represents the real values of the parameters. The first 24 bits are reserved for C and the remaining 27 bits represent the value of γ. Figure 2 gives an example of a chromosome and of the mutation and crossover operations. In mutation, the zero bit in the dark cell of a chromosome is changed to a one bit in the resulting chromosome. In crossover, two chromosomes are split at the same position, and then the heads and tails of the two chromosomes are exchanged.

At each step, the GA evolves the population in silico and selects the best individuals for the next generation according to the fitness function, which is defined as the Area Under the Curve (AUC) value computed with [10]. At the last stage, the best binary vectors are transformed back to real-valued parameters, normalized by a factor of $10^2$ (for C) and $10^6$ (for γ).
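One possible reading of this decoding step is sketched below in Python. It assumes that each bit group is interpreted as an unsigned integer with the most significant bit first and that "normalized by a factor" means division by $10^2$ or $10^6$. Neither detail is stated in the text, so the sketch is an illustration rather than the authors' exact procedure. Under this reading, 24 bits give C values up to about $1.7 \times 10^5$ and 27 bits give γ values up to about 134, which covers the upper ends of the grid search ranges.

```python
C_BITS, GAMMA_BITS = 24, 27  # 51 bits in total, as described in the text

def bits_to_int(bits):
    """Interpret a list of 0/1 values as an unsigned integer (MSB first)."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def decode(chromosome):
    """Map a 51-bit chromosome to (C, gamma) under the assumed scaling."""
    if len(chromosome) != C_BITS + GAMMA_BITS:
        raise ValueError("expected a 51-bit chromosome")
    c = bits_to_int(chromosome[:C_BITS]) / 10 ** 2
    gamma = bits_to_int(chromosome[C_BITS:]) / 10 ** 6
    return c, gamma

# Toy usage: both bit groups encode the integer 1000.
chromosome = [int(ch) for ch in format(1000, "024b") + format(1000, "027b")]
print(decode(chromosome))
```

An all-zero group would decode to 0, which is not a valid C or γ for an SVM, so a real fitness function would have to reject or offset such chromosomes.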

IV. EXPERIMENTAL RESULTS

A. Data & Evaluation

We used two published datasets from the model organism Drosophila melanogaster: the first consisted of 310 CRMs with known regulatory activity, and the second was a collection of 35 novel enhancers, selected from more than 8000 enhancers, whose expression category was tested in vivo [3]. The 310 enhancers are from the CRM Activity Database (CAD), with the expression driven by published CRMs taken from the REDfly database [8]. For the second set, we used the 310 known enhancers of the first set as the training set. The novel enhancers were selected and tested in vivo in [3].

¹ http://cran.r-project.org/web/packages/genalg/index.html

Fig. 2. The 51-bit binary representation consists of 24 bits for C and 27 bits for γ. After a generation, GA operators such as mutation and crossover are applied to generate a new representation.

It is worth noting that the majority of the datasets were imbalanced, i.e., the numbers of active and non-active enhancers were not equal. To evaluate this type of data, we used the so-called Balanced Accuracy (BACC), the average of the Sensitivity and Specificity of the prediction results. In addition, we used the traditional Area Under the Curve (AUC) to estimate the trade-off between the two measurements. All evaluations were computed under unbiased leave-one-out cross-validation (LOOCV). The proposed method was run 20 times and the results were recorded. The initial parameters of the GA were the defaults of the genalg package. The run time was an hour on a PC with a 3.3 GHz CPU and 4 GB of RAM, whereas the traditional grid search took about 5 minutes because of its simple strategy. However, this is not a significant problem given today's increasingly powerful machines.
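The two evaluation measures can be written out in a short Python sketch. The paper computes AUC with the ROCR package in R. The version below instead uses the equivalent rank-based definition, the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The 0/1 label coding and the toy data are assumptions made for the example.

```python
def balanced_accuracy(y_true, y_pred):
    """BACC = (sensitivity + specificity) / 2 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy imbalanced example: 2 active and 6 non-active enhancers.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1, 0.1, 0.0]
print(balanced_accuracy(y_true, y_pred), auc(y_true, scores))
```

In this toy case plain accuracy would be 7/8 even though half of the active enhancers are missed, while BACC is 0.75, which shows why a balanced measure suits imbalanced data.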

B. Comparative Study

1) Known enhancer dataset: GA-SVM outperforms the previous study on the Meso, VM, SM, and VM+SM datasets (Fig. 3). In the case of Meso+SM, the performances of the two methods are similar, both reaching up to 82%. It is remarkable that GA-SVM improved the performance of SVM prediction by up to 10% on average for the three cases of unique regulatory activity (Meso, VM, and SM). This large gap demonstrates the effectiveness of SVM parameter optimization for a particular type of data.

In terms of AUC, the mean and deviation over the 20 runs were recorded (see Table I). The proposed method again has significantly higher performance than the grid search method in the cases of unique regulatory activities. The ROCR package [10] is used for the computation.

Fig. 3. Comparison of Balanced Accuracy (BACC) between the GA-SVM method and the grid search (GS-SVM) method [3] for five experimental categories. GA-SVM (over 20 runs) outperforms the other method in all cases.

TABLE I
COMPARISON OF THE GA-SVM METHOD AND THE GRID SEARCH METHOD (GS-SVM) [3] IN TERMS OF AREA UNDER THE CURVE (AUC) FOR ALL EXPERIMENTAL CATEGORIES

Regulatory category    GS-SVM [3]    GA-SVM
Meso+SM                0.82          0.83 ± 0.01
VM+SM                  0.74          0.82 ± 0.02

2) In vivo enhancer test: The authors of [3] carried out in vivo experiments for 35 of more than 8000 new enhancers and reported their specific regulatory activities. In this paper, we evaluate the performance of the two methods by predicting these datasets. We also counted so-called partially correct predictions, in which an enhancer was predicted to have one of the expression categories observed. Both methods perform well, predicting up to approximately 80% of the novel CRM regulatory activities (see Fig. 4). Interestingly, GA-SVM significantly increases the number of CRM activity predictions for partial expression. It also significantly decreases the number of false positive CRM activity predictions compared to the previous results [3]. This indicates that well-suited prediction parameters are necessary for learning rules from known CRM datasets in order to predict the activity of novel ones, where the training information might not closely fit the information being predicted.

V. CONCLUSIONS

We proposed a new way to improve the prediction of gene regulatory activity based on transcription factor binding profiles. Accuracy improved by roughly more than 10% compared to the previous method. In particular, we obtained significantly better results in the case of unique expression categories, where the prediction information needs to be more precise. In addition, we also outperformed the previous method in predicting novel enhancers when using known enhancers as the training set. This indicates the importance of optimization in biological prediction. Biological data are now emerging rapidly, which creates the need for effective computational optimization. Future work includes tackling a diverse range of prediction problems in biology and then building an automatic system of evolutionary computation algorithms that learns the prediction parameters from the biological data itself.

Fig. 4. Comparison between the GA-SVM method and the grid search method [3] for the novel enhancers. True Positive and False Positive indicate the CRMs with unique regulatory activities for which the prediction results are true or false, respectively. Partial indicates the number of CRMs for which the predicted regulatory activity is one of the expression categories detected by in vivo experiments.

ACKNOWLEDGMENT

This work is partially supported by Vietnam's National Foundation for Science and Technology Development (NAFOSTED).

REFERENCES

[1] A. Stark, “Learning the transcriptional regulatory code,” Mol. Syst. Biol., vol. 5, p. 329, 2009.

[2] M. I. Arnone and E. H. Davidson, “The hardwiring of development: organization and function of genomic regulatory systems,” Development, vol. 124, pp. 1851–1864, May 1997.

[3] R. P. Zinzen, C. Girardot, J. Gagneur, M. Braun, and E. E. Furlong, “Combinatorial binding predicts spatio-temporal cis-regulatory activity,” Nature, vol. 462, pp. 65–70, Nov. 2009.

[4] P. J. Park, “ChIP-seq: advantages and challenges of a maturing technology,” Nat. Rev. Genet., vol. 10, pp. 669–680, Oct. 2009.

[5] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995. [Online]. Available: http://dx.doi.org/10.1007/BF00994018

[6] X. Zhang, X. Chen, and Z. He, “An ACO-based algorithm for parameter optimization of support vector machines,” Expert Syst. Appl., vol. 37, pp. 6618–6628, Sep. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.eswa.2010.03.067

[7] C.-L. Huang and C.-J. Wang, “A GA-based feature selection and parameters optimization for support vector machines,” Expert Systems with Applications, vol. 31, no. 2, pp. 231–240, 2006. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417405002083

[8] M. S. Halfon, S. M. Gallo, and C. M. Bergman, “REDfly 2.0: an integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila,” Nucleic Acids Res., vol. 36, pp. D594–D598, Jan. 2008.

[9] C. Reeves, Genetic Algorithms and Combinatorial Optimisation: Applications of Modern Heuristic Techniques. In V. J. Rayward-Smith (Ed.), Alfred Waller Ltd, Henley-on-Thames, UK, 1995.

[10] T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer, “ROCR: visualizing classifier performance in R,” Bioinformatics, vol. 21, pp. 3940–3941, Oct. 2005.
