GA SVM: A genetic algorithm for improving
gene regulatory activity prediction
Dong Do Duc∗, Tri-Thanh Le†, Trung-Nghia Vu‡, Huy Q Dinh§, Hoang Xuan Huan¶
∗Institute of Information Technology, Vietnam National University, Hanoi, 144 Xuan Thuy, Hanoi, Vietnam
†Department of Information Technology, Vietnam Maritime University, Hai Phong, Vietnam
‡Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
§Center for Integrative Bioinformatics, Max F Perutz Laboratories, Vienna, Dr Bohrgasse 9, 1030 Vienna, Austria
and Gregor Mendel Institute of Molecular Plant Biology, Austrian Academy of Sciences, Dr Bohrgasse 3, 1030 Vienna, Austria
¶University of Technology (UET), Vietnam National University, Hanoi, 144 Xuan Thuy, Hanoi
Email:{dongdoduc, huanhx}@vnu.edu.vn, thanhletri@vimaru.edu.vn, TrungNghia.Vu@ua.ac.be, huy.dinh@univie.ac.at
Abstract—Gene regulatory activity prediction is an important step toward understanding the key factors of gene regulation in biology. Recent advances in sequencing technologies allow us to address this task efficiently. Among the methods applied, Support Vector Machine (SVM) has been used successfully, reaching more than 80% accuracy in predicting gene regulatory activity in Drosophila embryonic development. In this paper, we introduce a metaheuristic based on a genetic algorithm (GA) to select the best parameters for regulatory prediction from transcriptional factor binding profiles. Our approach improves accuracy by more than 10% compared to the traditional grid search. The improvements are also supported by biological experimental data. Thus, the proposed method helps boost not only the prediction performance but also the potential biological insights.
I. INTRODUCTION & RELATED WORKS
Since its double-helix structure was discovered in 1953, the DNA (deoxyribonucleic acid) sequence, which consists of just four letters (adenine, cytosine, guanine, thymine), has been considered the natural blueprint of organism development. The genome contains a variety of information encoded in a long sequence of DNA letters. One interesting example is gene-regulatory information, which shapes the different gene expression patterns. An enhancer, or cis-regulatory module (CRM), is a DNA fragment that carries the information needed to regulate its associated genes. It contains the binding sites for the specific transcriptional factor (TF) proteins corresponding to a certain regulatory activity. Understanding CRM activity and its requirements is therefore a fundamental problem in biology [1]. The authors of [2] proposed a simple model in which CRM activity depends on the respective TF bindings, i.e., either the elimination of a TF or the disruption of its binding changes the CRM function. This model has been supported by several pieces of small-scale evidence from ChIP (Chromatin Immunoprecipitation) experiments with Polymerase Chain Reaction amplification.
Recently, one of the first genome-wide experiments [3] was carried out successfully using microarray technologies in the model organism Drosophila melanogaster. This work used ChIP on tiling microarrays to obtain the first high-resolution atlas of mesodermal cis-regulatory modules. The data provided strong experimental support for the model mentioned above. In addition, the authors used transcriptional factor binding profiles measured by ChIP signals [4] to predict the expression patterns of the genes regulated by the respective enhancers. The prediction performance was good and, more importantly, they predicted some novel enhancers with highly accurate expression categories. Thus, learning the regulatory code that drives different expression patterns by computational methods is a very attractive branch of computational biology [1].
To predict the expression patterns of genes, the authors of [3] applied a traditional grid search to optimize the parameters of a radial-kernel Support Vector Machine (SVM, [5]) and obtained up to 82% accuracy under the leave-one-out cross validation (LOOCV) framework. The cost C and γ are the two parameters of the radial-kernel SVM. The former determines the trade-off between minimizing the fitting error and maximizing the classification margin, whereas the latter affects the efficiency of the kernel function, especially for high-dimensional data. Parameter optimization plays an important role in the prediction performance of SVM, especially with the radial kernel [6]. Metaheuristic approaches (e.g., Genetic Algorithm and Ant Colony Optimization) have been applied successfully to optimize SVM parameters ([6], [7]) in different problem contexts. The grid search used by the authors of [3] was a quick method for approximating efficient parameters for SVM prediction. However, this method explored only a sparse portion of the parameter space.
As a consequence, three out of five test cases achieved only 70% accuracy on average, and just one case reached more than 80%. Notably, those three cases were the ones in which the expression pattern corresponds uniquely to one enhancer activity. Thus, more intensive methods are needed to seek the best parameters, particularly for very strict datasets in which the available information might not be enough for standard prediction.
978-1-4673-0309-5/12/$31.00 ©2012 IEEE
We introduce a genetic algorithm approach to improve the performance of enhancer activity prediction. Using GA, the method searches the parameter space more intensively than the traditional grid search did, in order to find better parameters for the prediction. Consequently, the proposed approach outperforms the previous method [3] and obtains more than 80% LOOCV accuracy on average for all the cases. More importantly, our results are significantly better when predicting the regulatory activity of novel enhancers with in vivo validated data. Our study demonstrates the need for parameter selection and optimization in SVM prediction with specific biological datasets.
II. BIOLOGICAL DATA AND PREDICTION PROBLEM
A. Transcriptional binding landscapes in embryonic Drosophila development
Drosophila is a model organism for embryonic development research in biology because of the well-established time-course experiments for several important transcriptional factors such as Twist or Tinman [3]. It is also well known that, at very early time points of cell development, only DNA information might exist. This allows us to investigate the importance of DNA information (e.g., DNA motifs) with respect to the developmental regulation of the cell. ChIP is a method to selectively enrich for DNA sequences bound by a particular protein. Recently, this technology has been used to identify active CRMs systematically at whole-genome scale by either tiling microarray (ChIP-chip) or deep sequencing (ChIP-Seq). Using ChIP-chip, the authors of [3] used a tiling array to obtain transcriptional factor binding data for five important mesodermal factors (Twist, Tinman, Mef2, Bagpipe, and Biniou) at 5 crucial time points during embryogenesis (Fig. 1).
Subsequently, each CRM is assigned one expression category (mesoderm, somatic muscle, or visceral muscle; referred to as meso, sm, and vm from here on; Fig. 1) by using a well-known database (e.g., the REDfly database [8]) consisting of 310 CRMs. In this dataset, a number of CRMs belong to ambiguous expression categories, i.e., their patterns are detected in both meso and sm (called meso sm), or in both vm and sm (called vm sm). In addition, the authors of [3] identified in vivo the expression category of 35 de novo CRMs that are absent from the REDfly database, using transgenic reporter assay experiments to determine the expression pattern of those novel CRMs. Importantly, this makes it possible to test a prediction approach by predicting the activities of those novel CRMs while using the known REDfly CRMs for training.
B. Spatio-temporal cis-regulatory activity prediction in machine learning context
Researchers in [3] applied Support Vector Machine to establish a framework for predicting transcriptional regulatory activity, i.e., expression category, from the binding profiles of the corresponding transcriptional factors. The prediction helped to indicate the potential of determining the specific transcriptional factors, and the degree of their binding, that influence the expression patterns they regulate. In the machine learning context, each CRM was represented by a data object with at most 15 features, which were combinations of transcriptional factors and time points. The SVM method was applied to predict the expression pattern of each CRM. In detail, a binary SVM was used to predict the group of an enhancer from the binding of 5 transcriptional factors at 5 embryonic development time points. The groups were mesoderm, somatic muscle, and visceral muscle (meso, SM, VM). The combinations Meso+SM and VM+SM were also considered because they are naturally observed in the expression data.
Fig. 1. Regulatory activity prediction based on transcriptional binding measured by ChIP-chip peak heights. The peak height indicates the ChIP binding of the respective TF at a specific time point. In this figure, Twist (Twi) and Tin are at the early time points (5-7h, 8-9h, 10-11h) and Bin is at the late time points (10-11h, 12-13h, 13-15h), whereas Bap is only at 10-11h and Mef2 is at all time points. The binding profile is then used to predict the group of the enhancer activity. The three groups (mesoderm, somatic muscle, visceral muscle) are shown on the right side. A part of the figure is from [1].
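The feature representation described above can be sketched as follows. This is a minimal illustration, not the pipeline of [3]: the factor and time-point coverage is read off the Fig. 1 caption, while the names `COVERAGE`, `feature_names`, `to_vector`, and `binary_labels`, the `TF@time` naming scheme, and the choice of 0.0 for a missing peak are assumptions made for the example.

```python
# Time points and per-factor coverage as listed in the Fig. 1 caption.
TIME_POINTS = ["5-7h", "8-9h", "10-11h", "12-13h", "13-15h"]
COVERAGE = {
    "Twi": TIME_POINTS[:3],   # early time points
    "Tin": TIME_POINTS[:3],   # early time points
    "Bin": TIME_POINTS[2:],   # late time points
    "Bap": ["10-11h"],        # single time point
    "Mef2": TIME_POINTS,      # all time points
}


def feature_names():
    """Return one feature name per (factor, time point) combination."""
    return [f"{tf}@{tp}" for tf, tps in COVERAGE.items() for tp in tps]


def to_vector(peaks):
    """Map a dict of peak heights to a fixed-order feature vector.

    Missing combinations are filled with 0.0 (an assumption of this sketch).
    """
    return [float(peaks.get(name, 0.0)) for name in feature_names()]


def binary_labels(categories, target):
    """One-vs-rest labels: +1 for the target category, -1 otherwise."""
    return [1 if c == target else -1 for c in categories]
```

With this coverage the vector has 3 + 3 + 3 + 1 + 5 = 15 entries, which matches the maximum of 15 features stated in the text.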
III. METHODS
A. SVM prediction of regulatory activity based on transcriptional factor binding profiles
An SVM constructs a hyperplane in an N-dimensional space that optimally separates the data into two categories. Given an individual enhancer and its corresponding binding profiles from ChIP-chip data, a binary SVM is used to predict its transcriptional category. An SVM model is built to learn how to classify an enhancer x into one of two classes, e.g., mesodermal or not mesodermal, from a training set of m enhancers with known activities. The SVM classifier is based on the following decision function: $f(x) = \sum_{i=1}^{m} \lambda_i K(x_i, x)$, where K is a kernel function and the $\lambda_i$ are coefficients learned during the training process. Usually, the linear kernel function is used for simple data and the radial kernel function for more complex cases.
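To make the decision function concrete, the sketch below evaluates $f(x) = \sum_i \lambda_i K(x_i, x)$ with a radial kernel $K(u, v) = \exp(-\gamma \lVert u - v \rVert^2)$. It assumes the coefficients are already trained and, following the formula in the text, uses no bias term; the function names and the sign-based class rule are illustrative choices rather than details taken from the paper.

```python
import math


def rbf_kernel(u, v, gamma):
    """Radial kernel exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def decision_value(x, training_points, lambdas, gamma):
    """Evaluate f(x) = sum_i lambda_i * K(x_i, x)."""
    return sum(
        lam * rbf_kernel(x_i, x, gamma)
        for lam, x_i in zip(lambdas, training_points)
    )


def predict(x, training_points, lambdas, gamma):
    """Assign class +1 when f(x) >= 0 and -1 otherwise (assumed rule)."""
    return 1 if decision_value(x, training_points, lambdas, gamma) >= 0 else -1
```

A larger γ makes the kernel decay faster with distance, so each training enhancer influences a smaller neighbourhood, which is one reason the radial kernel is sensitive to this parameter.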
SVM is a parameter-sensitive classification method, particularly with the radial kernel function. Researchers in [3] used a fine-grained grid search to find the best result, in which C and γ were set to values ranging from $10^{-2}$ to $10^{5}$ and from $10^{-6}$ to $10^{2}$, respectively. This resulted in an average SVM accuracy of 78% under LOOCV. In this paper, we investigate the optimization of the two important parameters C and γ using a Genetic Algorithm. The GA searches the parameter space more finely, so better results are expected.
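For comparison, a grid search over the ranges quoted above can be sketched as below. The exact grid used in [3] is not given here, so placing one grid point at each power of ten is an assumption, and `score` stands for any user-supplied evaluation (for example a LOOCV accuracy) rather than a specific function from the paper.

```python
from itertools import product


def log_grid(lowest_exponent, highest_exponent):
    """Powers of ten from 10**lowest_exponent to 10**highest_exponent."""
    return [10.0 ** e for e in range(lowest_exponent, highest_exponent + 1)]


def grid_search(score, c_values, gamma_values):
    """Return the (C, gamma) pair with the highest score on the grid."""
    return max(product(c_values, gamma_values), key=lambda pair: score(*pair))


C_GRID = log_grid(-2, 5)       # 8 candidate values for C
GAMMA_GRID = log_grid(-6, 2)   # 9 candidate values for gamma
```

Under this assumption the grid holds only 8 × 9 = 72 candidate pairs, which illustrates why a grid can miss good parameter values lying between its points.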
B. Genetic Algorithm
The GA works as follows (see the pseudocode in Algorithm 1). The t-th generation, called P(t), consists of N solutions, i.e., N sets of parameters (C, γ). Each solution is evaluated with a fitness function, here an AUC value. The next, (t + 1)-th, generation is created by selecting the best individuals via a lottery cycle procedure and by applying GA operators including mutation and crossover. More details about GA can be found in [9]. The construction of the chromosome and the fitness function of the GA for our problem are discussed in the next section.
Algorithm 1: GA algorithm to improve the prediction
Data: An enhancer set with known regulation activity
Output: The best solution
begin
    t ← 0 (generation index);
    Initialize the generation P(t);
    Evaluate P(t);
    while termination condition is not met do
        t ← t + 1 (next generation);
        Select new generation Q(t) from P(t − 1);
        Create P(t) from Q(t) by GA operators;
        Evaluate P(t) and select the best individuals;
    Output the best solution;
end
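The loop of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative stand-in, not the genalg implementation used in the paper: fitness-proportional (roulette-wheel) selection is assumed as the selection step, one elite individual is carried over unchanged, and the population size, mutation rate, and fixed generation count are hypothetical defaults. The fitness function must return non-negative values.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return individual
    return population[-1]


def run_ga(fitness, n_bits, pop_size=20, generations=30, p_mut=0.02, seed=0):
    """Evolve bit-vector chromosomes and return the best one found."""
    rng = random.Random(seed)
    # Initialize and evaluate P(0).
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        fitnesses = [fitness(ind) for ind in population]
        offspring = [best[:]]  # elitism: keep the best individual so far
        while len(offspring) < pop_size:
            parent_a = roulette_select(population, fitnesses, rng)
            parent_b = roulette_select(population, fitnesses, rng)
            cut = rng.randint(1, n_bits - 1)          # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit
                     for bit in child]                # bit-flip mutation
            offspring.append(child)
        population = offspring
        best = max(population, key=fitness)
    return best
```

For example, `run_ga(sum, 16)` maximizes the number of one bits in a 16-bit vector. In the setting of this paper the fitness would instead decode the chromosome to (C, γ) and return the AUC of the resulting SVM.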
The standard implementation of the GA, with default parameters, is taken from the R package genalg¹.
C. Fitness function and representation of parameters in GA
The main issue in GA is how to represent the problem as a chromosome. In our method, the two parameters C and γ are encoded in a chromosome as a binary vector. In detail, each chromosome is a 51-bit binary vector that represents the real values of the parameters. The first 24 bits are reserved for C and the remaining 27 bits represent the value of γ. Figure 2 gives an example of a chromosome and of the mutation and crossover operations. In mutation, the zero bit in the dark cell of a chromosome is changed to a one bit in the resulting chromosome. In crossover, two chromosomes are split at the same position, and then the heads and tails of the two chromosomes are exchanged.
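The two operators described above are simple enough to state directly. The sketch below treats a chromosome as a Python list of bits; the function names and the choice to return new lists instead of modifying the inputs are illustrative.

```python
def mutate(chromosome, position):
    """Flip the bit at the given position and return a new chromosome."""
    flipped = list(chromosome)
    flipped[position] ^= 1
    return flipped


def crossover(parent_a, parent_b, cut):
    """Split both parents at the same position and exchange their tails."""
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b
```

Because C occupies the first 24 bits and γ the last 27, a cut inside the first 24 bits produces children that mix the C values of both parents while each child takes γ entirely from one parent.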
At each step, the GA evolves the population in silico and selects the best individuals for the next generation according to the fitness function, which is defined as the Area Under the Curve (AUC) value computed with [10]. At the last stage, the best binary vectors are transformed back to real-valued parameters, normalized by a factor of $10^{2}$ (for C) and $10^{6}$ (for γ).
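One plausible reading of this decoding step is sketched below: each bit field is read as an unsigned integer and divided by its normalization factor. The division, the most-significant-bit-first order, and all names are assumptions of the sketch, since the text does not spell them out.

```python
C_BITS, GAMMA_BITS = 24, 27
C_SCALE, GAMMA_SCALE = 10 ** 2, 10 ** 6


def bits_to_int(bits):
    """Read a list of bits as an unsigned integer, most significant bit first."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def decode(chromosome):
    """Turn a 51-bit chromosome into real-valued (C, gamma)."""
    if len(chromosome) != C_BITS + GAMMA_BITS:
        raise ValueError("expected a 51-bit chromosome")
    c = bits_to_int(chromosome[:C_BITS]) / C_SCALE
    gamma = bits_to_int(chromosome[C_BITS:]) / GAMMA_SCALE
    return c, gamma
```

Under this reading C takes values from 0 to about 1.7 × 10^5 in steps of 10^-2, and γ from 0 to about 134 in steps of 10^-6, which covers the grid search ranges quoted earlier at a much finer resolution. An all-zero field decodes to 0, which is not a valid SVM parameter, so a practical implementation would need to guard against it.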
IV. EXPERIMENTAL RESULTS
A. Data & Evaluation
We used two published datasets from the model organism Drosophila melanogaster: the first consisted of 310 CRMs with known regulatory activity, and the second was a selected collection of 35 novel enhancers whose expression categories were tested in vivo, out of more than 8000 enhancers [3]. The 310 enhancers are from the CRM Activity Database (CAD), with the expression driven by published CRMs, using the REDfly database [8]. For the second set, we used the 310 known enhancers as the training set. The novel enhancers were selected and tested in vivo in [3].
Fig. 2. The 51-bit binary representation consists of 24 bits for C and 27 bits for γ. After a generation, GA operators such as mutation and crossover are applied to generate a new representation.
¹ http://cran.r-project.org/web/packages/genalg/index.html
It is worth noting that the majority of the datasets were imbalanced, i.e., the numbers of active and non-active enhancers were not equal. To evaluate this type of data, we used the so-called Balanced Accuracy (BACC), defined as the average of the Sensitivity and Specificity of the prediction results. In addition, we used the traditional Area Under the Curve (AUC) to estimate the trade-off between the two measurements. All evaluations were computed under the unbiased leave-one-out cross validation (LOOCV) setting. The proposed method was run 20 times and the results were recorded. The initial parameters of the GA were the defaults of the genalg package. The run time is one hour on a 3.3 GHz PC with 4 GB of RAM, while the traditional grid search takes about 5 minutes because of its simple strategy. However, this is not a significant problem given increasingly powerful machines.
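The two evaluation ingredients above, Balanced Accuracy and leave-one-out cross validation, can be sketched as follows. Labels are assumed to be +1 (active) and -1 (non-active), and `fit_predict` is a hypothetical callback that trains on the given data and returns the predicted label of the held-out sample; neither convention is taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of sensitivity and specificity for labels +1 / -1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p != 1)
    tn = sum(1 for t, p in pairs if t != 1 and p != 1)
    fp = sum(1 for t, p in pairs if t != 1 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def leave_one_out(samples, labels, fit_predict):
    """Predict each sample from a model trained on all the other samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(fit_predict(train_x, train_y, samples[i]))
    return predictions
```

On an imbalanced set, a classifier that always predicts the majority class has high plain accuracy but a BACC of only 0.5, which is why BACC is the more informative measure here.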
B. Comparative Study
1) Known enhancer dataset: The GA SVM outperforms the previous study on the Meso, VM, SM, and VM SM datasets (Fig. 3). For Meso SM, the performances of the two methods are similar, both reaching up to 82%. Remarkably, the GA SVM improved the performance of SVM prediction by up to 10% on average for the three cases of unique regulatory activity (Meso, VM, and SM). This large gap demonstrates the effectiveness of SVM parameter optimization for a particular type of data.
In terms of AUC, the mean and deviation over the 20 runs were recorded (see Table I). The proposed method again has significantly higher performance than the grid search method in the cases of unique regulatory activities. The ROCR package [10] is used for the computation.
Fig. 3. Comparison of Balanced Accuracy (BACC) between the GA SVM method and the grid search (GS SVM) method [3] for five experimental categories. The GA SVM (over 20 runs) outperforms the other method in all cases.
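For readers without R, the AUC and the per-category summary over repeated runs can be approximated in plain Python. This is a stand-in for the ROCR computation, not a reimplementation of it: the AUC is computed as the fraction of (active, non-active) pairs ranked correctly, with ties counted as one half, and the sample standard deviation is assumed for the reported deviation.

```python
from statistics import mean, stdev


def auc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l != 1]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def summarize(values):
    """Mean and sample standard deviation over repeated runs."""
    return mean(values), stdev(values)
```

Calling `summarize` on the 20 AUC values of one category yields a pair that can be reported in the mean ± deviation form used in Table I.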
Regulatory category    GS SVM [3]    GA SVM
Meso SM                0.82          0.83±0.01
VM SM                  0.74          0.82±0.02
TABLE I
COMPARISON BETWEEN THE GA SVM METHOD AND THE GRID SEARCH METHOD (GS SVM) [3] IN TERMS OF AREA UNDER THE CURVE (AUC) FOR ALL EXPERIMENTAL CATEGORIES
2) In vivo enhancer test: In [3], the authors carried out in vivo experiments for 35 of more than 8000 new enhancers and reported their specific regulatory activities. In this paper, we evaluate the performance of the two methods by predicting these enhancers. We also count a prediction as partially correct if the enhancer was predicted to be in one of the observed expression categories. Both methods perform well, correctly predicting up to approximately 80% of novel CRM regulatory activities (see Fig. 4). Interestingly, the GA SVM significantly increases the number of correct CRM activity predictions for partial expression. It also significantly decreases the number of false-positive CRM activity predictions compared to the previous results [3]. This indicates that well-suited prediction parameters are necessary for learning rules from known CRM datasets in order to predict the activity of novel ones, where the training information might not fit the information to be predicted well.
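The scoring rule used for the novel enhancers can be written as a small function. The sketch follows the description in the text and in the Fig. 4 caption for CRMs with a unique activity and for partial matches; the label "miss" for a multi-category CRM whose prediction matches none of its observed categories is an assumption of this sketch, as that case is not described.

```python
from collections import Counter


def score_prediction(predicted, observed):
    """Classify one prediction against the categories observed in vivo."""
    observed = set(observed)
    if not observed:
        raise ValueError("at least one observed category is required")
    hit = predicted in observed
    if len(observed) == 1:
        # CRM with a unique regulatory activity.
        return "true_positive" if hit else "false_positive"
    # CRM active in several categories: a single match counts as partial.
    return "partial" if hit else "miss"


def tally(predictions, observations):
    """Count outcomes over a collection of novel CRMs."""
    return Counter(
        score_prediction(p, o) for p, o in zip(predictions, observations)
    )
```

For example, `score_prediction("sm", {"meso", "sm"})` returns `"partial"`, while `score_prediction("vm", {"sm"})` returns `"false_positive"`.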
V. CONCLUSIONS
We proposed a new way to improve the prediction of gene regulatory activity based on transcriptional factor binding profiles. Our approach improved accuracy by more than 10% compared to the previous method. In particular, we obtained significantly better results in the case of unique expression categories, where the prediction information needs to be more precise. In addition, we also outperformed the previous method in predicting the novel enhancers when using known enhancers as the training set. This indicates the importance of optimization in biological prediction. Biological data is now emerging rapidly, which leads to a need for better computational optimization. Future work includes tackling a variety of prediction problems in biology and then building an automatic system of evolutionary computation algorithms to learn the prediction parameters from the biological data itself.
Fig. 4. Comparison between the GA SVM method and the grid search method [3] for the novel enhancers. True Positive and False Positive indicate the CRMs with unique regulatory activities for which the prediction results are true or false, respectively. Partial indicates the number of CRMs for which the predicted regulatory activity is one of the expression categories detected by in vivo experiments.
ACKNOWLEDGMENT
This work is partially supported by Vietnam's National Foundation for Science and Technology Development (NAFOSTED).
REFERENCES
[1] A. Stark, “Learning the transcriptional regulatory code,” Mol. Syst. Biol., vol. 5, p. 329, 2009.
[2] M. I. Arnone and E. H. Davidson, “The hardwiring of development: organization and function of genomic regulatory systems,” Development, vol. 124, pp. 1851–1864, May 1997.
[3] R. P. Zinzen, C. Girardot, J. Gagneur, M. Braun, and E. E. Furlong, “Combinatorial binding predicts spatio-temporal cis-regulatory activity,” Nature, vol. 462, pp. 65–70, Nov. 2009.
[4] P. J. Park, “ChIP-seq: advantages and challenges of a maturing technology,” Nat. Rev. Genet., vol. 10, pp. 669–680, Oct. 2009.
[5] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995. [Online]. Available: http://dx.doi.org/10.1007/BF00994018
[6] X. Zhang, X. Chen, and Z. He, “An ACO-based algorithm for parameter optimization of support vector machines,” Expert Syst. Appl., vol. 37, pp. 6618–6628, Sep. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.eswa.2010.03.067
[7] C.-L. Huang and C.-J. Wang, “A GA-based feature selection and parameters optimization for support vector machines,” Expert Systems with Applications, vol. 31, no. 2, pp. 231–240, 2006. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417405002083
[8] M. S. Halfon, S. M. Gallo, and C. M. Bergman, “REDfly 2.0: an integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila,” Nucleic Acids Res., vol. 36, pp. D594–598, Jan. 2008.
[9] C. Reeves, Genetic Algorithms and Combinatorial Optimisation: Applications of Modern Heuristic Techniques. In V. J. Rayward-Smith (Eds), Alfred Waller Ltd, Henley-on-Thames, UK, 1995.
[10] T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer, “ROCR: visualizing classifier performance in R,” Bioinformatics, vol. 21, pp. 3940–3941, Oct. 2005.