Lecture Notes in Bioinformatics 3909
Edited by S. Istrail, P. Pevzner, and M. Waterman
Editorial Board: A. Apostolico, S. Brunak, M. Gelfand,
T. Lengauer, S. Miyano, G. Myers, M.-F. Sagot, D. Sankoff,
R. Shamir, T. Speed, M. Vingron, W. Wong
Subseries of Lecture Notes in Computer Science
Alberto Apostolico Concettina Guerra
Sorin Istrail Pavel Pevzner
Michael Waterman (Eds.)
Series Editors
Sorin Istrail, Brown University, Providence, RI, USA
Pavel Pevzner, University of California, San Diego, CA, USA
Michael Waterman, University of Southern California, Los Angeles, CA, USA

Volume Editors
Alberto Apostolico
Concettina Guerra
University of Padova, Department of Information Engineering
Via Gradenigo 6/a, 35131 Padova, Italy
E-mail: {axa, guerra}@dei.unipd.it
Sorin Istrail
Brown University, Center for Molecular Biology and Computer Science Department
115 Waterman St., Providence, RI 02912, USA
E-mail: sorin@cs.brown.edu
Pavel Pevzner
University of California at San Diego
Department of Computer Science and Engineering
La Jolla, CA 92093-0114, USA
E-mail: ppevzner@cs.ucsd.edu
Michael Waterman
University of Southern California
Department of Molecular and Computational Biology
1050 Childs Way, Los Angeles, CA 90089-2910, USA
E-mail: msw@usc.edu
Library of Congress Control Number: 2006922626
CR Subject Classification (1998): F.2.2, F.2, E.1, G.2, H.2.8, G.3, I.2, J.3
LNCS Sublibrary: SL 8 – Bioinformatics
ISBN-10 3-540-33295-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-33295-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
Preface

This volume contains the papers presented at the 10th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2006), which was held in Venice, Italy, on April 2–5, 2006. The RECOMB conference series was started in 1997 by Sorin Istrail, Pavel Pevzner and Michael Waterman. The table on p. VIII summarizes the history of the meetings. RECOMB 2006 was hosted by the University of Padova at the Cinema Palace of the Venice Convention Center, Venice Lido, Italy. It was organized by a committee chaired by Concettina Guerra. A special 10th Anniversary Program Committee was formed by including the members of the Steering Committee and inviting all Chairs of past editions. The Program Committee consisted of the 38 members whose names are listed on a separate page.
From 212 submissions of high quality, 40 papers were selected for presentation at the meeting, and they appear in these proceedings. The selection was based on reviews and evaluations produced by the Program Committee members as well as by external reviewers, and on a subsequent Web-based PC open forum. Following the decision made in 2005 by the Steering Committee, RECOMB Proceedings are published as a volume of Lecture Notes in Bioinformatics (LNBI), which is edited by the founders of RECOMB. Traditionally, the Journal of Computational Biology devotes a special issue to the publication of archival versions of selected conference papers.
RECOMB 2006 featured seven keynote addresses by as many invited speakers: Anne-Claude Gavin (EMBL, Heidelberg, Germany), David Haussler (University of California, Santa Cruz, USA), Ajay K. Royyuru (IBM T.J. Watson Research Center, USA), David Sankoff (University of Ottawa, Canada), Michael S. Waterman (University of Southern California, USA), Carl Zimmer (Science Writer, USA), and Roman A. Zubarev (Uppsala University, Sweden). The Stanislaw Ulam Memorial Computational Biology Lecture was given by Michael S. Waterman. A special feature presentation was devoted to the 10th anniversary and is included in this volume.
As in the past, an important ingredient in the success of the meeting was a lively poster session.
RECOMB06 was made possible by the hard work and dedication of many, from the Steering to the Program and Organizing Committees, from the external reviewers to Venice Convention, Venezia Congressi and the institutions and corporations that provided administrative, logistic and financial support for the conference. The latter include the Department of Information Engineering of the University of Padova, the Broad Institute of MIT and Harvard (USA), the College of Computing of Georgia Tech (USA), the US Department of Energy, IBM Corporation (USA), the International Society for Computational Biology (ISCB), the Italian Association for Informatics and Automatic Computation (AICA), the US National Science Foundation, and the University of Padova. Special thanks are due to all those who submitted their papers and posters and who attended RECOMB 2006 with enthusiasm.
RECOMB 2006 Program Chair
Program Committee
Tatsuya Akutsu (Kyoto University, Japan)
Alberto Apostolico, Chair (Accademia Nazionale dei Lincei, Italy, and Georgia Tech., USA)
Gary Benson (Boston University, USA)
Mathieu Blanchette (McGill, Canada)
Philip E Bourne (University of California San Diego, USA)
Steve Bryant (NCBI, USA)
Andrea Califano (Columbia University, USA)
Andy Clark (Cornell University, USA)
Gordon M Crippen (University of Michigan, USA)
Raffaele Giancarlo (University of Palermo, Italy)
Concettina Guerra (University of Padova, Italy, and Georgia Tech., USA)
Dan Gusfield (University of California, Davis, USA)
Sridhar Hannenhalli (University of Pennsylvania, USA)
Sorin Istrail (Brown University, USA)
Inge Jonassen (University of Bergen, Norway)
Richard M Karp (University of California, Berkeley, USA)
Simon Kasif (Boston University, USA)
Manolis Kellis (MIT, USA)
Giuseppe Lancia (University of Udine, Italy)
Thomas Lengauer (GMD Sant Augustin, Germany)
Michael Levitt (Stanford, USA)
Michal Linial (The Hebrew University in Jerusalem, Israel)
Jill Mesirov (Broad Institute of MIT and Harvard, USA)
Satoru Miyano (University of Tokyo, Japan)
Laxmi Parida (IBM T.J Watson Research Center, USA)
Pavel A Pevzner (University of California San Diego, USA)
Marie-France Sagot (INRIA Rhone-Alpes, France)
David Sankoff (University of Ottawa, Canada)
Ron Shamir (Tel Aviv University, Israel)
Roded Sharan (Tel Aviv University, Israel)
Steve Skiena (State University of New York at Stony Brook, USA)
Terry Speed (University of California, Berkeley, USA)
Jens Stoye (University of Bielefeld, Germany)
Esko Ukkonen (University of Helsinki, Finland)
Martin Vingron (Max Planck Institute for Molecular Genetics, Germany)
Michael Waterman (University of Southern California, USA)
Haim J Wolfson (Tel Aviv University, Israel)
Steering Committee
Sorin Istrail RECOMB General Vice-chair (Brown, USA)
Thomas Lengauer (GMD Sant Augustin, Germany)
Michal Linial (The Hebrew University of Jerusalem, Israel)
Pavel A. Pevzner, RECOMB General Chair (University of California, San Diego, USA)
Ron Shamir (Tel Aviv University, Israel)
Terence P Speed (University of California, Berkeley, USA)
Michael Waterman RECOMB General Chair (University of Southern
California, USA)
Organizing Committee
Alberto Apostolico (Accademia Nazionale dei Lincei, Italy, and Georgia Tech., USA)
Concettina Guerra, Conference Chair (University of Padova, Italy, and Georgia Tech., USA)
Eleazar Eskin, Chair, 10th Anniversary Committee (University of California, San Diego)
Matteo Comin (University of Padova, Italy)
Raffaele Giancarlo (University of Palermo, Italy)
Giuseppe Lancia (University of Udine, Italy)
Cinzia Pizzi (University of Padova, Italy, and Univ. of Helsinki, Finland)
Angela Visco (University of Padova, Italy)
Nicola Vitacolonna (University of Udine, Italy)
Previous RECOMB Meetings

Date/Location                            Hosting Institution                                Program Chair     Conference Chair
January 20-23, 1997, Santa Fe, NM, USA   Sandia National Lab                                Michael Waterman  Sorin Istrail
March 22-25, 1998, New York, NY, USA     Mt. Sinai School of Medicine                       Pavel Pevzner     Gary Benson
April 10-13, 2003, Berlin, Germany       German Federal Ministry for Education & Research   Webb Miller       Martin Vingron
March 27-31, 2004, San Diego, USA        UC San Diego                                       Dan Gusfield      Philip E. Bourne
May 14-18, 2005                          Broad Institute of                                 Satoru Miyano     Jill P. Mesirov and S. Kasif
The RECOMB06 Program Committee gratefully
acknowledges the valuable input received from the
following external reviewers:
David Fernandez-Baca, Vladimir Filkov, Sarel Fleishman, Kristian Flikka, Menachem Former, Iddo Friedberg, Menachem Fruman, Irit Gat-Viks, Gad Getz, Apostol Gramada, Alex Gray, Steffen Grossmann, Jenny Gu, Roderic Guigo, Matthew Hahn, Yonit Halperin, Tzvika Hartman, Christoph Hartmann, Nurit Haspel, Greg Hather, Morihiro Hayashida, Trond Hellem Bø, D. Hermelin, Katsuhisa Horimoto, Moseley Hunter, Seiya Imoto, Yuval Inbar, Nathan Intrator, David Jaffe, Martin Jambon, Shane Jensen, Euna Jeong, Tao Jiang, Juha Kärkkäinen, Hans-Michael Kaltenbach, Simon Kasif, Klara Kedem, Alon Keinan, Wayne Kendal, Ilona Kifer, Gad Kimmel, Jyrki Kivinen, Mikko Koivisto, Rachel Kolodny, Vered Kunik, Vincent Lacroix, Quan Le, Soo Lee, Celine Lefebvre, Hadas Leonov, Jie Liang, Chaim Linhart, Zsuzsanna Lipták, Manway Liu, Aniv Loewenstein, Claudio Lottaz, Claus Lundegaard, Hannes Luz, Aaron Mackey, Ketil Malde, Kartik Mani, Thomas Manke, Yishay Mansour, Adam Margolin, Florian Markowetz, Setsuro Matsuda, Alice McHardy, Kevin Miranda, Leonid Mirny, Stefano Monti, Sayan Mukherjee, Iftach Nachman, Masao Nagasaki,
Stefanie Scheid, Alexander Schliep, Dina Schneidman, Russell Schwartz, Paolo Serafini, Maxim Shatsky, Feng Shengzhong, Tetsuo Shibuya, Ilya Shindyalov, Tomer Shlomi, A. Shulman-Peleg, Abdur Sikder, Gordon Smyth, Yun Song, Rainer Spang, Mike Steel, Israel Steinfeld, Christine Steinhoff, Kristian Stevens, Aravind Subramanian, Fengzhu Sun, Christina Sunita Leslie, Edward Susko, Yoshinori Tamada, Amos Tanay, Haixu Tang, Eric Tannier, Elisabeth Tillier, Wiebke Timm, Aristotelis Tsirigos, Nobuhisa Ueda, Igor Ulitsky, Sandor Vajda, Roy Varshavsky, Balaji Venkatachalam, Stella Veretnik, Dennis Vitkup, Yoshiko Wakabayashi, Jianyong Wang, Junwen Wang, Kai Wang, Li-San Wang, Lusheng Wang, Tandy Warnow, Arieh Warshel, David Wild, Virgil Woods, Terrence Wu, Yufeng Wu, Lei Xie, Chen Xin, Eric Xing, Zohar Yakhini, Nir Yosef, Ryo Yoshida, John Zhang, Louxin Zhang, Degui Zhi, Xianghong J. Zhou, Joseph Ziv Bar, Michal Ziv-Ukelson
RECOMB Tenth Anniversary Venue: il Palazzo del Cinema del Lido di Venezia

Sponsors
Table of Contents

Integrated Protein Interaction Networks for 11 Microbes
Balaji S. Srinivasan, Antal F. Novak, Jason A. Flannick, Serafim Batzoglou, Harley H. McAdams

Hypergraph Model of Multi-residue Interactions in Proteins: Sequentially–Constrained Partitioning Algorithms for Optimization of Site-Directed Protein Recombination
Xiaoduan Ye, Alan M. Friedman, Chris Bailey-Kellogg

Biological Networks: Comparison, Conservation, and Evolutionary

Ling Wang, Marco Ramoni, Paola Sebastiani

A Patient-Gene Model for Temporal Expression Profiles in Clinical Studies
Naftali Kaminski, Ziv Bar-Joseph

Global Interaction Networks Probed by Mass Spectrometry (Keynote)
Anne-Claude Gavin

Statistical Evaluation of Genome Rearrangement (Keynote)
David Sankoff

An Improved Statistic for Detecting Over-Represented Gene Ontology Annotations in Gene Sets
Steffen Grossmann, Sebastian Bauer, Peter N. Robinson, Martin Vingron

Protein Function Annotation Based on Ortholog Clusters Extracted from Incomplete Genomes Using Combinatorial Optimization
Akshay Vashist, Casimir Kulikowski, Ilya Muchnik

Detecting MicroRNA Targets by Linking Sequence, MicroRNA and Gene Expression Data
Jim C. Huang, Quaid D. Morris, Brendan J. Frey

RNA Secondary Structure Prediction Via Energy Density Minimization
Can Alkan, Emre Karakoc, S. Cenk Sahinalp, Peter Unrau, H. Alexander Ebhardt, Kaizhong Zhang, Jeremy Buhler

Structural Alignment of Pseudoknotted RNA
Banu Dost, Buhm Han, Shaojie Zhang, Vineet Bafna

Stan Ulam and Computational Biology (Keynote)
Michael S. Waterman

CONTRAlign: Discriminative Training for Protein Sequence Alignment
Chuong B. Do, Samuel S. Gross, Serafim Batzoglou

Clustering Near-Identical Sequences for Fast Homology Search
Michael Cameron, Yaniv Bernstein, Hugh E. Williams

New Methods for Detecting Lineage-Specific Selection
Adam Siepel, Katherine S. Pollard, David Haussler

A Probabilistic Model for Gene Content Evolution with Duplication, Loss, and Horizontal Transfer
Miklós Csűrös, István Miklós

A Sublinear-Time Randomized Approximation Scheme for the

Ajay K. Royyuru, Gabriela Alexe, Daniel Platt, Ravi Vijaya-Satya, Laxmi Parida, Saharon Rosset, Gyan Bhanot

Efficient Enumeration of Phylogenetically Informative Substrings
Stanislav Angelov, Boulos Harb, Sampath Kannan, Sanjeev Khanna, Junhyong Kim

Phylogenetic Profiling of Insertions and Deletions in Vertebrate Genomes
Sagi Snir, Lior Pachter

Maximal Accurate Forests from Distance Matrices
Constantinos Daskalakis, Cameron Hill, Alexandar Jaffe, Radu Mihaescu, Elehanan Mossel, Satish Rao

Leveraging Information Across HLA Alleles/Supertypes Improves Epitope Prediction
David Heckerman, Carl Kadie, Jennifer Listgarten

Improving Prediction of Zinc Binding Sites by Modeling the Linkage Between Residues Close in Sequence
Sauro Menchetti, Andrea Passerini, Paolo Frasconi, Claudia Andreini, Antonio Rosato

An Important Connection Between Network Motifs and Parsimony Models
Teresa M. Przytycka

Ultraconserved Elements, Living Fossil Transposons, and Rapid Bursts of Change: Reconstructing the Uneven Evolutionary History of the Human Genome (Keynote)
David Haussler

Permutation Filtering: A Novel Concept for Significance Analysis of Large-Scale Genomic Data
Stefanie Scheid, Rainer Spang

Genome-Wide Discovery of Modulators of Transcriptional Interactions in Human B Lymphocytes
Kai Wang, Ilya Nemenman, Nilanjana Banerjee, Adam A. Margolin, Andrea Califano

A New Approach to Protein Identification
Nuno Bandeira, Dekel Tsur, Ari Frank, Pavel Pevzner

Markov Methods for Hierarchical Coarse-Graining of Large Protein

Stochastic Roadmap Simulation
Tsung-Han Chiang, Mehmet Serkan Apaydin, Douglas L. Brutlag, David Hsu, Jean-Claude Latombe

An Outsider’s View of the Genome (Keynote)
Carl Zimmer

Alignment Statistics for Long-Range Correlated Genomic Sequences
Philipp W. Messer, Ralf Bundschuh, Martin Vingron, Peter F. Arndt

Simple and Fast Inverse Alignment
John Kececioglu, Eagu Kim

Revealing the Proteome Complexity by Mass Spectrometry (Keynote)
Roman A. Zubarev

Motif Yggdrasil: Sampling from a Tree Mixture Model
Samuel A. Andersson, Jens Lagergren

A Study of Accessible Motifs and RNA Folding Complexity
Ydo Wexler, Chaya Zilberstein, Michal Ziv-Ukelson

A Parameterized Algorithm for Protein Structure Alignment
Jinbo Xu, Feng Jiao, Bonnie Berger

Geometric Sieving: Automated Distributed Optimization of 3D Motifs for Protein Function Prediction
Brian Y. Chen, Viacheslav Y. Fofanov, Drew H. Bryant, Bradley D. Dodson, David M. Kristensen, Andreas M. Lisewski, Marek Kimmel, Olivier Lichtarge, Lydia E. Kavraki

A Branch-and-Reduce Algorithm for the Contact Map Overlap Problem
Wei Xie, Nikolaos V. Sahinidis

A Novel Minimized Dead-End Elimination Criterion and Its Application to Protein Redesign in a Hybrid Scoring and Search Algorithm for Computing Partition Functions over Molecular Ensembles
Ivelin Georgiev, Ryan H. Lilien, Bruce R. Donald

10 Years of the International Conference on Research in Computational Molecular Biology (RECOMB) (Keynote)
Sarah J. Aerni, Eleazar Eskin

Sorting by Weighted Reversals, Transpositions, and Inverted Transpositions
Martin Bader, Enno Ohlebusch

A Parsimony Approach to Genome-Wide Ortholog Assignment
Zheng Fu, Xin Chen, Vladimir Vacic, Peng Nan, Yang Zhong, Tao Jiang

Detecting the Dependent Evolution of Biosequences
Jeremy Darot, Chen-Hsiang Yeang, David Haussler

Author Index
Integrated Protein Interaction Networks for 11 Microbes

Balaji S. Srinivasan¹,², Antal F. Novak³, Jason A. Flannick³, Serafim Batzoglou³, and Harley H. McAdams²

¹ Department of Electrical Engineering
² Department of Developmental Biology
³ Department of Computer Science, Stanford University,
Stanford, CA 94305, USA
Abstract. We have combined four different types of functional genomic data to create high coverage protein interaction networks for 11 microbes. Our integration algorithm naturally handles statistically dependent predictors and automatically corrects for differing noise levels and data corruption in different evidence sources. We find that many of the predictions in each integrated network hinge on moderate but consistent evidence from multiple sources rather than strong evidence from a single source, yielding novel biology which would be missed if a single data source such as coexpression or coinheritance was used in isolation. In addition to statistical analysis, we demonstrate via case study that these subtle interactions can discover new aspects of even well studied functional modules. Our work represents the largest collection of probabilistic protein interaction networks compiled to date, and our methods can be applied to any sequenced organism and any kind of experimental or computational technique which produces pairwise measures of protein interaction.
Interaction networks are the canonical data sets of the post-genomic era, and more than a dozen methods to detect protein-DNA and protein-protein interactions on a genomic scale have been recently described [1, 2, 3, 4, 5, 6, 7, 8, 9]. As many of these methods require no further experimental data beyond a genome sequence, we now have a situation in which a number of different interaction networks are available for each sequenced organism. However, though many of these interaction predictors have been individually shown to predict experiment [6], the networks generated by each method are often contradictory and not superposable in any obvious way [10, 11]. This seeming paradox has stimulated a burst of recent work on the problem of network integration, work which has primarily focused on Saccharomyces cerevisiae [12, 13, 14, 15, 16, 17]. While the profusion of experimental network data [18] in yeast makes this focus understandable, the objective of network integration remains general: namely, a summary network for each species which uses all the evidence at hand to predict which proteins are functionally linked.

A. Apostolico et al. (Eds.): RECOMB 2006, LNBI 3909, pp. 1–14, 2006.
© Springer-Verlag Berlin Heidelberg 2006
In the ideal case, an algorithm to generate such a network should be able to:
1. Integrate evidence sets of various types (real valued, ordinal scale, categorical, and so on) and from diverse sources (expression, phylogenetic profiles, chromosomal location, two hybrid, etc.).
2. Incorporate known prior information (such as individually confirmed functional linkages), again of various types.
3. Cope with statistical dependencies in the evidence set (such as multiple repetitions of the same expression time course) and noisy or corrupted evidence.
4. Provide a decomposition which indicates the evidence variables which were most informative in determining a given linkage prediction.
5. Produce a unified probabilistic assessment of linkage confidence given all the observed evidence.
In this paper we present an algorithm for network integration that satisfies all five of these requirements. We have applied this algorithm to integrate four different kinds of evidence (coexpression [3], coinheritance [5], colocation [1], and coevolution [9]) to build probabilistic interaction networks for 11 sequenced microbes. The resulting networks are undirected graphs in which nodes correspond to proteins and edge weights represent interaction probabilities between protein pairs. Protein pairs with high interaction probabilities are not necessarily in direct contact, but are likely to participate in the same functional module [19], such as a metabolic pathway, a signaling network, or a multiprotein complex.

We demonstrate the utility of network integration for the working biologist by analyzing representative functional modules from two microbes: the eukaryote-like glycosylation system of Campylobacter jejuni NCTC 11168 and the cell division machinery of Caulobacter crescentus. For each module, we show that a subset of the interactions predicted by our network recapitulates those described in the literature. Importantly, we find that many of the novel interactions in these modules originate in moderate evidence from multiple sources rather than strong evidence from a single source, representing hidden biology which would be missed if a single data type was used in isolation.
2.1 Algorithm Overview
The purpose of network integration is to systematically combine different types
of data to arrive at a statistical summary of which proteins work together within
a single organism
For each of the 11 organisms listed in the Appendix¹, we begin by assembling a training set of known functional modules (Figure 1a) and a battery of different predictors (Figure 1b) of functional association. To gain intuition for what our algorithm does, consider a single predictor E defined on a pair of proteins, such as the familiar Pearson correlation between expression vectors. Also consider a variable L, likewise defined on pairs of proteins, which takes on three possible values: ‘1’ when two proteins are in the same functional category, ‘0’ when they are known to be in different categories, and ‘?’ when one or both of the proteins is of unknown function.

¹ Viewable at http://jinome.stanford.edu/pdfs/recomb06182 appendix.pdf
We note first that two proteins known to be in the same functional module are more likely to exhibit high levels of coexpression than two proteins known to be in different modules, indicated graphically by a right-shift in the distribution of P(E | L = 1) relative to P(E | L = 0) (Figure 1b). We can invert this observation via Bayes’ rule to obtain the probability that two proteins are in the same functional module as a function of the coexpression, P(L = 1 | E). This posterior probability increases with the level of coexpression, as highly coexpressed pairs are more likely to participate in the same functional module.
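
The one-predictor inversion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the names `gaussian_kde_1d` and `posterior_linked` are hypothetical, and the bandwidth and the prior of 0.05 are arbitrary choices made only for the example.

```python
import math
import random


def gaussian_kde_1d(samples, bandwidth):
    """Return a function estimating the density of `samples` with Gaussian kernels."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def posterior_linked(e, dens_linked, dens_unlinked, prior_linked):
    """Bayes' rule: P(L=1 | E=e) from P(E | L=1), P(E | L=0) and P(L=1)."""
    num = dens_linked(e) * prior_linked
    den = num + dens_unlinked(e) * (1.0 - prior_linked)
    # Fall back to the prior where both densities underflow to zero.
    return num / den if den > 0.0 else prior_linked


# Synthetic, made-up correlations: linked pairs are shifted to the right.
rng = random.Random(0)
linked = [rng.gauss(0.5, 0.2) for _ in range(200)]
unlinked = [rng.gauss(0.0, 0.2) for _ in range(200)]
p_e_given_linked = gaussian_kde_1d(linked, bandwidth=0.1)
p_e_given_unlinked = gaussian_kde_1d(unlinked, bandwidth=0.1)

post_low = posterior_linked(-0.2, p_e_given_linked, p_e_given_unlinked, prior_linked=0.05)
post_high = posterior_linked(0.8, p_e_given_linked, p_e_given_unlinked, prior_linked=0.05)
print(post_low, post_high)
```

On data like these, the posterior at a high correlation should exceed the posterior at a low one, which is the monotone behavior the text describes.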
If we apply this approach to each candidate predictor in turn, we can obtain valuable information about the extent to which each evidence type recapitulates known functional linkages, or, more precisely, the efficiency with which each predictor classifies pairs of proteins into the “linked” or “unlinked” categories. Importantly, benchmarking each predictor in terms of its performance as a binary classifier provides a way to compare previously incomparable data sets, such as matrices [6] of BLAST [20] bit scores and arrays of Cy5/Cy3 ratios [3]. Even more importantly, it suggests that the problem of network integration can be viewed as a high dimensional binary classifier problem. By generalizing the approach outlined above to the case where E is a vector rather than a scalar, we can calculate the summary probability that two proteins are functionally linked given all the evidence at hand.
2.2 Training Set and Evidence Calculation
It is difficult to say a priori which predictors of functional association will be the best for a given organism. For example, microarray quality is known to vary widely, so coexpression correlations in different organisms are not directly comparable. Thus, to calibrate our interaction prediction algorithm, we require a training set of known interactions.

To generate this training set, we used one of three different genome scale annotations: the COG functional categories assigned by NCBI [21], the GO [22] annotations assigned by EBI’s GOA project [23], and the KEGG [24] metabolic annotations assigned to microbial genomes. In general, as we move from COG to GO to KEGG, the fraction of annotated proteins in a given organism decreases, but the annotation quality increases. In this work we used the KEGG annotation for all organisms other than Bacillus subtilis, for which we used GO as KEGG data was unavailable.
Fig. 1. Training Sets and Evidence (panel (b): Evidence vs. Training Set). (a) Genome-scale systematic annotations such as COG, GO or KEGG give functions for proteins X_i. As described in the text and shown on example data, we use this annotation to build an initial classification of protein pairs (X_i, X_j) with three categories: a relatively small set of likely linked (red) pairs and unlinked (blue) pairs, and a much larger set of uncertain (gray) pairs. (b) We observe that proteins which share an annotation category generally have more significant levels of evidence, as seen in the shifted distribution of linked (red) vs. unlinked (blue) pairs. Even subtle distributional differences contribute statistical resolution to our algorithm.

As shown in Figure 1a, for each pair we recorded (L = 1) if the proteins had overlapping annotations, (L = 0) if both were in entirely nonoverlapping categories, and (L = ?) if either protein lacked an annotation code or was marked as unknown. (For the GO training set, “overlapping” was defined as overlap of specific GO categories beyond the 8th level of the hierarchy.) This “matrix” approach (consider all proteins within an annotation category as linked) is in contrast to the “hub-spoke” approach (consider only proteins known to be directly in contact as linked) [25]. The former representation produces a nontrivial number of false positives, while the latter incurs a surfeit of false negatives. We chose the “matrix” based training set because our algorithm is robust to noise in the training set so long as enough data is present.
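
The three-way "matrix" labeling rule can be illustrated with a short Python sketch. The protein names, annotation codes, and the function name `label_pair` are hypothetical, and an empty annotation set stands in for "lacked an annotation code or was marked as unknown"; the GO-specific depth criterion is not modeled here.

```python
from itertools import combinations


def label_pair(ann_a, ann_b):
    """Return 1 (linked), 0 (unlinked) or None (unknown, written '?' in the text)."""
    if not ann_a or not ann_b:
        return None               # at least one protein has no usable annotation
    return 1 if ann_a & ann_b else 0  # any shared category counts as linked


# Hypothetical annotations: protein name -> set of category codes.
annotations = {
    "X1": {"pathA"},
    "X2": {"pathA", "pathB"},
    "X3": {"pathC"},
    "X4": set(),                  # unannotated protein
}

labels = {
    (a, b): label_pair(annotations[a], annotations[b])
    for a, b in combinations(sorted(annotations), 2)
}
for pair, value in labels.items():
    print(pair, "?" if value is None else value)
```

Because every pair inside a shared category is marked as linked, a rule like this tends to over-call linkage, which is the false-positive trade-off noted above.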
Note that we have used an annotation on individual proteins to produce a training set on pairs of proteins. In Figure 1b, we compare this training set to four functional genomic predictors: coexpression, coinheritance, coevolution, and colocation. We include details of the calculations of each evidence type in the Appendix. Interestingly, despite the fact that these methods were obtained from raw measurements as distinct as genomic spacing, BLAST bit scores, phylogenetic trees, and microarray traces, Figure 1b shows that each method is capable of distinguishing functionally linked pairs (L = 1) from unlinked pairs (L = 0).
2.3 Network Integration
For clarity, we first illustrate network integration with two evidence types (corresponding to two Euclidean dimensions) in C. crescentus, and then move to the N-dimensional case.

Fig. 2. 2D Network Integration in C. crescentus. (a) A scatterplot reveals that functionally linked pairs (red, L = 1) tend to have higher coexpression and coinheritance than pairs known to participate in separate pathways (blue, L = 0). (b) We build the conditional densities P(E1, E2 | L = 0) and P(E1, E2 | L = 1) through kernel density estimation. Note that the distribution for linked pairs is shifted to the upper right corner relative to the unlinked pair distribution. (c) We can visualize the classification process by concentrating on the decision boundary, corresponding to the upper right quadrant of the original plot. In the left panel, the scatterplot of pairs with unknown linkage status (gray) are the inputs for which we wish to calculate interaction probabilities. In the right panel, a heatmap for the posterior probability P(L = 1 | E1, E2) is depicted. This function yields the probability of linkage given an input evidence vector, and increases as we move to higher levels of coexpression and coinheritance in the upper right corner. (d) By conceptually superimposing each gray point upon the posterior, we can calculate the posterior probability that two proteins are functionally linked.
2D Network Integration. Consider the set of approximately 310,000 protein pairs in C. crescentus which have a KEGG-defined linkage of (L = 0) or (L = 1). Setting aside the 6.6 million pairs with (L = ?) for now, we find that P(L = 1) = .046 and P(L = 0) = .954 are the relative proportions of known linked and unlinked pairs in our training set.

Each of these pairs has an associated coexpression and coinheritance correlation, possibly with missing values, which we bundle into a two dimensional vector E = (E1, E2). Figure 2a shows a scatterplot of E1 vs. E2, where pairs with (L = 1) have been marked red and pairs with (L = 0) have been marked blue.

We see immediately that functionally linked pairs aggregate in the upper right corner of the plot, in the region of high coexpression and coinheritance. Crucially, the linked pairs (red) are more easily distinguished from the unlinked pairs (blue) in the 2-dimensional scatter plot than they are in the accompanying 1-dimensional marginals. To quantify the extent to which this is true, we begin by computing P(E1, E2 | L = 0) and P(E1, E2 | L = 1) via kernel density estimation [26, 27], as shown in Figure 2b. As we already know P(L), we can obtain the posterior by Bayes’ rule:

\[
P(L = 1 \mid E_1, E_2) = \frac{P(E_1, E_2 \mid L = 1)\, P(L = 1)}{P(E_1, E_2 \mid L = 1)\, P(L = 1) + P(E_1, E_2 \mid L = 0)\, P(L = 0)}
\]
In practice, this expression is quite sensitive to fluctuations in the denominator. To deal with this, we use M-fold bootstrap aggregation [28] to smooth the posterior. We find that M = 20 repetitions with resampling of 1000 elements from the (L = 0) and (L = 1) training sets is the empirical point of diminishing returns in terms of area under the receiver-operator characteristic (ROC), as detailed in Figure 4.
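
A hedged sketch of the bootstrap-smoothed 2D posterior follows. It assumes that aggregation means averaging the Bayes posterior over M resampled kernel density estimates, which the text does not spell out, and it uses a plain product-Gaussian kernel in pure Python rather than the dual-tree estimator cited in the paper. The function names, the bandwidth `h`, the synthetic data, and the small resample size `n` are all illustrative choices.

```python
import math
import random


def kde2d(points, h):
    """Isotropic Gaussian kernel density estimate over 2-D points."""
    norm = 1.0 / (len(points) * 2.0 * math.pi * h * h)

    def density(e1, e2):
        return norm * sum(
            math.exp(-((e1 - a) ** 2 + (e2 - b) ** 2) / (2.0 * h * h))
            for a, b in points
        )

    return density


def bagged_posterior(linked, unlinked, prior_linked, m=20, n=100, h=0.15, seed=0):
    """Average P(L=1 | E1, E2) over m bootstrap resamples of size n per class."""
    rng = random.Random(seed)
    models = []
    for _ in range(m):
        d1 = kde2d([rng.choice(linked) for _ in range(n)], h)    # P(E | L=1)
        d0 = kde2d([rng.choice(unlinked) for _ in range(n)], h)  # P(E | L=0)
        models.append((d1, d0))

    def posterior(e1, e2):
        total = 0.0
        for d1, d0 in models:
            num = d1(e1, e2) * prior_linked
            den = num + d0(e1, e2) * (1.0 - prior_linked)
            total += num / den if den > 0.0 else prior_linked
        return total / len(models)

    return posterior


# Synthetic evidence: linked pairs sit toward the upper right corner.
rng = random.Random(1)
linked = [(rng.gauss(0.6, 0.2), rng.gauss(0.6, 0.2)) for _ in range(300)]
unlinked = [(rng.gauss(0.0, 0.2), rng.gauss(0.0, 0.2)) for _ in range(300)]
posterior = bagged_posterior(linked, unlinked, prior_linked=0.046)
corner, origin = posterior(0.8, 0.8), posterior(0.0, 0.0)
print(corner, origin)
```

Averaging over resamples damps the effect of any single noisy denominator; the paper reports M = 20 and 1000 elements per resample, whereas `n` is kept small here only so the pure-Python example runs quickly.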
Given this posterior, we can now make use of the roughly 6.6 million pairs with (L = ?) which we put aside at the outset, as pictured in Figure 2c. Even though these pairs have unknown linkage, for most pairs the coexpression (E1) and coinheritance (E2) are known. For those pairs which have partially missing data (e.g. from corrupted spots on a microarray), we can simply evaluate over the non-missing elements of the E vector by using the appropriate marginal posterior P(L = 1 | E1) or P(L = 1 | E2). We can thus calculate P(L = 1 | E1, E2) for every pair of proteins in the proteome, as shown in Figure 2d. Each of the formerly gray pairs with (L = ?) is assigned a probability of interaction by this function; those with bright red values in Figure 2d are highly likely to be functionally linked.
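
The fallback to a marginal posterior for partially missing evidence can be sketched as a small dispatch function. Everything here is illustrative: `posterior_with_missing` is a hypothetical name, the toy posterior functions are stand-ins rather than fitted models, and returning the prior when every component is missing is an assumption the text does not address.

```python
def posterior_with_missing(evidence, joint, marginals, prior):
    """Evaluate P(L=1 | E) using only the non-missing entries of `evidence`.

    evidence  -- tuple of floats, with None marking a missing measurement
    joint     -- posterior over all components, used when nothing is missing
    marginals -- dict mapping a tuple of present indices to a posterior function
    prior     -- P(L=1), returned when every component is missing (assumption)
    """
    present = tuple(i for i, value in enumerate(evidence) if value is not None)
    if not present:
        return prior
    if len(present) == len(evidence):
        return joint(*evidence)
    return marginals[present](*(evidence[i] for i in present))


def clip(p):
    """Keep a toy probability inside [0, 1]."""
    return min(1.0, max(0.0, p))


# Hypothetical stand-in posteriors, chosen only to make the example runnable.
def toy_joint(e1, e2):
    return clip(0.5 * (e1 + e2))


toy_marginals = {
    (0,): lambda e1: clip(0.6 * e1),  # stands in for P(L=1 | E1)
    (1,): lambda e2: clip(0.4 * e2),  # stands in for P(L=1 | E2)
}

for e in [(0.8, 0.6), (0.5, None), (None, 0.5), (None, None)]:
    print(e, posterior_with_missing(e, toy_joint, toy_marginals, prior=0.046))
```

Keying the marginals by the tuple of present indices lets the same dispatch carry over to more than two evidence types.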
In general, we also calculate P(L = 1 | E1, E2) on the training data, as we know that the “matrix” approach to training set generation produces copious but noisy data. The result of this evaluation is the probability of interaction for every protein pair.
N-dimensional Network Integration The 2 dimensional example in C
cres-centus immediately generalizes to N-dimensional network integration in an
arbi-trary species, though the results cannot be easily visualized beyond 3 dimensions
Figure 3 shows the results of calculating a 3D posterior in C crescentus from
co-expression, coinheritance, and colocation data, where we have once again applied
M -fold bootstrap aggregation.
We see that different evidence types interact in nonobvious ways. For example, we note that high levels of colocation ($E_2$) can compensate for low levels of coexpression ($E_1$), as indicated by the “bump” in the posterior of Figure 3c. Biologically speaking, this means that a nontrivial number of C. crescentus proteins with shared function are frequently colocated yet not strongly coexpressed. This is exactly the sort of subtle statistical dependence between predictors that is crucial for proper classification. In fact, a theoretically attractive property of
Fig. 3. 3D Network Integration in C. crescentus. (a)-(b) We show level sets of each density spaced at even volumetric increments, so that the innermost shell encloses 20% of the volume, the second shell encloses 40%, and so forth. As in the 2D case, the 3D density $P(E \mid L = 1)$ is shifted to the upper right corner. (c) For the posterior, we show level sets spaced at probability deciles, such that a pair which makes it past the upper right shell has $P(L = 1 \mid E) \in [0.9, 1]$, a pair which lands in between the upper two shells satisfies $P(L = 1 \mid E) \in [0.8, 0.9]$, and so on.
our approach is that the use of the conditional joint posterior produces the minimum possible classification error (specifically, the Bayes error rate [29]), while bootstrap aggregation protects us against overfitting [30].
Until recently, though, technical obstacles made it challenging to efficiently compute joint densities beyond dimension 3. Recent developments [26] in efficient kernel density estimation have obviated this difficulty and have made it possible to evaluate high dimensional densities over millions of points in a reasonable amount of time within user-specifiable tolerance levels. As an example of the calculation necessary for network integration, consider a 4-dimensional kernel density estimate built from 1000 sample points. Ihler’s implementation [27] of the Gray-Moore dual-tree algorithm [26] allowed the evaluation of this density at the $\binom{3737}{2} \approx 7{,}000{,}000$ pairs in the C. crescentus proteome in only 21 minutes on a 3GHz Xeon with 2GB RAM. Even after accounting for the $2M$ multiple of this running time caused by evaluating a quotient of two densities and using $M$-fold bootstrap aggregation, the resulting joint conditional posterior can be built and evaluated rapidly enough to render approximation unnecessary.
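The arithmetic behind these figures is easy to check. The short sketch below recomputes the number of protein pairs and the $2M$ scaling of the running time; the value $M = 10$ is an assumption chosen only for illustration, since the text does not fix $M$ here.

```python
import math

n_proteins = 3737                     # proteome size quoted in the text
n_pairs = math.comb(n_proteins, 2)    # unordered protein pairs
minutes_single = 21                   # one density evaluation, from the text

def total_minutes(M):
    """Two densities (L=1 and L=0) per bootstrap replicate, M replicates."""
    return 2 * M * minutes_single

# M = 10 is an assumed value, not taken from the paper.
hours_for_10 = total_minutes(10) / 60
```

Here `n_pairs` evaluates to 6,980,716, which rounds to the quoted figure of roughly 7 million.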
Binary Classifier Perspective. By formulating the network integration problem as a binary classifier (Figure 4), we can quantify the extent to which the integration of multiple evidence sources improves prediction accuracy over a single source. As our training data is necessarily a rough approximation of the true interaction network, these measures are likely to be conservative estimates of classifier performance.
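For readers who want to reproduce this kind of comparison, the area under the ROC curve can be computed directly as the probability that a randomly chosen positive pair outscores a randomly chosen negative pair. The sketch below uses that rank-based definition; the scores and labels are invented for illustration and are not values from this study.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count one half); this equals the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
single = [0.9, 0.4, 0.5, 0.6, 0.3, 0.2]     # assumed scores, one evidence type
combined = [0.9, 0.7, 0.5, 0.8, 0.3, 0.2]   # assumed scores, integrated posterior

auc_single = auroc(single, labels)
auc_combined = auroc(combined, labels)
```

The pairwise loop is quadratic in the number of pairs, so a sort-based variant would be preferable at proteome scale.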
[Figure 4, plot panels: ROC curves titled “Network Integration Boosts ROC Performance” (x-axis: False Positive Rate), with legend CV: AUROC=.572; CV+CL: AUROC=.638; CV+CL+CX: AUROC=.678; Naive Bayes: AUROC=.675; CV+CL+CX+CI: AUROC=.711. Panel (c): Precision/Recall Curves.]
Fig. 4. Network Integration as Binary Classifier. (a) We regard the network integration problem as a binary classifier in a high dimensional feature space. The input features are a set of evidences associated with a protein pair $(A, B)$, and the output is the probability that a pair is assigned to the $(L = 1)$ category. (b) The area under the receiver operator characteristic (AUROC) is a standard measure [29] of binary classifier performance, shown here for several different ways of doing C. crescentus network integration. Here we have labeled data types as CV (coevolution), CL (colocation), CX (coexpression), and CI (coinheritance) and shown a successive series of curves for the integration of 1, 2, 3, and finally 4 evidence types. Classifier performance increases monotonically as more data sets are combined. Importantly, the true four dimensional joint posterior $P(L = 1 \mid CV, CL, CX, CI)$ outperforms the Naive Bayes approximation of the posterior, where the conditional density $P(CV, CL, CX, CI \mid L = 1)$ is approximated by $P(CV \mid L = 1) P(CL \mid L = 1) P(CX \mid L = 1) P(CI \mid L = 1)$, and similarly for $L = 0$. For clarity we have omitted the individual curves for the CL (AUROC=.612), CX (AUROC=.619), and CV (AUROC=.653) metrics. Again, it is clear that the integrated posterior outperforms each of these univariate predictors. (c) Precision/recall curves are an alternate way of visualizing classifier performance, and are useful when the number of true positives is scarce relative to the number of false negatives. Again the integrated posterior outperforms the Naive Bayes approximation as a classifier. Note that since the “negative” pairs from the KEGG training set are based on the supposition that two proteins which have no annotational overlap genuinely do not share a pathway, they are a noisier indicator than the “positive” pairs. That is, with respect to functional interaction, absence of evidence is not always evidence of absence. Hence the computed values for precision are likely to be conservative underestimates of the true values.
3.1 Global Network Architecture
Applying the posterior $P(L = 1 \mid E)$ to every pair of proteins in a genome gives the probability that each pair is functionally linked. If we simply threshold this result at $P(L = 1 \mid E) > 0.5$, we will retain only those linkages which are more probable than not. This decision rule attains the Bayes error rate [29] and minimizes the misclassification probability. We applied our algorithm with this threshold to build 4D integrated networks for the 11 microbes and four evidence types listed in the Appendix. Figure 5 shows the global protein interaction networks produced for three of these microbes, where we have retained only those edges with $P(L = 1 \mid E) > 0.5$.
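The thresholding step amounts to filtering a table of pairwise posteriors into an adjacency structure. The following is an illustrative sketch only: the protein names and probabilities are placeholders, and `build_network` is a hypothetical helper, not part of any released tool.

```python
from collections import defaultdict

def build_network(posteriors, threshold=0.5):
    """Keep pairs whose posterior P(L=1 | E) exceeds the threshold and
    return an undirected adjacency map."""
    graph = defaultdict(set)
    for (a, b), p in posteriors.items():
        if p > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Placeholder proteins and assumed posterior values.
posteriors = {("pA", "pB"): 0.93, ("pA", "pC"): 0.56,
              ("pB", "pC"): 0.12, ("pA", "pD"): 0.41}

strict = build_network(posteriors)                  # P > 0.5
relaxed = build_network(posteriors, threshold=0.4)  # user-chosen, looser cutoff
```

Lowering the threshold trades specificity for sensitivity, which is the tradeoff a user-specified cutoff exposes.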
To facilitate use of these protein interaction networks, we built an interactive netbrowser, viewable at http://jinome.stanford.edu/netbrowser. As a threshold of $P(L = 1 \mid E) > 0.5$ tends to be somewhat stringent in practice, we allow dynamic, user-specified thresholds to produce module-specific tradeoffs between specificity and sensitivity in addition to a host of other customization options.
Fig. 5. Global visualization of integrated networks for Escherichia coli K12, Helicobacter pylori 26695, and Caulobacter crescentus. Only linkages with $P(L = 1 \mid E_1, E_2, E_3, E_4) > 0.5$ are displayed.
3.2 Campylobacter jejuni: N-Linked Protein Glycosylation
N-linked protein glycosylation is one of the most frequent post-translational modifications applied to eukaryotic secretory proteins. Until recently [31] this process was thought to be absent from most microbes, but recent work [32] has shown that an operational N-linked glycosylation system does exist in C. jejuni. As the entire glycosylation apparatus can be successfully transplanted to E. coli K12, this system is of much biotechnological interest [33].
Figure 6a shows the results of examining the integrated network for C. jejuni in the vicinity of Cj1124c, one of the proteins in the glycosylation system. In addition to the reassuring recapitulation of several transferases and epimerases experimentally linked to this process [33], we note four proteins which are to our knowledge not known to be implicated in N-linked glycosylation (Cj1518,
Cj0881c, Cj0156c, Cj0128c). Importantly, all of these heretofore uncharacterized linkages would have been missed if only univariate posteriors had been examined, as they would be significantly below our cutoff of $P(L = 1 \mid E) > 0.5$. As this system is still poorly understood, yet of substantial biotechnological and pathogenic [34] relevance, investigation of these new proteins may be of interest.
3.3 Caulobacter crescentus: Bacterial Actin and the Sec Apparatus
Van den Ent’s [36] discovery that the ubiquitous microbial protein MreB was a structural homolog to actin spurred a burst of interest [37, 38, 39] in the biology of the bacterial cytoskeleton. Perhaps the most visually arresting of these recent findings is the revelation that MreB supports the cell by forming a tight spiral [37]. Yet many outstanding questions in this field remain, and prime among them is the issue of which proteins communicate with the bacterial cytoskeletal apparatus [40].
Figure 6b shows the proteins from the C. crescentus integrated network which have a 50% chance or greater of interacting with MreB, also known as CC1543. As a baseline measure of validity, we once again observe that known interaction partners such as RodA (CC1547) and MreC (CC1544) are recovered by network integration. More interesting, however, is the subtle interaction between MreB and the preprotein translocase CC3206, an interaction that would be missed if data sources were used separately. This protein is a subunit of the Sec machinery,
[Figure 6 panel title: (b) C. crescentus: Bacterial cytoskeleton]
Fig. 6. Case Studies. (a) Network integration detects new proteins linked to glycosylation in Campylobacter jejuni NCTC 11168. High probability linkages are labeled in red and generally recapitulate known interactions, while moderately likely linkages are colored gray. Moderate linkages are generally not found by any univariate method in isolation, and represent the new biological insight produced by data integration. (b) In Caulobacter crescentus, data integration reveals that the Sec apparatus is linked to MreB, a prediction recently confirmed by experiment [35]. Again, moderate linkages revealed by data integration lead us to a conclusion that would be missed if univariate data were used.
and like MreB is an ancient component of the bacterial cell [41]. Its link to MreB is of particular note because recent findings [35] have shown that the Sec apparatus, like MreB, has a spiral localization pattern. While seemingly counterintuitive, it seems likely from both this finding and other work [42] that the export of cytoskeleton-related proteins beyond the cellular membrane is important in the process of cell division. We believe that investigation of the hypothetical proteins linked to both MreB and Sec by our algorithm may shed light on this question.
4.1 Merits of Our Approach
While a number of recent papers on network integration in S. cerevisiae have appeared, we believe that our method is an improvement over existing algorithms.
First, by directly calculating the joint conditional posterior we require no simplifying assumptions about statistical dependence and need no complex parametric inference. In particular, removing the Naive Bayes approximation results in a better classifier, as quantified in Figure 4. Second, our use of the Gray-Moore dual tree algorithm means that our method is arbitrarily scalable in terms of both the number of evidence types and the number of protein pairs. Third, our method allows immediate visual identification of dependent or corrupted functional genomic data in terms of red/blue separation scatterplots, an important consideration given the noise of some data types [43]. Finally, because the output of our algorithm is a rigorously derived set of interaction probabilities, it represents a solid foundation for future work.
4.2 Conclusion and Future Directions
Our general framework presents much room for future development. It is straightforward to generalize our algorithm to apply to discrete, ordinal, or categorical data sets as long as appropriate similarity measures are defined. As our method readily scales beyond a few thousand proteins, even the largest eukaryotic genomes are potential application domains. It may also be possible to improve our inference algorithm through the use of statistical techniques designed to deal with missing data [44].
Moving beyond a binary classifier would allow us to predict different kinds of functional linkage, as two proteins in the same multiprotein complex have a different kind of linkage than two proteins which are members of the same regulon. This would be significant in that it addresses one of the most widely voiced criticisms of functional genomics, which is that linkage predictions are “one-size-fits-all”. It may also be useful to move beyond symmetric pairwise measures of association to use metrics defined on protein triplets [8] or asymmetric metrics such that $E(P_i, P_j) \neq E(P_j, P_i)$.
While these details of the network construction process are doubtless subjects for future research, perhaps the most interesting prospect raised by the availability of a large number of robust, integrated interaction networks is the possibility
of comparative modular biology. Specifically, we would like to align subgraphs of interaction networks on the basis of conserved interaction as well as conserved sequence, just as we align DNA and protein sequences. A need now exists for a network alignment algorithm capable of scaling to large datasets and comparing many species simultaneously.
Acknowledgments
We thank Lucy Shapiro, Roy Welch, and Arend Sidow for helpful discussions. BSS was supported in part by a DoD/NDSEG graduate fellowship, and HHM and BSS were supported by NIH grant 1 R24 GM073011-01 and DOE Office of Science grant DE-FG02-01ER63219. JAF was supported in part by a Stanford Graduate Fellowship, and SB, AFN, and JAF were funded by NSF grant EF-0312459, NIH grant UO1-HG003162, the NSF CAREER Award, and the Alfred P. Sloan Fellowship.
Authors’ Contributions
BSS developed the network integration algorithm and wrote the paper. AFN designed the web interface with JAF under the direction of SB and provided useful feedback on network quality. HHM and SB provided helpful comments and a nurturing environment.
References
1. Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G.D., Maltsev, N.: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A 96 (1999) 2896–2901
2. McAdams, H.H., Srinivasan, B., Arkin, A.P.: The evolution of genetic regulatory systems in bacteria. Nat Rev Genet 5 (2004) 169–178
3. Schena, M., Shalon, D., Davis, R.W., Brown, P.O.: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270 (1995) 467–470
4. Enright, A.J., Iliopoulos, I., Kyrpides, N.C., Ouzounis, C.A.: Protein interaction maps for complete genomes based on gene fusion events. Nature 402 (1999) 86–90
5. Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., Yeates, T.O.: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 96 (1999) 4285–4288
6. Srinivasan, B.S., Caberoy, N.B., Suen, G., Taylor, R.G., Shah, R., Tengra, F., Goldman, B.S., Garza, A.G., Welch, R.D.: Functional genome annotation through phylogenomic mapping. Nat Biotechnol 23 (2005) 691–698
7. Yu, H., Luscombe, N.M., Lu, H.X., Zhu, X., Xia, Y., Han, J.D.J., Bertin, N., Chung, S., Vidal, M., Gerstein, M.: Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res 14 (2004) 1107–1118
8. Bowers, P.M., Cokus, S.J., Eisenberg, D., Yeates, T.O.: Use of logic relationships to decipher protein network organization. Science 306 (2004) 2246–2249
9. Pazos, F., Valencia, A.: Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng 14 (2001) 609–614 Evaluation Studies.
10. Gerstein, M., Lan, N., Jansen, R.: Proteomics. Integrating interactomes. Science
14. Lee, I., Date, S.V., Adai, A.T., Marcotte, E.M.: A probabilistic functional network of yeast genes. Science 306 (2004) 1555–1558
15. Tanay, A., Sharan, R., Kupiec, M., Shamir, R.: Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci U S A 101 (2004) 2981–2986
16. Wong, S.L., Zhang, L.V., Tong, A.H.Y., Li, Z., Goldberg, D.S., King, O.D., Lesage, G., Vidal, M., Andrews, B., Bussey, H., Boone, C., Roth, F.P.: Combining biological networks to predict genetic interactions. Proc Natl Acad Sci U S A 101 (2004) 15682–15687
17. Lu, L.J., Xia, Y., Paccanaro, A., Yu, H., Gerstein, M.: Assessing the limits of genomic data integration for predicting protein networks. Genome Res 15 (2005) 945–953
18. Friedman, A., Perrimon, N.: Genome-wide high-throughput screens in functional genomics. Curr Opin Genet Dev 14 (2004) 470–476
19. Hartwell, L.H., Hopfield, J.J., Leibler, S., Murray, A.W.: From molecular to modular cell biology. Nature 402 (1999) 47–52
20. Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V., Altschul, S.F.: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29 (2001) 2994–3005
21. Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., Rao, B.S., Smirnov, S., Sverdlov, A.V., Vasudevan, S., Wolf, Y.I., Yin, J.J., Natale, D.A.: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4 (2003) 41
22. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25 (2000) 25–29
23. Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., Apweiler, R.: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 32 (2004) 262–266
24. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M.: The KEGG resource for deciphering the genome. Nucleic Acids Res 32 (2004) 277–280
25. Bader, G.D., Hogue, C.W.V.: Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol 20 (2002) 991–997
26. Gray, A.G., Moore, A.W.: ‘n-body’ problems in statistical learning. In: NIPS. (2000) 521–527
27. Ihler, A., Sudderth, E., Freeman, W., Willsky, A.: Efficient multiscale sampling from products of gaussian mixtures. In: NIPS. (2003)
28. Breiman, L.: Bagging predictors. Machine Learning 24 (1996) 123–140
29. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley-Interscience Publication (2000)
30. Bauer, E., Kohavi, R.: An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning 36 (1999) 105–139
31. Szymanski, C.M., Logan, S.M., Linton, D., Wren, B.W.: Campylobacter–a tale of two protein glycosylation systems. Trends Microbiol 11 (2003) 233–238
32. Wacker, M., Linton, D., Hitchen, P.G., Nita-Lazar, M., Haslam, S.M., North, S.J., Panico, M., Morris, H.R., Dell, A., Wren, B.W., Aebi, M.: N-linked glycosylation in Campylobacter jejuni and its functional transfer into E. coli. Science 298
150 (2004) 1957–1964
35. Campo, N., Tjalsma, H., Buist, G., Stepniak, D., Meijer, M., Veenhuis, M., Westermann, M., Muller, J.P., Bron, S., Kok, J., Kuipers, O.P., Jongbloed, J.D.H.: Subcellular sites for bacterial protein export. Mol Microbiol 53 (2004) 1583–1599
36. van den Ent, F., Amos, L.A., Lowe, J.: Prokaryotic origin of the actin cytoskeleton. Nature 413 (2001) 39–44
37. Gitai, Z., Dye, N., Shapiro, L.: An actin-like gene can determine cell polarity in bacteria. Proc Natl Acad Sci U S A 101 (2004) 8643–8648
38. Kurner, J., Frangakis, A.S., Baumeister, W.: Cryo-electron tomography reveals the cytoskeletal structure of Spiroplasma melliferum. Science 307 (2005) 436–438
39. Gerdes, K., Moller-Jensen, J., Ebersbach, G., Kruse, T., Nordstrom, K.: Bacterial mitotic machineries. Cell 116 (2004) 359–366
40. Cabeen, M.T., Jacobs-Wagner, C.: Bacterial cell shape. Nat Rev Microbiol 3 (2005) 601–610
41. Vrontou, E., Economou, A.: Structure and function of SecA, the preprotein translocase nanomotor. Biochim Biophys Acta 1694 (2004) 67–80
42. Kruse, T., Bork-Jensen, J., Gerdes, K.: The morphogenetic MreBCD proteins of Escherichia coli form an essential membrane-bound complex. Mol Microbiol 55 (2005) 78–89
43. Vidalain, P.O., Boxem, M., Ge, H., Li, S., Vidal, M.: Increasing specificity in high-throughput yeast two-hybrid experiments. Methods 32 (2004) 363–370
44. McLachlan, G., Krishnan, T.: The EM Algorithm and Extensions. John Wiley and Sons (1996)
Hypergraph Model of Multi-residue Interactions in Proteins: Sequentially-Constrained Partitioning Algorithms for Optimization of Site-Directed Protein Recombination

Xiaoduan Ye¹, Alan M. Friedman², and Chris Bailey-Kellogg¹

¹ Department of Computer Science, Dartmouth College, 6211 Sudikoff Laboratory, Hanover NH 03755, USA
{ye, cbk}@cs.dartmouth.edu
² Department of Biological Sciences and Purdue Cancer Center, Purdue University, West Lafayette, IN 47907, USA
afried@purdue.edu
Abstract. Relationships among amino acids determine stability and function and are also constrained by evolutionary history. We develop a probabilistic hypergraph model of residue relationships that generalizes traditional pairwise contact potentials to account for the statistics of multi-residue interactions. Using this model, we detected non-random associations in protein families and in the protein database. We also use this model in optimizing site-directed recombination experiments to preserve significant interactions and thereby increase the frequency of generating useful recombinants. We formulate the optimization as a sequentially-constrained hypergraph partitioning problem; the quality of recombinant libraries with respect to a set of breakpoints is characterized by the total perturbation to edge weights. We prove this problem to be NP-hard in general, but develop exact and heuristic polynomial-time algorithms for a number of important cases. Application to the beta-lactamase family demonstrates the utility of our algorithms in planning site-directed recombination.
The non-random association of amino acids, as expressed in pairwise potentials, has been usefully applied in a number of situations. Such pairwise contact potentials [1, 2] play a large role in evaluating quality of models in protein structure prediction [3, 4, 5, 6]. It has been suggested, however, that “it is unlikely that purely pairwise potentials are sufficient for structure prediction” [7, 8].
To better model evolutionary relationships that determine protein stability and functionality, it may be necessary to capture the higher-order interactions that are ignored in simple pairwise models (Fig. 1(a)). Researchers have begun to demonstrate the importance of accounting for higher-order terms. A statistical pseudo-potential based on four-body nearest neighbor interactions (as determined by Delaunay tessellations) has successfully predicted changes in free energy caused by hydrophobic core mutations [8]. Similar formulations have been used to discriminate native from non-native protein conformations [9]. Geometrically less restricted higher-order interactions have also been utilized for recognition of native-like protein structures [10]. Recent work on correlated mutation analysis has moved from identifying pairwise correlations [11] to determining clusters or cliques of mutually-dependent residues that identify subclasses within a protein family and provide mechanistic insights into function [12, 13].

Fig. 1. Hypergraph model of evolutionary interactions, and effects of site-directed protein recombination. (a) Higher-order evolutionary interactions (here, order-3) determining protein stability and function are observed in the statistics of “hyperconservation” of mutually interacting positions. The left edge is dominated by Ala,Val,Ile and Val,Leu,Leu interactions, while the right is dominated by Glu,Thr,Arg and Asp,Ser,Lys ones. The interactions are modeled as edges in a hypergraph with weights evaluating the degree of hyperconservation of an interaction, both generally in the protein database and specific to a particular family. (b) Site-directed recombination mixes and matches sequential fragments of homologous parents to construct a library of hybrids with the same basic structure but somewhat different sequences and thus different functions. (c) Site-directed recombination perturbs edges that cross one or more breakpoints. The difference in edge weights derived for the parents and those derived for the hybrids indicates the effect of the perturbation on maintenance of evolutionarily favorable interactions.

A. Apostolico et al. (Eds.): RECOMB 2006, LNBI 3909, pp. 15–29, 2006.
© Springer-Verlag Berlin Heidelberg 2006
This paper develops a rigorous basis for representing multi-order interactions within a protein family. We generalize the traditional representations of sequence information in terms of single-position conservation and structural interactions in terms of pairwise contacts. Instead, we define a hypergraph model in which edges represent pairwise and higher-order residue interactions, while edge weights represent the degree of “hyperconservation” of the interacting residues (Sec. 2). Hyperconservation can reveal significant residue interactions both within members of the family (arising from structural and functional constraints) and generally common to all proteins (arising from general properties of the amino acids). We then combine family-specific and database-wide statistics with suitable weighting (Sec. 2.1), ensure non-redundancy of the information in super- and sub-edges with a multi-order potential score (Sec. 2.2), and derive edge weights by mean potential scores (Sec. 2.3). Application of our approach to beta-lactamases (Sec. 4) shows that the effect of non-redundant higher-order terms is significant and can be effectively handled by our model.
Protein recombination in vitro (Fig. 1(b)) enables the design of protein variants with favorable properties and novel enzymatic activities, as well as the exploration of protein sequence-structure-function relationships (see e.g. [14, 15, 16, 17, 18, 19, 20, 21, 22]). In this approach, libraries of hybrid proteins are generated either by stochastic enzymatic reactions or intentional selection of breakpoints. Hybrids with unusual properties can either be identified by large-scale genetic screening and selection, or many hybrids can be evaluated individually to determine detailed sequence-function relationships for understanding and/or rational engineering. We focus here on site-directed recombination, in which parent genes are recombined at specified breakpoint locations, yielding hybrids in which different sequence fragments (between the breakpoints) can come from different parents. Both screening/selection and investigational experiments benefit from recombination that preserves the most essential structural and functional features while still allowing variation. In order to enhance the success of this approach, it is necessary to choose breakpoint locations that optimize preservation of these features.
The labs of Mayo and Arnold [18, 23] have established criteria for disruption of contacting residue pairs and demonstrated the relationship between non-disruption and functional hybrids [18]. There is an on-going search for algorithms to select breakpoints for recombination based on non-disruption [23, 24], although none has yet been experimentally validated. Optimizing multi-order interactions after recombination (Fig. 1(c)) should help identify the best recombinants and thus the best locations for breakpoints. In support of this optimization, we develop criteria to evaluate the quality of hybrid libraries by considering the effects of recombination on edge weights (Sec. 2.4). We then formulate the optimal selection of breakpoint locations as a sequentially-constrained hypergraph partitioning problem (Sec. 3), prove it to be NP-hard in general (Sec. 3.1), develop exact and heuristic algorithms for a number of important cases (Secs. 3.2–3.5), and demonstrate their practical effectiveness in design of recombination experiments for members of the beta-lactamase family (Sec. 4).
In order to more completely model statistical interactions in a protein, it is necessary to move beyond single-position sequence conservation and pairwise structural contact. We model a protein and its reference structure with a weighted hypergraph $G = (V, E, w)$, where vertices $V = \{v_1, v_2, \cdots, v_{|V|}\}$ represent residue positions in sequential order on the backbone, edges $E \subseteq 2^V$ represent mutually interacting sets of vertices, and weight function $w : E \to \mathbb{R}$ represents the relative significance of edges. We construct an order-$c$ edge $e = \langle v_1, v_2, \cdots, v_c \rangle$ for each set of residues (listed in sequential order for convenience) that are in mutual contact; this construction can readily be extended to capture other forms of interaction, e.g., long-range interaction of non-contacting residues due to electrostatics. Note that subsets of vertices associated with a higher-order edge form lower-order edges. When we need to specify the exact order $c$ of edges in a hypergraph, we use
notation $G_c = (V, E_c, w)$. Since lower-order edges can be regarded as a special kind of higher-order ones, $G_c$ includes “virtual” lower-order edges.
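The edge construction can be illustrated with a short sketch. It assumes that the sets of mutually contacting residue positions are already known (the toy `contacts` below are invented) and that $G_c$ holds every edge of order 1 through $c$; `build_hypergraph` is a hypothetical helper name.

```python
from itertools import combinations

def build_hypergraph(contact_sets, c):
    """Edges of G_c: every subset of size 1..c of each mutually contacting
    set, stored as position tuples in sequential (sorted) order."""
    edges = set()
    for contacts in contact_sets:
        ordered = sorted(contacts)
        for k in range(1, min(c, len(ordered)) + 1):
            edges.update(combinations(ordered, k))
    return edges

# Assumed toy input: two sets of mutually contacting residue positions.
contacts = [{3, 17, 42}, {17, 58}]
g2 = build_hypergraph(contacts, c=2)   # vertices and pairwise edges
g3 = build_hypergraph(contacts, c=3)   # adds the order-3 edge (3, 17, 42)
```

Storing edges as sorted tuples keeps the sequential order that the breakpoint analysis later relies on.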
The definition of the edge weight is key to effective use of the hypergraph model. In the case where the protein is a member of a family with presumed similar structures, edge weights can be evaluated both from the general database and specific to the family. There are many observed residue values (across the family or database) for the vertices of any given edge. We thus build up to an edge weight by first estimating the probability of the residue values, then decomposing the probability to ensure non-redundant information among multi-order edges for the same positions. Finally we determine the effect on the pattern of these values due to recombination according to a set of chosen breakpoint locations.
2.1 Distribution of Hyperresidues in Database and Family
Let $R = \langle r_1, r_2, \cdots, r_c \rangle$ be a “hyperresidue,” a $c$-tuple of amino acid types (e.g., $\langle$Ala, Val, Ile$\rangle$). Intuitively speaking, the more frequently a particular hyperresidue occurs in functional proteins, the more important it is expected to be for their folding and function. We can estimate the overall probability $p$ of hyperresidues from their frequencies in the database $\mathcal{D}$ of protein sequences and corresponding structures:
$$p(R) = \frac{\#R \text{ in } \mathcal{D}}{|\mathcal{D}|}, \qquad (1)$$
where $|\mathcal{D}|$ represents the number of tuple instances in the database. When considering a specific protein family $\mathcal{F}$ with a multiple sequence alignment and shared structure, we can estimate the position-specific (i.e., for edge $e$) probability
$$p_e(R) = \frac{\#R \text{ at } e \text{ in } \mathcal{F}}{|\mathcal{F}|}. \qquad (2)$$
[...]
$$q_e(R) = \omega_1 \cdot p(R) + \omega_2 \cdot p_e(R), \qquad (3)$$
but employing weights suitable for our problem:
$$\omega_1 = 1/(1 + |\mathcal{F}|\rho) \quad \text{and} \quad \omega_2 = 1 - \omega_1, \qquad (4)$$
where $\rho$ is a user-specified parameter that determines the relative contributions of database and family. Note that when $\rho = 0$, $q_e(R) = p(R)$ and the family-specific information is ignored; whereas when $\rho = \infty$, $q_e(R) = p_e(R)$ and the database information is ignored. Using a suitable value of $\rho$, we will obtain a probability distribution that is close to the overall database distribution for a small family but approximates the family distribution for a large one.
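The behavior of the weights in Eqs. 3 and 4 is easy to explore numerically. The sketch below implements the two equations as written; the frequencies, family sizes, and values of $\rho$ are assumed purely for illustration, and `combined_probability` is a hypothetical name.

```python
def combined_probability(p_db, p_fam, family_size, rho):
    """q_e(R) = w1 * p(R) + w2 * p_e(R), with w1 = 1 / (1 + |F| * rho)
    and w2 = 1 - w1 (Eqs. 3 and 4)."""
    w1 = 1.0 / (1.0 + family_size * rho)
    w2 = 1.0 - w1
    return w1 * p_db + w2 * p_fam

# Assumed example values: database frequency 0.02, family frequency 0.30.
q_ignore_family = combined_probability(0.02, 0.30, family_size=10, rho=0.0)
q_small = combined_probability(0.02, 0.30, family_size=10, rho=0.01)
q_large = combined_probability(0.02, 0.30, family_size=1000, rho=0.01)
```

For a fixed $\rho$, the estimate moves from the database value toward the family value as the family grows, which is the intended behavior.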
2.2 Multi-order Potential Score for Hyperresidues
Since we have multi-order edges, with lower-order subsets included alongside their higher-order supersets, we must ensure that these edges are not redundant. In other words, a higher-order edge should only include information not captured by its lower-order constituents. The inclusion-exclusion principle ensures non-redundancy in a probability expansion, as Simons et al. [10] demonstrated in the case of protein structure prediction. We define an analogous multi-order potential score for hyperresidues at edges of orders 1, 2, and 3, respectively, as follows:
[...] be defined similarly. The potential score of a higher-order hyperresidue contains no information redundant with that of its lower-order constituents.
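The non-redundancy property can be illustrated with a generic inclusion-exclusion score. This is a sketch under an explicit assumption, not a transcription of the definitions above: it takes the score of a hyperresidue to be the log of its probability minus the scores of all of its proper sub-tuples. The probabilities and the name `potential` are likewise invented for illustration.

```python
import math
from itertools import combinations

def potential(q, hyper):
    """Assumed inclusion-exclusion score: log q(R) minus the scores of every
    proper non-empty sub-tuple, so sub-edge information is not repeated."""
    score = math.log(q[hyper])
    for k in range(1, len(hyper)):
        for sub in combinations(hyper, k):
            score -= potential(q, sub)
    return score

# Hyperresidues are keyed by (position, residue) pairs; probabilities assumed.
a, v = (1, "Ala"), (2, "Val")
q_indep = {(a,): 0.5, (v,): 0.4, (a, v): 0.2}   # 0.2 = 0.5 * 0.4
q_assoc = {(a,): 0.5, (v,): 0.4, (a, v): 0.3}   # pair is over-represented

phi_indep = potential(q_indep, (a, v))
phi_assoc = potential(q_assoc, (a, v))
```

Under this assumed form, a pair whose joint probability equals the product of its single-position probabilities scores zero, so only association beyond the lower-order terms contributes.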
2.4 Edge Weights for Recombination
A particular form of edge weights serves as a guide for breakpoint selection in site-directed recombination. Suppose a set S ⊆ F of parents is to be recombined at a set X = {x_1, x_2, · · · , x_n} of breakpoints, where x_t = v_i indicates that breakpoint x_t is between residues v_i and v_{i+1}. We can view recombination as a two-step process: decomposing followed by recombining. In the decomposing step, each protein sequence is partitioned into n + 1 intervals according to the breakpoints, and the hypergraph is partitioned into n + 1 disjoint subgraphs by removing all edges spanning a breakpoint. The impact of this decomposition can be individually assessed for each edge, using Eq. 8 for the parents S.
In the recombining step, edges removed in the decomposing step are reconstructed with new sets of hyperresidues according to all combinations of parent fragments. The impact of this reconstruction can also be individually assessed for each edge, yielding a breakpoint-specific weight:

\[ w(e, X) = \sum_R \frac{\#R \text{ at } e \text{ in } L}{|L|} \cdot \phi_e(R) , \tag{9} \]

where φ_e(R) denotes the potential score of hyperresidue R at edge e and L the hybrid library. In this case, the potential score of hyperresidue R is weighted by the amount of its representation in the library L. Note that we need not actually enumerate the set of hybrids (which can be combinatorially large) in order to determine the weight, as the frequencies of the residues at the positions are sufficient to compute the frequencies of the hyperresidues.
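The observation that hybrids need not be enumerated can be sketched as follows. The code assumes aligned, equal-length parent strings, 0-based positions, and the convention that breakpoint b cuts between positions b and b + 1; the function names are hypothetical. Because each fragment of a hybrid is inherited independently from one parent, the library frequency of a hyperresidue is a product of per-fragment parent frequencies.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import product

def library_frequency(parents, breakpoints, edge, hyper):
    """Frequency of hyperresidue `hyper` at `edge` in the hybrid library,
    computed from parent frequencies only (no hybrid enumeration)."""
    groups = defaultdict(list)
    for pos, res in zip(edge, hyper):
        # Fragment index = number of breakpoints strictly left of pos.
        groups[bisect_left(breakpoints, pos)].append((pos, res))
    freq = 1.0
    for members in groups.values():
        matching = sum(all(p[pos] == res for pos, res in members)
                       for p in parents)
        freq *= matching / len(parents)
    return freq

def enumerated_frequency(parents, breakpoints, edge, hyper):
    """Same quantity by brute-force enumeration; for checking only."""
    cuts = [0] + [b + 1 for b in breakpoints] + [len(parents[0])]
    count = total = 0
    for choice in product(parents, repeat=len(cuts) - 1):
        hybrid = "".join(p[cuts[k]:cuts[k + 1]] for k, p in enumerate(choice))
        total += 1
        count += all(hybrid[pos] == res for pos, res in zip(edge, hyper))
    return count / total

parents = ["ACDEFG", "AKDEYG", "MCDQFG"]   # toy aligned parents
breakpoints = [1, 3]                        # sorted cut positions
fast = library_frequency(parents, breakpoints, (0, 2, 4), ("A", "D", "F"))
slow = enumerated_frequency(parents, breakpoints, (0, 2, 4), ("A", "D", "F"))
```

The enumeration visits |S|^(n+1) hybrids, whereas the product form touches each parent once per fragment that the edge overlaps.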
The combined effect of the two-step recombination process on an individual edge, the edge perturbation, is then defined as the change in edge weight:

\[ \Delta w(e, X) = w(e) - w(e, X) . \tag{10} \]

If all vertices of e are in one fragment, we have w(e) = w(e, X) and ∆w(e, X) = 0. The edge perturbation thus integrates essential information from the database, family, parent sequences, and breakpoint locations.

Given parent sequences, a set of breakpoints determines a hybrid library. The quality of this hybrid library can be measured by the total perturbation to all edges due to the breakpoints. The hypothesis is that the lower the perturbation, the higher the representation of folded and functional hybrids in the library. We formulate the breakpoint selection problem as follows.
Problem 1 c-RECOMB. Given G_c = (V, E_c, w) and a positive integer n, choose a set of breakpoints X = {x_1, x_2, · · · , x_n} minimizing Σ_{e ∈ E_c} ∆w(e, X).

Recall from Sec. 2 that G_c represents a hypergraph with edge order uniformly c (where edges with order less than c are also represented as order-c edges).
This hypergraph partitioning problem is significantly more specific than general hypergraph partitioning, so it is interesting to consider its algorithmic difficulty. As we will see in Sec. 3.1, c-RECOMB is NP-hard for c = 4 (and thus also for c > 4), although we provide polynomial-time solutions for c = 2 in Sec. 3.2 and c = 3 in Sec. 3.4.

A special case of c-RECOMB provides an efficient heuristic approach to minimize the overall perturbation. By minimizing the total weight of all edges E_X removed in the decomposing step, fewer interactions need to be recovered in the recombining step.
Problem 2 c-DECOMP. Given G_c = (V, E_c, w) and a positive integer n, choose a set of breakpoints X = {x_1, x_2, · · · , x_n} minimizing Σ_{e ∈ E_X} w(e).

c-DECOMP could also be useful in identifying modular units in protein structures, in which case there is no recombining step.
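To make the c-DECOMP objective concrete, the following sketch evaluates the total weight of removed edges and finds the best breakpoint set by exhaustive search on a toy instance. It is not an efficient algorithm, since it enumerates all breakpoint sets. The 0-based vertex indices, the convention that breakpoint b cuts between vertices b and b + 1, the function names, and the toy weights are all assumptions of the sketch.

```python
from itertools import combinations

def spans_breakpoint(edge, breakpoints):
    """An edge is removed if some breakpoint separates two of its vertices."""
    lo, hi = min(edge), max(edge)
    return any(lo <= b < hi for b in breakpoints)

def decomp_cost(edges, breakpoints):
    """Total weight of the removed edge set E_X."""
    return sum(w for e, w in edges.items() if spans_breakpoint(e, breakpoints))

def brute_force_decomp(num_vertices, edges, n):
    """Exhaustive search over all n-subsets of cut positions (toy sizes only)."""
    candidates = range(num_vertices - 1)
    return min((decomp_cost(edges, X), X) for X in combinations(candidates, n))

# Toy hypergraph on 6 vertices; keys are vertex tuples, values are weights.
edges = {(0, 1): 5.0, (1, 2, 3): 1.0, (3, 4): 4.0, (4, 5): 0.5, (0, 5): 2.0}
best_one = brute_force_decomp(6, edges, n=1)
best_two = brute_force_decomp(6, edges, n=2)
```

In this toy instance the long-range edge (0, 5) is cut by every breakpoint, so the choice is driven by which short-range edges are sacrificed.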
3.1 NP-Hardness of 4-RECOMB
4-RECOMB is combinatorial in the set X of breakpoints and the possible configurations they can take relative to each edge. The number of possible libraries could be huge even with a small number of breakpoints (e.g., choosing 7 breakpoints from 262 positions for beta-lactamase results in on the order of 10^13 possible configurations). The choices made for breakpoints are reflected in whether or not there is a breakpoint between each pair of sequentially-ordered vertices of an edge, and thus in the perturbation to the edge. We first give a decision version of 4-RECOMB as follows and then prove that it is NP-hard. Thus the related optimization problem is also NP-hard. Our reduction employs general hypergraphs; analysis in the geometrically-restricted case remains interesting future work.
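The order of magnitude quoted for beta-lactamase can be checked directly as a binomial coefficient, assuming that the count refers simply to the number of ways of choosing 7 of 262 positions with no minimum spacing imposed.

```python
import math

# Number of ways to choose 7 breakpoints from 262 candidate positions.
num_libraries = math.comb(262, 7)
order_of_magnitude = len(str(num_libraries)) - 1
```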
Problem 3 4-RECOMB-DEC. Given G_4 = (V, E_4, w), a positive integer n, and an integer W, does there exist a set of breakpoints X = {x_1, x_2, · · · , x_n} such that Σ_{e ∈ E_4} ∆w(e, X) ≤ W?
Theorem 3.1 4-RECOMB-DEC is NP-hard.
Proof. We reduce from 3SAT. Let φ = C_1 ∧ C_2 ∧ · · · ∧ C_k be a boolean formula in 3-CNF with k clauses. We shall construct a hypergraph G_4 = (V, E_4, w) such that φ is satisfiable iff there is a 4-RECOMB-DEC solution for G_4 with n = 3k breakpoints and W = −|E_4|. (See Fig. 2.) For clause C_i = (l_{i,1} ∨ l_{i,2} ∨ l_{i,3}) in φ, add to V four vertices in sequential order v_{i,1}, v_{i,2}, v_{i,3}, and v_{i,4}. Elongate V with 3k trivial vertices (v′_j in Fig. 2), where we can put trivial breakpoints that cause no perturbation. Let us define predicate b(i, s, X) = (v_{i,s} ∈ X) for s ∈ {1, 2, 3}, indicating whether or not there is a breakpoint between v_{i,s} and v_{i,s+1}. We also use indicator function I to convert a boolean value to 0 or 1.

We construct E_4 with three kinds of edges: (1) For the 4-tuple of vertices for clause C_i, add an edge e = ⟨v_{i,1}, v_{i,2}, v_{i,3}, v_{i,4}⟩ with ∆w(e, X) = −I{b(i, 1, X) ∨ b(i, 2, X) ∨ b(i, 3, X)}. (2) If two literals l_{i,s} and l_{j,t} are identical, add an edge e = ⟨v_{i,s}, v_{i,s+1}, v_{j,t}, v_{j,t+1}⟩ with ∆w(e, X) = −I{b(i, s, X) = b(j, t, X)}. (3) If two literals l_{i,s} and l_{j,t} are complementary, add an edge e = ⟨v_{i,s}, v_{i,s+1}, v_{j,t}, v_{j,t+1}⟩ with ∆w(e, X) = −I{b(i, s, X) ≠ b(j, t, X)}.
There are 7k vertices and at most k + (3k choose 2) = O(k^2) edges, so the construction takes polynomial time. It is also a reduction. First, if φ has a satisfying assignment, choose breakpoints X = {v_{i,s} | l_{i,s} is TRUE} plus additional breakpoints between the trivial vertices to reach 3k total. Since each clause is satisfied, at least one of its literals is true, so there is a breakpoint in the corresponding edge e and its perturbation is −1. Since literals must be used consistently, type 2 and 3 edges also have −1 perturbation. Thus 4-RECOMB-DEC is satisfied with n = 3k and W = −|E_4|. Conversely, suppose there is a 4-RECOMB-DEC solution with breakpoints X. Since no edge can have perturbation below −1, a total of at most −|E_4| forces every edge to have perturbation exactly −1. Assign truth values to variables such that l_{i,s} = b(i, s, X) for s ∈ {1, 2, 3} and i ∈ {1, 2, · · · , k}. Since perturbation to type 1 edges is −1, there must be at least one breakpoint in each clause vertex tuple, and thus a true literal in the clause. Since perturbation to type 2 and 3 edges is −1, literals are used consistently.
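The reduction can be exercised on small formulas. The sketch below models only the perturbation rules of the three edge types, not the vertex layout, and compares a brute-force search for a breakpoint set attaining total perturbation −|E_4| against a brute-force satisfiability check. Clauses are written as triples of signed integers (a DIMACS-like encoding, which is an assumption of the sketch), breakpoints are identified with literal slots (i, s), and the padding with trivial breakpoints up to n = 3k is omitted because it does not affect any edge. All function names are hypothetical.

```python
from itertools import combinations, product

def build_edges(clauses):
    """Return the literal slots and the three kinds of edges."""
    slots = [(i, s) for i in range(len(clauses)) for s in range(3)]
    lit = {(i, s): clauses[i][s] for (i, s) in slots}
    edges = [("clause", i) for i in range(len(clauses))]       # type 1
    for a, b in combinations(slots, 2):
        if lit[a] == lit[b]:
            edges.append(("same", a, b))                       # type 2
        elif lit[a] == -lit[b]:
            edges.append(("opposite", a, b))                   # type 3
    return slots, edges

def perturbation(edge, X):
    """Delta-w of one edge given the set X of slots holding a breakpoint."""
    if edge[0] == "clause":
        return -int(any((edge[1], s) in X for s in range(3)))
    a, b = edge[1], edge[2]
    if edge[0] == "same":
        return -int((a in X) == (b in X))
    return -int((a in X) != (b in X))

def recomb_dec_has_solution(clauses):
    """Is there X with total perturbation <= W = -|E_4|? (exhaustive)"""
    slots, edges = build_edges(clauses)
    for bits in product([False, True], repeat=len(slots)):
        X = {slot for slot, on in zip(slots, bits) if on}
        if sum(perturbation(e, X) for e in edges) <= -len(edges):
            return True
    return False

def satisfiable(clauses):
    """Brute-force 3SAT check over all truth assignments."""
    variables = sorted({abs(l) for c in clauses for l in c})
    for values in product([False, True], repeat=len(variables)):
        truth = dict(zip(variables, values))
        if all(any(truth[abs(l)] == (l > 0) for l in c) for c in clauses):
            return True
    return False

fig2 = [(1, -2, 3), (2, 3, -4)]        # the formula of Fig. 2
unsat = [(1, 1, 1), (-1, -1, -1)]      # z1 and not-z1, padded to 3 literals
```

For the formula of Fig. 2 the construction yields two type 1 edges, one type 2 edge (the shared literal z_3), and one type 3 edge (z̄_2 versus z_2), and both checks agree that a solution exists.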
Fig. 2. Construction of hypergraph G_4 = (V, E_4, w) from an instance of 3SAT φ = (z_1 ∨ z̄_2 ∨ z_3) ∧ (z_2 ∨ z_3 ∨ z̄_4). Type 1 edges e_1 and e_2 ensure the satisfaction of clauses (−1 perturbation iff there is a breakpoint iff the literal is true and the clause is satisfied), while type 3 edge e_3 and type 2 edge e_4 ensure the consistent use of literals (−1 perturbation iff the breakpoints are identical or complementary iff the variable has a single value).
We note that 4-RECOMB-DEC is in NP, since given a set of breakpoints X for parents S we can compute ∆w(e, X) for all edges in polynomial time (O(|S|^4 |E|)), and then need only sum and compare to a provided threshold.
3.2 Dynamic Programming Framework
Despite the NP-hardness of the general sequentially-constrained hypergraph partitioning problem c-RECOMB, the structure of the problem (i.e., the sequential constraint) leads to efficient solutions for some important cases. Suppose we are adding breakpoints one by one from left to right (N- to C-terminal) in the sequence. Then the additional perturbation to an edge e caused by adding breakpoint x_t given previous breakpoints X_{t−1} = {x_1, x_2, · · · , x_{t−1}} can be written:

\[ \Delta\Delta w(e, X_{t-1}, x_t) = \Delta w(e, X_t) - \Delta w(e, X_{t-1}) , \tag{11} \]

where X_0 = ∅ and the additional perturbation caused by the first breakpoint is ∆∆w(e, X_0, x_1) = ∆w(e, X_1). Reusing notation, we indicate the total additional perturbation to all edges as ∆∆w(E, X_{t−1}, x_t). Now, if the value of ∆∆w(E, X_{t−1}, x_t) can be determined by the positions of x_{t−1} and x_t, independent of previous breakpoints, then we can adopt the dynamic programming approach shown below. When the additional perturbation depends only on x_{t−1} and x_t, we write it as ∆∆w(E, x_{t−1}, x_t) to indicate the restricted dependence.
Let d[t, τ] be the minimum perturbation caused by t breakpoints with the rightmost at position τ. If, for simplicity, we regard the right end of the sequence as a trivial breakpoint that causes no perturbation, then d[n + 1, |V|] is the minimum perturbation caused by n breakpoints plus this trivial one, i.e., the objective function for Problem 1. We can compute d recursively:

\[ d[t, \tau] = \begin{cases} \Delta\Delta w(E, X_0, \tau) , & \text{if } t = 1 ; \\ \min_{\lambda \le \tau - \delta} \{ d[t-1, \lambda] + \Delta\Delta w(E, \lambda, \tau) \} , & \text{if } t \ge 2 , \end{cases} \tag{12} \]

where δ is a user-specified minimum sequential distance between breakpoints.
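A table for d[t, τ] can be filled in order of increasing t, as in the following minimal sketch. It assumes that the additional perturbation is supplied as a callback `cost(prev, cur)` depending only on the previous and current breakpoint positions, with `prev=None` for the first breakpoint; the callback, the 1-based positions, and the function name are assumptions of the sketch, and how the callback is computed for a particular edge order is left open.

```python
import math

def min_total_perturbation(num_vertices, n, delta, cost):
    """Fill d[t][tau] bottom-up and trace back an optimal breakpoint set.

    d[t][tau] is the minimum perturbation of t breakpoints with the
    rightmost at position tau; the right end (tau = num_vertices) serves
    as the trivial (n + 1)-th breakpoint.
    """
    V = num_vertices
    d = [[math.inf] * (V + 1) for _ in range(n + 2)]
    back = [[None] * (V + 1) for _ in range(n + 2)]
    for tau in range(1, V + 1):
        d[1][tau] = cost(None, tau)
    for t in range(2, n + 2):
        for tau in range(1, V + 1):
            # The previous breakpoint must lie at least delta to the left.
            for lam in range(1, tau - delta + 1):
                cand = d[t - 1][lam] + cost(lam, tau)
                if cand < d[t][tau]:
                    d[t][tau] = cand
                    back[t][tau] = lam
    # Walk the back-pointers from the trivial breakpoint at the right end.
    breakpoints, t, tau = [], n + 1, V
    while t > 1 and back[t][tau] is not None:
        tau = back[t][tau]
        t -= 1
        breakpoints.append(tau)
    return d[n + 1][V], breakpoints[::-1]
```

The table has O(n|V|) entries and each is filled in O(|V|) time, so this sketch makes O(n|V|^2) calls to `cost`.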
The recurrence can be efficiently computed bottom-up in a dynamic programming style, due to its optimal substructure. In the following, we instantiate this