January Weiner 3rd, Francois Beaussart and Erich Bornberg-Bauer
Division of Bioinformatics, School of Biological Sciences, The Westfalian Wilhelms University of Münster, Germany
Proteins are well known to evolve not only by point mutations, but also by modular rearrangements [1–3]. By and large, these rearrangements occur at the level of domains, which are independent folding units and have been proposed to represent the unit of modular evolution [3,4]. Most domains always form the same combinations; that is, they are always found next to the same neighbours. For example, domains found in ribosomal proteins are not found elsewhere and are always present in the same context. Also, it has been reported that many domains appear in a highly conserved order (supradomains) [5], and that the frequent occurrence of certain modular arrangements (arrangements of modules along a sequence) across phyla is the result of conservation [6].
While few domains co-occur with many others at least once in the same protein, most domains have few partner domains, or are even always singletons [3,7–9]. Well-known examples of highly linked domains occurring in many different combinations are the P-loop nucleotide triphosphate hydrolase domain, the epidermal growth factor (EGF) domain, the SH3 domain, the P-kinase domain and the domains involved in the blood clotting cascade [1,10].
The phenomenon of differential arrangements has often been termed domain mobility [11]. However, this term may be misleading as it implies that single
Keywords
domain loss; fission; fusion; protein
domains; protein evolution
Correspondence
E Bornberg-Bauer, Division of
Bioinformatics, School of Biological
Sciences, The Westfalian Wilhelms
University of Münster, Schlossplatz 4,
D48149 Münster, Germany
Fax: +49 251 8321631
Tel: +49 251 8321630
E-mail: ebb@uni-muenster.de
(Received 5 December 2005, revised 13
February 2006, accepted 9 March 2006)
doi:10.1111/j.1742-4658.2006.05220.x
The main mechanisms shaping the modular evolution of proteins are gene duplication, fusion and fission, recombination and loss of fragments. While a large body of research has focused on duplications and fusions, we concentrated, in this study, on how domains are lost. We investigated motif databases and introduced a measure of protein similarity that is based on domain arrangements. Proteins are represented as strings of domains and comparison was based on the classic dynamic alignment scheme. We found that domain losses and duplications were more frequent at the ends of proteins. We showed that losses can be explained by the introduction of start and stop codons which render the terminal domains nonfunctional, such that further shortening, until the whole domain is lost, is not evolutionarily selected against. We demonstrated that domains which also occur as single-domain proteins are less likely to be lost at the N terminus and in the middle, than at the C terminus. We conclude that fission/fusion events with single-domain proteins occur mostly at the C terminus. We found that domain substitutions are rare, in particular in the middle of proteins. We also showed that many cases of substitutions or losses result from erroneous annotations, but we were also able to find courses of evolutionary events where domains vanish over time. This is explained by a case study on the bacterial formate dehydrogenases.
Abbreviations
Domain ID, domain identification number; EGF, epidermal growth factor; FDHF, formate dehydrogenase H.
modules or small arrangements are being transferred from one protein to another. Considering that often two modules or larger arrangements as such are fused into one protein, it becomes difficult to define which of the modules is ‘mobile’ and which is ‘static’. Therefore, it has been suggested that the term versatility should be used instead of domain mobility [3,12]. Independently of the perspective taken, the underlying mechanisms of modular rearrangements are mostly gene fusion and domain loss and, probably to a lesser extent, domain shuffling of exons and recombination [13–17].
While the emergence of domain combinations is well documented [4,6,7,18–21], relatively little is known about domain losses.
In this article, we focus on how domains are lost. Ultimately, this question is difficult to separate from the recruitment of domains because, in comparing two proteins, phylogenetic analysis is required to detect whether a domain has been recruited in one protein or lost in the other. To deal with this problem, we investigated the possible genetic mechanisms that can cause a domain to be lost or gained.
As usual in sequence analysis, information on the history of evolution can only be inferred a posteriori, meaning that disadvantageous mutations (frameshifts, domain deletions, etc.) have been weeded out by negative selection. Thus, we only observe events of modular rearrangements that are either beneficial or neutral. For the sake of comprehensiveness, we used the ProDom database [22], which records conserved sequence fragments. However, these fragments are not always identical to structural domains. To conform with the general definition of domains [3], all key results were confirmed using Pfam, which largely agrees with structural domain definitions [23].
In the following study we first investigated whether the relative frequencies of deletions (or recruitments) depend on whether a domain is at the end or in the middle of a protein. Unless explicitly stated, we used the term ‘deletion’ as synonymous for deletions and recruitments. We then investigated whether eliminations are more frequently observed at the boundaries of domains and whether or not domain substitutions are frequent. For that purpose, we categorized and described misannotations of domains to discern them from real substitutions or deletions of domains. Next, we studied whether some domains are more often lost and whether frequencies of domain deletions depend on domain versatility. Finally, we discussed the implications of our results for a wider understanding of modular protein evolution and the possibilities for generating a model in which modular protein evolution is formally described in terms of module edit operations and cost functions.
Results and Discussion
Single domain deletions
The first question we asked was whether the probability of a domain deletion is evenly distributed throughout a protein. The null hypothesis was that genetic mechanisms which lead to domain deletions (for example, deletions and insertions of sequence fragments, intron recombinations, etc.) do not depend on the position within the sequence. However, two factors could cause a bias. First, any point mutation that creates a premature stop codon will cause a C-terminal deletion of a protein. Likewise, a mutation leading to the emergence of an alternative transcription or translation start will cause an N-terminal deletion. Second, a fission producing two genes from one will result in the deletion of a terminal fragment from a protein or, vice versa, a fusion of two smaller proteins into one will result in the observed pattern.
We first grouped proteins by the number of domains they have (see the Materials and methods). For each protein, we searched for deletion events, that is, a protein which has exactly the same domain arrangement, except for a single domain missing anywhere in the arrangement. Then we calculated the frequency of the deletion at each domain position within the group of proteins containing a given number of domains.
We found that the domain deletions are more common at either of the protein termini, and that their occurrence is slightly higher at one of the termini, depending on the number of domains in the protein and the database selected (Fig. 1). The prevalence of terminal deletions did not depend on the number of domains in proteins, and the results for Pfam and ProDom databases were similar. In only a few cases were slightly increased frequencies of domain deletions observed at a central position.
In line with our predictions, this suggests that the genetic mechanism of domain deletions acts predominantly on sequence termini. Therefore, we tentatively propose that the insertions of new transcription start and stop codons, as well as gene fusion and fission, are more likely to occur than, for example, intron mobility caused by exon shuffling.
Multiple domain deletions
We supported the previous findings by analysing cases where one or more domains were deleted from a protein. We considered only deletions in which at least half of the domains of the full-length arrangement were preserved, to ensure that homologous arrangements were being compared. The results were similar to those of single domain deletions, in that the terminal deletions were prevalent (see the Supplementary material).
In many cases, a deleted domain is a part of a larger, deleted fragment. We have found that fragments deleted at either terminus are, in general, much longer than fragments deleted within a protein sequence. The deletions within the protein are much more often single domain deletions (Fig. 2). The total number of deletions that concern only a single domain is higher for the positions between the termini. However, the number of major deletions (deletions that span more than one domain) is higher at terminal positions. This supports the view that the deletions generally involve the protein termini.
In-detail analysis of the deletion events
During our analyses, we noted that some of the apparent domain deletions are actually just misannotations. A lack of a domain identifier at a given position in a protein annotation does not necessarily mean that the corresponding domain is physically deleted. Likewise, a different identifier does not necessarily signify a physical substitution. To address this problem, we constructed clusters of similar proteins that contained at
Fig. 1. Statistics of single domain deletions in the whole SwissProt/TrEMBL set of proteins. The figure shows the relative proportion of domain deletions at different positions (x-axis: position) within the proteins of length 4, 6, 10 and 11 domains. Dark grey, Pfam; light grey, ProDom.
Fig. 2. Number of occurrences of domain deletions as a function of the length (in domains) of the deleted fragment (x-axis: length of the deleted fragment, in domains). Diamonds, N-terminal deletions; squares, deletions within the protein; circles, C-terminal deletions. Single domain losses occur preferentially at one of the middle positions, whereas longer fragments tend to be deleted at the termini.
least six ProDom domains. We aligned the domain arrangements within a cluster using a simple progressive multiple alignment algorithm [24], based on pairwise alignments generated using the Needleman-Wunsch algorithm [25] (Supplementary material).
We were able to distinguish five types of phenomena that resulted in an apparent deletion from the domain arrangement (Table 1, Fig. 3). The first two were real substitutions and physical deletions of domains. In some cases, at the site where the domain annotation was missing, there was, in fact, a sequence similar to the sequence of this domain. However, because of length or large evolutionary distance, this sequence was not annotated by the automatic annotation mechanism of ProDom (‘erosion’). In other cases, if there is a high sequence variation between the instances of the domains with a given identification number (ID), homologous sequences can be assigned different ProDom identifiers (‘camouflage’). Yet, in other cases, although the annotation (ProDom
Table 1. Criteria used to distinguish between various types of sequence rearrangements and annotation artefacts that result in a disappearance of a domain in the domain string of a protein.

Evolutionary events
physical deletion: a domain is physically deleted from the protein sequence, and only a short (<20 amino acids) fragment can be found between the neighbouring domains
substitution: a domain is replaced by another domain that bears no similarity with the original domain
shadow domain: at a given position, in one protein there is a ProDom domain; at the same position in another protein there is an amino acid sequence which is not similar to the given domain and which does not correspond to a ProDom ID

Annotation artefacts
camouflage: although there are two different ProDom domains at the same position in two proteins, they are significantly similar (E << 1)
erosion: the domain is not annotated in ProDom, but there is at this position a similar amino acid sequence
Fig. 3. Classification of domain-wise events observed in the domain databases. Different evolutionary events (A, B, C) and annotation artefacts (D, E) result in an apparent ‘deletion’ of a ProDom domain from a protein annotated in terms of ProDom domains. Panels: domain-wise evolutionary events (substitution, E-value (B,D) ~ 1; shadow domain, E-value (B,seq) ~ 1; deletion) and annotation artefacts (camouflage, E-value (B,D) << 1; erosion, E-value (B,seq) << 1). Domain and dot plots can be found in the Supplementary material.
ID) of a given domain is missing, there is no physical deletion or misannotation. Instead, the amino acid sequence at this position is not similar to the given ProDom domain; therefore, it is a case of a real substitution. We call this case a ‘shadow domain’.
For each of these events, we counted its occurrence in the constructed protein clusters (see the Materials and methods for details), at each position in each protein cluster, as follows. If a domain was found to be deleted from an arrangement in a cluster, the amino acid sequences occurring in all the sequences of the cluster at the given position were analysed. We have applied the criteria from Table 1 to distinguish between the three types of real evolutionary events (physical domain deletion, substitution and shadow domains) and two types of annotation artefacts (camouflage and erosion). In the case of physical deletions, shadow domains and erosions, the numbers of these events were simply counted. However, in the case of substitutions and camouflage, it is not reasonable to count the number of occurrences of such an event without inferring a direction of the substitution. For example, if at a certain position in a cluster, domain A occurs in two sequences, and each of the domains B and C occurs five times, then what frequency of the substitutions should be assumed here? We have used the following routine: all possible pairwise combinations of domains from different proteins occurring at the same domain position in a cluster were analysed. If the two domains in a pair were different, then an event (substitution or camouflage) was recorded. Therefore, the calculated numbers of substitution and camouflage events cannot be used to infer any conclusions on the actual substitution rate of domains; however, because at all domain positions the numbers of camouflage and substitution events have been calculated in the same way, relative frequencies of the camouflage and substitution events at different positions can be inferred.
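The pairwise counting routine described above can be illustrated with a minimal Python sketch. The function name `count_pairwise_events` and the list-based representation of an alignment column are assumptions made for illustration only; the sketch counts pairs of differing domain identifiers and does not decide whether a pair is a substitution or a camouflage, which in the text depends on the sequence similarity criteria of Table 1.

```python
from itertools import combinations


def count_pairwise_events(column):
    """Count unordered pairs of proteins with different domain IDs at one position.

    `column` holds one domain ID per protein in the cluster (hypothetical layout).
    """
    return sum(1 for a, b in combinations(column, 2) if a != b)


# Example from the text: domain A in two sequences, B and C in five each.
column = ["A"] * 2 + ["B"] * 5 + ["C"] * 5
print(count_pairwise_events(column))  # 2*5 + 2*5 + 5*5 = 45
```

Because every pair is counted, the result grows with cluster size, which is consistent with the remark that such counts are only meaningful as relative frequencies across positions.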
The relative frequencies of physical domain deletions, substitutions and shadow domains are all higher at the termini. The average domain deletion frequency is 9%: 7% at the nonterminal positions and 20% at the termini (Table 3). This trend cannot be seen in the case of annotation artefacts (Fig. 4, Table 3). Furthermore, annotation artefacts are 10 times rarer than real, physical events (Table 3). Therefore, our previous results for single-domain and multiple deletions are scarcely affected by inaccuracies of the database annotations and reflect real evolutionary events. This supports the aforementioned finding that the majority of deletions are caused by the physical deletions of protein termini.
We repeated this analysis to test whether there are differences between prokaryotes and eukaryotes; however, we did not find significant differences (see the Supplementary material).
Distribution of termini length in proteins
We have further pursued the question of whether the terminal deletions can be regarded as truly modular events; that is, to what extent evolution preserves domain boundaries upon domain deletion. The null hypothesis is that in the case of nearly neutral evolution, the domains are depleted gradually, and partially deleted domain fragments are common. In such a case, the evolution of proteins cannot be modelled by the approximation of domains or modules. However, several factors can make the situation different. First, selection pressure could rapidly eliminate the truncated fragments, because unnecessary biosynthesis of the nonfunctional protein fragments should reduce fitness. Second, if domain deletions are caused by genetic mechanisms preserving domain boundaries (such as gene fusions), partial domains will be rare. If this is the case, amino acid sequence deletions can be simplified to domain deletion events, and thus protein evolution could be abstracted to the level of modules.
We tackled this problem as follows. We have constructed clusters of proteins. Each cluster contained proteins with the same domain arrangement, or with an arrangement shortened by a terminal domain deletion, either N terminal or C terminal. We recorded the length of the N- or C-terminal amino acid
Fig. 4. Results of the protein clusters analysis: relative percentages of different evolutionary events and annotation artefacts at different domain positions within the analysed sequences (panels: evolutionary events; annotation artefacts). Error bars indicate the standard error of the calculated proportion. The values for the ‘Middle position’ were averaged from the values for all nonterminal positions.
sequence and plotted the distribution of its length (see the Materials and methods for details). The lengths were normalized for every protein cluster and then averaged for evaluation. A length of 0 corresponds to the case when the terminal domain is completely deleted, and 100 to the average length of the terminal domain in the whole cluster. Furthermore, we refined these results by counting only the protein sequence fragments that are similar, at the amino acid sequence level, to the remaining sequence of the deleted domain, given one of two E-value thresholds. These E-values between those fragments and the intact domain were recorded and put in three bins, each for a different range of E-values (any E-value; 0 ≤ E ≤ 0.01; 0 ≤ E ≤ 1 × 10⁻⁵).
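As a rough illustration of the normalization and binning just described, the following Python sketch expresses a remaining terminal fragment as a percentage of the mean terminal-domain length in its cluster and assigns an E-value to the nested bins. The function names, the bin labels and the use of a plain arithmetic mean are assumptions; the exact procedure is given only in outline in the text.

```python
def normalized_terminal_length(fragment_len, intact_domain_lens):
    """Fragment length as % of the mean intact terminal-domain length in the cluster.

    0 means the terminal domain is completely deleted; 100 means the fragment
    is as long as the average intact terminal domain (assumed definition).
    """
    mean_len = sum(intact_domain_lens) / len(intact_domain_lens)
    return 100.0 * fragment_len / mean_len


def evalue_bins(evalue):
    """Return the labels of all bins an E-value falls into (bins are nested)."""
    bins = ["any"]
    if evalue <= 0.01:
        bins.append("E<=0.01")
    if evalue <= 1e-5:
        bins.append("E<=1e-5")
    return bins


print(normalized_terminal_length(55, [100, 120]))  # 55 / 110 * 100 = 50.0
print(evalue_bins(1e-7))
```

A fragment with a very small E-value is counted in all three bins, which matches the way the three histogram series in the text are nested subsets of each other.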
The distributions of termini lengths are shown in Fig. 5. The distributions show that complete domains are much more likely to be present in proteins, and that partial domains are rare at the terminal ends. These distributions hold also for sets of data in which sequences containing three or fewer domains were removed, and also in the case of Pfam domains (Fig. 5, bottom). If an E-value threshold was applied (only fragments similar to the given domain were considered), the shorter sequences with a terminal fragment that was completely lost were eliminated from the histogram. This was not necessarily because the fragments were not homologous, but because the fragments were too short to show any significant similarity. However, the right part of the distribution, corresponding to sequence fragments of > 50% of the average domain length, did not change significantly (grey bars in Fig. 5).
Domain deletions and domain versatility
Finally, we investigated whether the domain deletion events were connected to the properties of the deleted domains themselves. Specifically, we wished to establish whether the versatility of a domain plays a role in domain deletions. Furthermore, we considered that domains can, in general, fold autonomously. Therefore, we
Fig. 5. Length distributions of the remaining fragment from a terminal domain. Distribution of the length of the terminal sequences is based on comparison of domain arrangement alignments. Left, distribution at the N termini (x-axis: length of the N terminus in % of the deleted domain); right, distribution at the C termini (x-axis: length of the C terminus in % of the deleted domain). The lengths are relative to the size of the deleted domain (= 100%). White bars, all terminal fragments; light grey, terminal fragments similar to the deleted domain (E < 0.01); dark grey, terminal fragments significantly similar to the deleted domain (E < 1 × 10⁻⁵). Top, results for the ProDom database; bottom, results for the PfamA data set.
Table 2. Deleted domains and domain versatility.
Columns: Position; Fraction as single for all (a); Fraction as single for deleted (b); Average NN for all (c); Average NN for deleted (d).
(a) Overall fraction of domains that were found to form single-domain proteins; (b) fraction of deleted domains that were found to form single-domain proteins; (c) average number of neighbours for all domains in the protein clusters ± standard error; (d) average number of neighbours for the deleted proteins ± standard error. As each of the domains in a middle position has two neighbours, the values in parentheses are the averages divided by two. The results are based on a dataset with proteins having 3 or more domains.
recorded how often domains that are lost form single-domain proteins.
First, we calculated the fraction of domains that also occur as single-domain genes in the sets of domains that are deleted at an N-terminal, C-terminal or central position. We found that the domains which also occur as single-domain proteins are found two to four times more frequently at the termini, and twice as frequently at the C terminus as at the N terminus (Table 2). Surprisingly, the average fraction of domains that also occur as single-domain genes is lower for the domains that partake in deletion events than the average for all domains.
The ability of a domain to form autonomous, single-domain proteins may be related to its versatility. We have therefore calculated the domain connectivity and found that it is highest for the nonterminal domains. However, as the domains at a nonterminal position have, on average, two neighbours, whereas the terminal domains have only one, the averages for this type of domain must be halved. In that case, the percentages of domains that form autonomous, single-domain proteins are higher for domains that undergo deletions at the termini, and lower for domains that undergo deletions at a nonterminal position (Table 2). Again, the numbers of domains that form autonomous, single-domain proteins are highest for the domains that are deleted at the C terminus.
We conclude that the elevated rates of domain deletions at the terminal regions are partly related to domain versatility and their ability to function outside a multidomain protein (to form single-domain proteins). The events involving domain acquisition/loss are twice as frequent at the C terminus as at the N terminus (Table 2).
Case study: bacterial formate dehydrogenases
An exemplary cluster of bacterial formate dehydrogenase proteins is shown in Fig. 6. This cluster illustrates several modular events, including domain deletion, a substitution by a diverged sequence fragment, and erosion (Fig. 6B). A multiple alignment of the protein sequences can be found in the Supplementary material. For some of the proteins the structure is known [26].
We analysed the phylogeny of the cluster, as derived from whole protein sequences (Fig. 6C). The obtained phylogenetic tree is consistent with the modifications of the domain arrangements (Fig. 6D), and the revealed events can be associated with the tree nodes. Significant rearrangements take place at the sixth position of the cluster where, in different proteins, we found two different ProDom domains, shadow domains and, at one position (in the protein O59078), a complete deletion. Further rearrangements are found at the protein C terminus: two proteins have two additional domains. The shadow domains may either be the result of a substitution by another sequence, or of such a high accumulation of mutations in a domain that it is no longer similar to the original sequence.
There are three variable regions in the domain arrangement of the protein cluster. First, at position 6 in the arrangement, in some proteins there are similar sequences that were not annotated in ProDom (‘erosion’) or domains which were annotated differently because of high sequence divergence (‘camouflage’). Next, at position 8, there is a substitution in two of the sequences. Finally, the C-terminal part is missing, truncated or eroded in many sequences, for example in the illustrated structure (Fig. 6A,B).
Conclusions
Our main conclusions are as follows: (a) domain deletion events occur frequently at either of the termini, (b) the deletions occur domain-wise; that is, in most of the cases the whole domain is lost, (c) domain losses correlate with domain versatility (i.e. the number of different combinations in which a domain occurs), (d) versatile domains are more frequently found at the C terminus and (e) clear definitions can be given to distinguish misannotations from physical deletions. Eventually the question ‘What is the probability of a domain deletion?’ can only be answered using domain phylogenies. However, our study shows that the deletion events are quite frequent; in the collected protein clusters, the frequencies of proteins in a cluster with a domain deleted at either of the termini were 9%
Table 3. Results of the analysis of protein clusters for the ProDom database. Numbers in the table correspond to the absolute numbers of events recorded (% of the events recorded is given in parentheses).

Event                      Average (%)    N-terminus     Middle         C-terminus
Total number of domains
Real events:
  Deletions                13925 (9.2)    2998 (20.6)    8077 (6.6)     2850 (19.6)
  Substitutions            3034 (2.0)     546 (3.8)      2000 (1.6)     488 (3.4)
  Shadow domains           8770 (5.8)     1811 (12.5)    5399 (4.4)     1560 (10.7)
Annotation artefacts:
  Camouflage               1557 (1.0)     110 (0.8)      1391 (1.1)     56 (0.4)
  Erosion                  1235 (0.8)     82 (0.6)       1001 (0.8)     152 (1.0)
(Table 3), which provides a rough estimate for the frequency of deletion events in protein–protein comparisons.
The fact that the domain deletions are not uniformly distributed along a protein, but that they nonetheless follow a distinct pattern of domain deletions, is an important conclusion in the context of constructing algorithms for sequence alignments that take into account domain arrangements of proteins. It also provides a biological justification for choosing a lower end-gap penalty in sequence alignment algorithms, such as ClustalW [27].
In conclusion, by analysing the versatility of deleted domains and their ability to form single-domain proteins, we have found that, while gene fusion and fission indeed play a significant role in the deletion events at the termini, the introduction of new start and stop codons also plays a major role. The fraction of the deleted domains that can be found as single-domain proteins was twice as high at the C terminus (Table 2), as was the connectivity of the C-terminally deleted domains. This suggests that in a gene fusion or fission event, the versatile, single-domain protein is more likely to be found at the C terminus. This may be explained by the fact that in a gene fusion/fission event, or in the case of introduction of new start and stop codons, the N-terminal part of the coding sequence remains connected to its promoter region and regulatory sites. Thus, a versatile domain that is fused with the C terminus of a much larger protein will not have an effect on the regulation of the whole protein, because it will not modify the promoter region and regulatory sites. Our results suggest such a selective disequilibrium: the function (and regulation) of the protein is connected to its N-terminal part, and therefore the fusion/fission events involving smaller, versatile domains will occur more frequently at the C terminus.
Moreover, we have found that the event of domain deletion occurs mostly in a modular manner. This can have two explanations. First, the apparent domain deletion can be caused by gene fusion or fission. Second, a domain fragment truncated (e.g. by a nonsense mutation) that is no longer functional may be rapidly eliminated by natural selection. Either way, the domain deletions effectively respect domain boundaries. These results have further supported the emerging view that, by and large, the modular evolution of proteins is dominated by two major types of events: fusion, on the one hand, and deletion and fission on the other [3,4,21,28]. Exon shuffling and recombination seem to be rare.
Fig. 6. Cluster of the bacterial formate dehydrogenases. (A,B) The structure of formate dehydrogenase H (FDHF) from Escherichia coli. (C) Phylogeny of the analysed proteins obtained by the parsimony method with 100 bootstraps. (D) The corresponding domain arrangements of the analysed proteins. Colour code: (A) is coloured according to the ProDom annotation, with one colour for every domain. Colours and arrows on (B) indicate events identified by analysis of a cluster of related proteins and correspond to the coloured arrows on (C) and (D). The symbols on (C) show a possible attribution of the events to tree nodes: sub, substitution; del, deletion/insertion; colours of the symbols correspond to the colours on (B). The coloured boxes on (D) correspond to different ProDom domains and are the same as on (A). The black thin boxes on position 6 correspond to ‘shadow domains’.
Materials and methods
For the analyses, ProDom [22] version 2004.1 was used. The main results were confirmed using Pfam, release 17 [29]. Each database contains a number of domain arrangements, that is, proteins annotated in terms of domains. All supplementary materials can be found on our web page (http://www.uni-muenster.de/Bioinformatics/services/domdel/).
Overall single deletion statistics
Proteins from the ProDom database and, separately, from the Pfam database, were divided into sets according to the number of domains. Each set contained all proteins with a fixed number of domains, for example ‘set6’ contained proteins with six domains.
Each protein from a given set containing proteins of length N domains was compared with each protein from the set containing proteins of length N − 1 domains. For example, a protein with six domains was compared with all proteins that have five domains. If the shorter arrangement was identical to the longer one, with the exception of a single, missing domain, a deletion was registered. The position of the deletion within the domain arrangement was recorded. For example, the five-domain arrangement ABDEF (where A to F are domains) is identical to the six-domain arrangement ABCDEF, with the exception of the deleted domain C.
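The comparison of an N-domain arrangement with an (N − 1)-domain arrangement can be sketched in Python as follows. This is a minimal illustration under the assumption that an arrangement is a list of domain identifiers; the function name `single_deletion_positions` is hypothetical. Where neighbouring domains repeat, more than one position is compatible with the deletion, so the sketch returns all of them rather than choosing one.

```python
def single_deletion_positions(longer, shorter):
    """Return the 1-based positions in `longer` whose removal yields `shorter`.

    Both arguments are lists of domain IDs. An empty list means the two
    arrangements do not differ by exactly one missing domain.
    """
    if len(longer) != len(shorter) + 1:
        return []
    return [i + 1 for i in range(len(longer))
            if longer[:i] + longer[i + 1:] == shorter]


# Example from the text: ABDEF is ABCDEF without domain C (position 3).
print(single_deletion_positions(list("ABCDEF"), list("ABDEF")))  # [3]
```

How ambiguous cases with repeated domains were handled is not stated in the text, so any tie-breaking rule added on top of this sketch would be a further assumption.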
The average deletion frequency was calculated as the number of all deletion events divided by the total number of domains in all the examined sequences. The relative domain deletion frequency at a given domain position in a set of proteins of a given length was defined as the number of deletions at this position, divided by the total number of deletions in this set.
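The two frequencies defined above can be written down as a short Python sketch. The function names and the input format (a list of 1-based deletion positions recorded for one set of proteins) are assumptions chosen for illustration.

```python
from collections import Counter


def average_deletion_frequency(n_deletions, total_domains):
    """Number of all deletion events divided by the total number of domains."""
    return n_deletions / total_domains


def relative_deletion_frequencies(positions, n_domains):
    """Relative deletion frequency at each domain position of one protein set.

    `positions` lists the 1-based position of every recorded deletion in a set
    of proteins with `n_domains` domains; the result sums to 1 if any deletion
    was recorded.
    """
    counts = Counter(positions)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_domains
    return [counts[k] / total for k in range(1, n_domains + 1)]


# Four recorded deletions in a hypothetical set of four-domain proteins.
print(relative_deletion_frequencies([1, 1, 4, 2], 4))  # [0.5, 0.25, 0.0, 0.25]
```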
These investigations have been repeated with a nonredundant data set, in which each arrangement was represented only once. That is, from a set of proteins which had the same domain arrangement, only one representative was kept.
Overall multiple deletion statistics
For each domain arrangement given, all other arrangements that would be obtained by removal from the given arrangement of one or more domains were considered. For example, if A to F are domains, and ABCDEF is the given arrangement, then we would consider the arrangements ABCDE, BCD, ABEF, etc.
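One way to test whether an arrangement can be obtained from another by removing one or more domains is an ordered-subsequence check, sketched below in Python. The function names are hypothetical, and the second helper encodes the additional filter mentioned in the Results (at least half of the domains of the full-length arrangement preserved); whether the original implementation worked this way is not stated.

```python
def is_subarrangement(full, sub):
    """True if `sub` results from deleting one or more domains from `full`.

    The relative order of the remaining domains must be preserved.
    """
    if not 0 < len(sub) < len(full):
        return False
    remaining = iter(full)
    # `d in remaining` advances the iterator, so domains must appear in order.
    return all(d in remaining for d in sub)


def keeps_at_least_half(full, sub):
    """Filter from the Results: at least half of the domains are preserved."""
    return 2 * len(sub) >= len(full)


full = list("ABCDEF")
for candidate in ["ABCDE", "BCD", "ABEF", "FA"]:
    sub = list(candidate)
    print(candidate, is_subarrangement(full, sub), keeps_at_least_half(full, sub))
```

The three arrangements named in the text (ABCDE, BCD, ABEF) pass both checks, whereas FA fails because its domains appear in the wrong order.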
Similarity of protein arrangements
For the purpose of constructing multiple domain
arrangement alignments and domain arrangement-based
phylogenies, we implemented the Needleman-Wunsch global
alignment algorithm [25] for protein domains, with the
parameters as defined previously [17]: match = 10, mismatch = −5, gap = −1.
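A minimal sketch of Needleman-Wunsch scoring over domain arrangements with the parameters given above. The function name `nw_score` is hypothetical, and a linear gap penalty (the same cost for every gap position) is assumed because the text gives a single gap value; only the optimal score is returned, not the alignment itself.

```python
def nw_score(a, b, match=10, mismatch=-5, gap=-1):
    """Global alignment score of two domain arrangements.

    `a` and `b` are sequences of domain identifiers. Linear gap costs are
    an assumption of this sketch.
    """
    # Row 0: aligning an empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align the two domains
                            prev[j] + gap,        # gap in `b`
                            curr[j - 1] + gap))   # gap in `a`
        prev = curr
    return prev[-1]
```

With these parameters two gaps (−2) are cheaper than one mismatch (−5), so the alignment favours treating differing domains as insertions or deletions rather than substitutions. For example, ABCDEF against ABDEF scores five matches and one gap.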
Construction of protein clusters
We constructed clusters of proteins with similarity in their domain arrangement of > 80%. Only clusters that had at least six domains were considered. For each protein from the ProDom database, all proteins were considered that had one domain less than the given protein. If a given protein matched the examined arrangement by all but one domain, a deletion event was recorded. Starting with a single protein, a number of hits was recorded and added to the cluster; furthermore, these proteins were used to obtain the next set of hits (i.e. proteins that have one domain less than the protein that was used in the search). The procedure stopped for a given cluster when no further similar domain arrangements were found. Only clusters containing at least 10 proteins and 10 ProDom domains were used for further analysis. Additionally, the amino acid sequences of all the proteins in the cluster were collected. The resulting clusters were subsequently aligned with a simple multiple-domain arrangement alignment algorithm (progressive alignment). The length (in terms of domains) of a cluster was defined as the length of the multiple-domain arrangement alignment.
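The iterative search described above can be illustrated as a breadth-first expansion from a seed protein. This is only a sketch: the names `one_domain_shorter` and `grow_cluster` are hypothetical, `proteins` is assumed to map protein identifiers to tuples of domain identifiers, and the > 80% similarity threshold and the minimum-size filters are deliberately left out.

```python
from collections import deque


def one_domain_shorter(longer, shorter):
    """True if removing exactly one domain from `longer` gives `shorter`."""
    return len(longer) == len(shorter) + 1 and any(
        longer[:i] + longer[i + 1:] == shorter for i in range(len(longer))
    )


def grow_cluster(seed, proteins):
    """Collect proteins reachable from `seed` by successive single deletions.

    `proteins` maps a protein id to its domain arrangement (a tuple).
    """
    cluster = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for pid, arrangement in proteins.items():
            if pid not in cluster and one_domain_shorter(proteins[current], arrangement):
                cluster.add(pid)   # record the hit
                queue.append(pid)  # and use it for the next round of hits
    return cluster
```

The loop ends when a round produces no new hits, which mirrors the stopping condition in the text.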
Calculation of the relative event frequency at different domain positions in protein clusters
For each of the events, e, and for each of the sets of clusters of a given length, l, the frequency of the event at a position, k, was defined as:
$$f_{e,k} = \frac{n_{e,k}}{\sum_{i=1}^{l} n_{e,i}},$$
where $n_{e,i}$ is the number of occurrences of the event e at the domain position i. The average frequency at the middle positions (that is, all domain positions except the N- and C-termini) was calculated as:
$$n_{e,\mathrm{middle}} = \frac{\sum_{i=2}^{l-1} f_{e,i}}{l - 2}.$$
Finally, the N-terminal, C-terminal and central position frequencies for each event were averaged for all sets of clusters.
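As a worked illustration of the definitions above, the following sketch computes the per-position frequencies and the N-terminal, middle and C-terminal values for one event and one cluster length. The function names are hypothetical, and the input is assumed to be a list of event counts, one per domain position, ordered from the N- to the C-terminus.

```python
def position_frequencies(counts):
    """Frequency of an event at each position: count divided by the total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no events recorded")
    return [count / total for count in counts]


def terminal_and_middle(counts):
    """Return (N-terminal, average middle, C-terminal) frequencies."""
    if len(counts) < 3:
        raise ValueError("at least three positions are needed")
    freqs = position_frequencies(counts)
    # Average over all positions except the two termini.
    middle = sum(freqs[1:-1]) / (len(freqs) - 2)
    return freqs[0], middle, freqs[-1]
```

For counts `[2, 1, 1, 4]` the frequencies are 0.25, 0.125, 0.125 and 0.5, so the middle average is 0.125.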
Distribution of amino acid sequence length of the termini
For each of the databases ProDom and Pfam, two sets of alignments were created: one for N-terminal deletions, and one for C-terminal deletions. In each set, an alignment contained sequences that had one of the two types of arrangements: either a complete arrangement, or one in which a terminal domain was missing from the ProDom description. Alignments were constructed from the whole ProDom
database. Only alignments which contained at least one
complete sequence and one sequence with a missing domain
(depending on the set, either N- or C-terminal) were
considered.
For each alignment in each set, the average size of the
deleted domain was calculated for the proteins with the
complete arrangement. To take into account the variability
of the length of the complete domain, the length of the
N-terminal fragment was defined as the length of the
amino acid sequence preceding the next domain in
the arrangements, expressed as the percentage of the
calculated average length of the deleted domain in this
alignment. Finally, the distribution of these values throughout
all of the analysed alignments was calculated.
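The normalisation described above can be sketched for a single alignment as follows. The function name and argument names are hypothetical; lengths are assumed to be given in amino acid residues, with one list for the deleted domain as observed in the complete proteins and one for the terminal fragments of the proteins lacking that domain.

```python
from statistics import mean


def fragment_percentages(deleted_domain_lengths, fragment_lengths):
    """Express each terminal fragment length as a percentage of the average
    length of the deleted domain in the same alignment."""
    average_domain_length = mean(deleted_domain_lengths)
    return [100.0 * fragment / average_domain_length
            for fragment in fragment_lengths]
```

Values close to 100 would indicate a remaining fragment about as long as the missing domain, whereas values near 0 indicate that almost no sequence precedes the next domain.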
References
1 Patthy L (1999) Protein Evolution. Blackwell Science, Oxford.
2 Liu J & Rost B (2004) CHOP: parsing proteins into structural domains. Nucleic Acids Res 32, W569–W571.
3 Bornberg-Bauer E, Beaussart F, Kummerfeld S, Teichmann S & Weiner J 3rd (2005) The evolution of domain arrangements in proteins and interaction networks. Cell Mol Life Sci 62, 435–445.
4 Vogel C, Teichmann S & Pereira-Leal J (2005) The relationship between domain duplication and recombination. J Mol Biol 346, 355–365.
5 Vogel C, Berzuini C, Bashton M, Gough J & Teichmann S (2004) Supra-domains: evolutionary units larger than single protein domains. J Mol Biol 336, 809–823.
6 Gough J (2005) Convergent evolution of domain architectures (is rare). Bioinformatics 21, 1464–1471.
7 Apic G, Gough J & Teichmann S (2001) An insight into domain combinations. Bioinformatics 17 (Suppl 1), S83–S89.
8 Wuchty S (2001) Scale-free behavior in protein domain networks. Mol Biol Evol 18, 1694–1702.
9 Bornberg-Bauer E (2002) Randomness, structural uniqueness, modularity, and neutral evolution in sequence space of model proteins. Z Phys Chem 216, 139–154.
10 Madera M, Vogel C, Kummerfeld S, Chothia C & Gough J (2004) The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res 32, D235–D239.
11 Doolittle R & Bork P (1993) Evolutionarily mobile modules in proteins. Sci Am 269, 50–56.
12 Apic G, Huber W & Teichmann S (2003) Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J Struct Funct Genomics 4, 67–78.
13 Ponting C & Russell R (1995) Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem Sci 20, 179–180.
14 Uliel S, Fliess A & Unger R (2001) Naturally occurring circular permutations in proteins. Prot Eng 14, 533–542.
15 Fliess A, Motro B & Unger R (2002) Swaps in protein sequences. Proteins 48, 377–387.
16 Bujnicki J (2002) Sequence permutations in the molecular evolution of DNA methyltransferases. BMC Evol Biol 2, 3.
17 Weiner J 3rd, Thomas G & Bornberg-Bauer E (2005) Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics 21, 932–937.
18 Apic G, Gough J & Teichmann S (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol 310, 311–325.
19 Bashton M & Chothia C (2002) The geometry of domain combination in proteins. J Mol Biol 315, 927–939.
20 Vogel C, Bashton M, Kerrison N, Chothia C & Teichmann S (2004) Structure, function and evolution of multidomain proteins. Curr Opin Struct Biol 14, 208–216.
21 Kummerfeld S & Teichmann S (2005) Relative rates of gene fusion and fission in multi-domain proteins. Trends Genet 21, 25–30.
22 Corpet F, Servant F, Gouzy J & Kahn D (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res 28, 267–269.
23 Zhang Y, Chandonia J, Ding C & Holbrook S (2005) Comparative mapping of sequence-based and structure-based protein domains. BMC Bioinformatics 6, 77.
24 Feng D & Doolittle R (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25, 351–360.
25 Needleman S & Wunsch C (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443–453.
26 Boyington J, Gladyshev V, Khangulov S, Stadtman T & Sun P (1997) Crystal structure of formate dehydrogenase H: catalysis involving Mo, molybdopterin, selenocysteine, and an Fe4S4 cluster. Science 275, 1305–1308.
27 Thompson J, Higgins D & Gibson T (1994) clustalw: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680.
28 Weiner J 3rd & Bornberg-Bauer E (2006) Evolution of circular permutations in multi-domain proteins. Mol Biol Evol 23, 734–743.
29 Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy S, Griffiths-Jones S, Howe K, Marshall M & Sonnhammer E (2002) The Pfam protein families database. Nucleic Acids Res 30, 276–280.