
Domain deletions and substitutions in the modular protein evolution


Document information

Title: Domain deletions and substitutions in the modular protein evolution
Authors: January Weiner 3rd, Francois Beaussart, Erich Bornberg-Bauer
Institution: Westfalian Wilhelms University of Münster
Field: Bioinformatics
Type: Journal article
Year of publication: 2006
City: Münster
Pages: 11

January Weiner 3rd, Francois Beaussart and Erich Bornberg-Bauer

Division of Bioinformatics, School of Biological Sciences, The Westfalian Wilhelms University of Münster, Germany

Keywords

domain loss; fission; fusion; protein domains; protein evolution

Correspondence

E. Bornberg-Bauer, Division of Bioinformatics, School of Biological Sciences, The Westfalian Wilhelms University of Münster, Schlossplatz 4, D48149 Münster, Germany
Fax: +49 251 8321631
Tel: +49 251 8321630
E-mail: ebb@uni-muenster.de

(Received 5 December 2005, revised 13 February 2006, accepted 9 March 2006)

doi:10.1111/j.1742-4658.2006.05220.x

The main mechanisms shaping the modular evolution of proteins are gene duplication, fusion and fission, recombination and loss of fragments. While a large body of research has focused on duplications and fusions, we concentrated, in this study, on how domains are lost. We investigated motif databases and introduced a measure of protein similarity that is based on domain arrangements. Proteins are represented as strings of domains and comparison was based on the classic dynamic alignment scheme. We found that domain losses and duplications were more frequent at the ends of proteins. We showed that losses can be explained by the introduction of start and stop codons which render the terminal domains nonfunctional, such that further shortening, until the whole domain is lost, is not evolutionarily selected against. We demonstrated that domains which also occur as single-domain proteins are less likely to be lost at the N terminus and in the middle than at the C terminus. We conclude that fission/fusion events with single-domain proteins occur mostly at the C terminus. We found that domain substitutions are rare, in particular in the middle of proteins. We also showed that many cases of substitutions or losses result from erroneous annotations, but we were also able to find courses of evolutionary events where domains vanish over time. This is explained by a case study on the bacterial formate dehydrogenases.

Abbreviations

Domain ID, domain identification number; EGF, epidermal growth factor; FDHF, formate dehydrogenase H.

Proteins are well known to evolve not only by point mutations, but also by modular rearrangements [1–3]. By and large, these rearrangements occur at the level of domains, which are independent folding units and have been proposed to represent the unit of modular evolution [3,4]. Most domains always form the same combinations; that is, they are always found next to the same neighbours. For example, domains found in ribosomal proteins are not found elsewhere and are always present in the same context. Also, it has been reported that many domains appear in a highly conserved order (supradomains) [5], and that the frequent occurrence of certain modular arrangements (arrangements of modules along a sequence) across phyla is the result of conservation [6].

While few domains co-occur with many others at least once in the same protein, most domains have few partner domains, or are even always singletons [3,7–9]. Well-known examples of highly linked domains occurring in many different combinations are the P-loop nucleotide triphosphate hydrolase domain, the epidermal growth factor (EGF) domain, the SH3 domain, the P-kinase domain and the domains involved in the blood clotting cascade [1,10].

The phenomenon of differential arrangements has often been termed domain mobility [11]. However, this term may be misleading as it implies that single modules or small arrangements are being transferred from one protein to another. Considering that often two modules or larger arrangements as such are fused into one protein, it becomes difficult to define which of the modules is ‘mobile’ and which is ‘static’. Therefore, it has been suggested that the term versatility should be used instead of domain mobility [3,12]. Independently of the perspective taken, the underlying mechanisms of modular rearrangements are mostly gene fusion and domain loss and, probably to a lesser extent, domain shuffling of exons and recombination [13–17].

While the emergence of domain combinations is well documented [4,6,7,18–21], relatively little is known about domain losses.

In this article, we focus on how domains are lost. Ultimately, this question is difficult to discern from the recruitment of domains because, in comparing two proteins, phylogenetic analysis is required to detect whether a domain has been recruited in one protein or lost in the other. To deal with this problem, we investigated the possible genetic mechanisms that can cause a domain to be lost or gained.

As usual in sequence analysis, information on the history of evolution can only be assumed a posteriori, meaning that disadvantageous mutations (frameshifts, domain deletions, etc.) have been weeded out by negative selection. Thus, we only observe events of modular rearrangements that are either beneficial or neutral. For the sake of comprehensiveness, we used the ProDom database [22], which records conserved sequence fragments. However, these fragments are not always identical to structural domains. To conform with the general definition of domains [3], all key results were confirmed using Pfam, which largely agrees with structural domain definitions [23].

In the following study we first investigated whether the relative frequencies of deletions (or recruitments) depend on whether a domain is at the end or in the middle of a protein. Unless explicitly stated, we used the term ‘deletion’ as synonymous for deletions and recruitments. We then investigated whether eliminations are more frequently observed at the boundaries of domains and whether or not domain substitutions are frequent. For that purpose, we categorized and described misannotations of domains to discern them from real substitutions or deletions of domains. Next, we studied whether some domains are more often lost and whether frequencies of domain deletions depend on domain versatility. Finally, we discussed the implications of our results for a wider understanding of modular protein evolution and the possibilities for generating a model in which modular protein evolution is formally described in terms of module edit operations and cost functions.

Results and Discussion

Single domain deletions

The first question we asked was whether the probability of a domain deletion is evenly distributed throughout a protein. The null hypothesis was that genetic mechanisms which lead to domain deletions (for example, deletions and insertions of sequence fragments, intron recombinations, etc.) do not depend on the position within the sequence. However, two factors could cause a bias. First, any point mutation that creates a premature stop codon will cause a C-terminal deletion of a protein. Likewise, a mutation leading to the emergence of an alternative transcription or translation start will cause an N-terminal deletion. Second, a fission producing two genes from one will result in the deletion of a terminal fragment from a protein or, vice versa, a fusion of two smaller proteins into one will result in the observed pattern.

We first grouped proteins by the number of domains they have (see the Materials and methods). For each protein, we searched for deletion events, that is, a protein which has exactly the same domain arrangement, except for a single domain missing anywhere in the arrangement. Then we calculated the frequency of the deletion at each domain position within the group of proteins containing a given number of domains.

We found that the domain deletions are more common at either of the protein termini, and that their occurrence is slightly higher at one of the termini, depending on the number of domains in the protein and the database selected (Fig. 1). The prevalence of terminal deletions did not depend on the number of domains in proteins, and the results for the Pfam and ProDom databases were similar. In only a few cases were slightly increased frequencies of domain deletions observed at a central position.

According to our predictions, this suggests that the genetic mechanism of domain deletions acts predominantly on sequence termini. Therefore, we tentatively propose that the insertions of new transcription start and stop codons, as well as gene fusion and fission, are more likely to occur than, for example, intron mobility caused by exon shuffling.

Multiple domain deletions

We supported the previous findings by analysing cases where one or more domains were deleted from a protein. We considered only deletions in which at least half of the domains of the full-length arrangement were preserved, to ensure that homologous arrangements were being compared. The results were similar to those of single domain deletions, in that the terminal deletions were prevalent (see the Supplementary material).

In many cases, a deleted domain is a part of a larger, deleted fragment. We have found that fragments deleted at either terminus are, in general, much longer than fragments deleted within a protein sequence. The deletions within the protein are much more often single domain deletions (Fig. 2). The total number of deletions that concern only a single domain is higher for the positions between the termini. However, the number of major deletions (deletions that span more than one domain) is higher at terminal positions. This supports the view that the deletions generally involve the protein termini.

In-detail analysis of the deletion events

During our analyses, we noted that some of the apparent domain deletions are actually just misannotations. A lack of a domain identifier at a given position in a protein annotation does not necessarily mean that the corresponding domain is physically deleted. Likewise, a different identifier does not necessarily signify a physical substitution. To address this problem, we constructed clusters of similar proteins that contained at least six ProDom domains. We aligned the domain arrangements within a cluster using a simple progressive multiple alignment algorithm [24], based on pairwise alignments generated using the Needleman-Wunsch algorithm [25] (Supplementary material).

Fig. 1. Statistics of single domain deletions in the whole SwissProt/TrEMBL set of proteins. The figure shows the relative proportion of domain deletions at different positions within the proteins of length 4, 6, 10 and 11 domains. Dark grey, Pfam; light grey, ProDom. (Horizontal axis: position.)

Fig. 2. Number of occurrences of domain deletions as a function of the length (in domains) of the deleted fragment. Diamonds, N-terminal deletions; squares, deletions within the protein; circles, C-terminal deletions. Single domain losses occur preferentially on one of the middle positions, whereas longer fragments tend to be deleted at the termini. (Horizontal axis: length of the deleted fragment, in domains.)

We were able to distinguish five types of phenomena that resulted in an apparent deletion from the domain arrangement (Table 1, Fig. 3). The first two were real substitutions and physical deletions of domains. In some cases, at the site where the domain annotation was missing, there was, in fact, a sequence similar to the sequence of this domain. However, because of length or large evolutionary distance, this sequence was not annotated by the automatic annotation mechanism of ProDom (‘erosion’). In other cases, if there is a high sequence variation between the instances of the domains with a given identification number (ID), homologous sequences can be assigned different ProDom identifiers (‘camouflage’). Yet, in other cases, although the annotation (ProDom ID) of a given domain is missing, there is no physical deletion or misannotation. Instead, the amino acid sequence at this position is not similar to the given ProDom domain; therefore, it is a case of a real substitution. We call this case a ‘shadow domain’.

Table 1. Criteria used to distinguish between various types of sequence rearrangements and annotation artefacts that result in a disappearance of a domain in the domain string of a protein.

Evolutionary events
  physical deletion: a domain is physically deleted from the protein sequence, and only a short (<20 amino acids) fragment can be found between the neighbouring domains
  substitution: a domain is replaced by another domain that bears no similarity with the original domain
  shadow domain: at a given position, in one protein there is a ProDom domain; at the same position in another protein there is an amino acid sequence which is not similar to the given domain and which does not correspond to a ProDom ID

Annotation artefacts
  camouflage: although there are two different ProDom domains at the same position in two proteins, they are significantly similar (E << 1)
  erosion: the domain is not annotated in ProDom, but there is at this position a similar amino acid sequence

Fig. 3. Classification of domain-wise events observed in the domain databases. Different evolutionary events (A, B, C) and annotation artefacts (D, E) result in an apparent ‘deletion’ of a ProDom domain from a protein annotated in terms of ProDom domains. Domain and dot plots can be found in the Supplementary material. (Panels: Substitution, Shadow domain and Deletion under ‘Domain-wise evolutionary events’; Camouflage and Erosion under ‘Annotation artefacts’. E-value labels in the figure: E-value (B,D) ~ 1; E-value (B,seq) ~ 1; E-value (B,D) << 1; E-value (B,seq) << 1.)
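
The criteria of Table 1 can be read as a small decision procedure. The following Python sketch is only an illustration of that reading, not the authors' implementation: the function name, its arguments and the numeric cutoff standing in for ‘E << 1’ are assumptions introduced here, while the 20-residue limit is the one given in Table 1.

```python
from typing import Optional

# Hypothetical cutoff standing in for "E << 1"; the text gives no number here.
SIMILAR_E = 0.01
# From Table 1: a physical deletion leaves only a short (<20 aa) fragment.
MAX_LINKER = 20


def classify_event(other_domain: Optional[str],
                   gap_length: int,
                   e_value: Optional[float]) -> str:
    """Classify the apparent loss of a reference domain in a second protein.

    other_domain: domain ID annotated at that position in the second
                  protein, or None if nothing is annotated there.
    gap_length:   residues found between the neighbouring domains.
    e_value:      E-value of the local sequence (or domain) against the
                  reference domain, or None if nothing was compared.
    """
    similar = e_value is not None and e_value < SIMILAR_E
    if other_domain is not None:
        # Two different identifiers at the same position.
        return "camouflage" if similar else "substitution"
    if gap_length < MAX_LINKER:
        return "physical deletion"
    # An unannotated stretch of appreciable length remains.
    return "erosion" if similar else "shadow domain"
```

With this reading, an unannotated 120-residue stretch with an E-value of 1e-8 against the reference domain would be counted as erosion, and the same stretch with an E-value near 1 as a shadow domain.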

For each of these events, we counted its occurrence in the constructed protein clusters (see the Materials and methods for details), at each position in each protein cluster, as follows. If a domain was found to be deleted from an arrangement in a cluster, the amino acid sequences occurring in all the sequences of the cluster at the given position were analysed. We have applied the criteria from Table 1 to distinguish between the three types of real evolutionary events (physical domain deletion, substitution and shadow domains) and two types of annotation artefacts (camouflage and erosion). In the case of physical deletions, shadow domains and erosions, the numbers of these events were simply counted. However, in the case of substitutions and camouflage, it is not reasonable to count the number of occurrences of such an event without inferring a direction of the substitution. For example, if at a certain position in a cluster, domain A occurs in two sequences, and each of the domains B and C occurs five times, then what frequency of the substitutions should be assumed here? We have used the following routine: all possible pairwise combinations of domains from different proteins occurring at the same domain position in a cluster were analysed. If the two domains in a pair were different, then an event (substitution or camouflage) was recorded. Therefore, the calculated numbers of substitution and camouflage events cannot be used to infer any conclusions on the actual substitution rate of domains; however, because at all domain positions the number of camouflage and substitution events have been calculated in the same way, relative frequencies of the camouflage and substitution events at different positions can be inferred.
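
The pairwise routine can be illustrated on the example from the text (domain A in two sequences, B and C in five sequences each). This is a minimal sketch; the function name is hypothetical, and deciding whether a differing pair is a substitution or a camouflage would additionally require the similarity test of Table 1, which is left out here.

```python
from itertools import combinations


def count_differing_pairs(column):
    """Count protein pairs whose domains differ at one alignment position."""
    return sum(1 for a, b in combinations(column, 2) if a != b)


# Domain A in two proteins, B and C in five proteins each.
column = ["A"] * 2 + ["B"] * 5 + ["C"] * 5
print(count_differing_pairs(column))  # 2*5 + 2*5 + 5*5 = 45
```

The count grows with the product of the group sizes, which is why such numbers are comparable between positions but do not estimate a substitution rate.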

The relative frequencies of physical domain deletions, substitutions and shadow domains are all higher at the termini. The average domain deletion frequency is 9%: 7% at the nonterminal positions and 20% at the termini (Table 3). This trend cannot be seen in the case of annotation artefacts (Fig. 4, Table 3). Furthermore, annotation artefacts are 10 times rarer than real, physical events (Table 3). Therefore, our previous results for single-domain and multiple deletions are scarcely affected by inaccuracies of the database annotations and reflect real evolutionary events. This supports the aforementioned finding that the majority of deletions are caused by the physical deletions of protein termini.

We repeated this analysis to test whether there are differences between prokaryotes and eukaryotes; however, we did not find significant differences (see the Supplementary material).

Distribution of termini length in proteins

We have further pursued the question of whether the terminal deletions can be regarded as truly modular events; that is, to what extent evolution preserves domain boundaries upon domain deletion. The null hypothesis is that in the case of nearly neutral evolution, the domains are depleted gradually, and partially deleted domain fragments are common. In such a case, the evolution of proteins cannot be modelled by the approximation of domains or modules. However, several factors can make the situation different. First, selection pressure could rapidly eliminate the truncated fragments, because unnecessary biosynthesis of the nonfunctional protein fragments should reduce fitness. Second, if domain deletions are caused by genetic mechanisms preserving domain boundaries (such as gene fusions), partial domains will be rare. If this is the case, amino acid sequence deletions can be simplified to domain deletion events, and thus protein evolution could be abstracted to the level of modules.

Fig. 4. Results of the protein clusters analysis: relative percentages of different evolutionary events and annotation artefacts at different domain positions within the analysed sequences. Error bars indicate the standard error of the calculated proportion. The values for the ‘Middle position’ were averaged from the values for all nonterminal positions. (Panels: Evolutionary events; Annotation artefacts.)

We tackled this problem as follows. We have constructed clusters of proteins. Each cluster contained proteins with the same domain arrangement, or with an arrangement shortened by a terminal domain deletion, either N terminal or C terminal. We recorded the length of the N- or C-terminal amino acid sequence and plotted the distribution of its length (see the Materials and methods for details). The lengths were normalized for every protein cluster and then averaged for evaluation. A length of 0 corresponds to the case when the terminal domain is completely deleted, and 100 to the average length of the terminal domain in the whole cluster. Furthermore, we refined these results by counting only the protein sequence fragments that are similar, at the amino acid sequence level, to the remaining sequence of the deleted domain, given one of two E-value thresholds. These E-values between those fragments and the intact domain were recorded and put in three bins, each for a different range of E-values (any E-value; 0 ≤ E ≤ 0.01; 0 ≤ E ≤ 1 × 10⁻⁵).
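
The normalisation and the E-value binning can be sketched as follows. The function names and bin labels are hypothetical, and the bins are treated as nested ranges, which is an assumption based on the three ranges listed above.

```python
def normalised_length(fragment_length, mean_domain_length):
    """Remaining terminal fragment as a percentage of the cluster's
    average terminal-domain length (0 = fully deleted, 100 = average)."""
    return 100.0 * fragment_length / mean_domain_length


def e_value_bins(e_value):
    """Return every (nested) E-value bin that a fragment falls into."""
    bins = ["any"]
    if e_value <= 0.01:
        bins.append("E<=0.01")
    if e_value <= 1e-5:
        bins.append("E<=1e-5")
    return bins
```

For instance, a 60-residue remnant in a cluster whose terminal domain averages 120 residues has a normalised length of 50, and it contributes to all three histograms only if its E-value against the intact domain is at most 1 × 10⁻⁵.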

The distributions of termini lengths are shown in Fig. 5. The distributions show that complete domains are much more likely to be present in proteins, and that partial domains are rare at the terminal ends. These distributions also hold for sets of data in which sequences containing three or fewer domains were removed, and also in the case of Pfam domains (Fig. 5, bottom). If an E-value was applied (only fragments similar to the given domain were considered), the shorter sequences with a terminal fragment that was completely lost were eliminated from the histogram. This was not necessarily because the fragments were not homologous, but because the fragments were too short to show any significant similarity. However, the right part of the distribution, corresponding to sequence fragments of > 50% of the average domain length, did not change significantly (grey bars on Fig. 5).

Fig. 5. Length distributions of the remaining fragment from a terminal domain. Distribution of the length of the terminal sequences is based on comparison of domain arrangement alignments. Left, distribution on the N-termini; right, distribution on the C-termini. The lengths are relative to the size of the deleted domain (= 100%). White bars, all terminal fragments; light grey, terminal fragments similar to the deleted domain (E < 0.01); dark grey, terminal fragments significantly similar to the deleted domain (E < 1 × 10⁻⁵). Top, results for the ProDom database; bottom, results for the PfamA data set. (Horizontal axes: length of the N terminus in % of the deleted domain; length of the C terminus in % of the deleted domain.)

Domain deletions and domain versatility

Finally, we investigated whether the domain deletion events were connected to the properties of the deleted domains themselves. Specifically, we wished to establish whether the versatility of a domain plays a role in domain deletions. Furthermore, we considered that domains can, in general, fold autonomously. Therefore, we recorded how often domains that are lost form single-domain proteins.

Table 2. Deleted domains and domain versatility.

Columns: Position; Fraction as single for all (a); Fraction as single for deleted (b); Average NN for all (c); Average NN for deleted (d)

(a) Overall fraction of domains that were found to form single-domain proteins; (b) fraction of deleted domains that were found to form single-domain proteins; (c) average number of neighbours for all domains in the protein clusters ± standard error; (d) average number of neighbours for the deleted proteins ± standard error. As each of the domains in a middle position has two neighbours, the values in parentheses are the averages divided by two. The results are based on a dataset with proteins having 3 or more domains.

First, we calculated the fraction of domains that also occur as single-domain genes in the sets of domains that are deleted at an N-terminal, C-terminal or central position. We found that the domains which also occur as single-domain proteins are found two to four times more frequently at the termini, and twice as frequently at the C terminus as at the N terminus (Table 2). Surprisingly, the average fraction of domains that also occur as single-domain genes is lower for the domains that partake in deletion events than the average for all domains.

The ability of a domain to form autonomous, single-domain proteins may be related to its versatility. We have therefore calculated the domain connectivity and found that it is highest for the nonterminal domains. However, as the domains at a nonterminal position have, on average, two neighbours, whereas the terminal domains have only one, the averages for this type of domain must be halved. In that case, the percentages of domains that form autonomous, single-domain proteins are higher for domains that undergo deletions at the termini, and lower for domains that undergo deletions at a nonterminal position (Table 2). Again, the numbers of domains that form autonomous, single-domain proteins are highest for the domains that are deleted at the C terminus.

We conclude that the elevated rates of domain deletions at the terminal regions are partly related to domain versatility and the ability of domains to function outside a multidomain protein (to form single-domain proteins). The events involving domain acquisition/loss are twice as frequent at the C terminus as at the N terminus (Table 2).

Case study: bacterial formate dehydrogenases

An exemplary cluster of bacterial formate dehydrogenase proteins is shown in Fig. 6. This cluster illustrates several modular events, including domain deletion, a substitution by a diverged sequence fragment, and erosion (Fig. 6B). A multiple alignment of the protein sequences can be found in the Supplementary material. For some of the proteins the structure is known [26].

We analysed the phylogeny of the cluster, as derived from whole protein sequences (Fig. 6C). The obtained phylogenetic tree is consistent with the modifications of the domain arrangements (Fig. 6D), and the revealed events can be associated with the tree nodes. Significant rearrangements take place on the sixth position of the cluster where, in different proteins, we found two different ProDom domains, shadow domains and, at one position (in the protein O59078), a complete deletion. Further rearrangements are found at the protein C terminus: two proteins additionally have two other domains. The shadow domains may either be the result of a substitution by another sequence, or of such a high accumulation of mutations in a domain that it is no longer similar to the original sequence.

There are three variable regions in the domain arrangement of the protein cluster. First, at position 6 in the arrangement, in some proteins there are similar sequences that were not annotated in ProDom (‘erosion’) or domains which were annotated differently because of high sequence divergence (‘camouflage’). Next, at position 8, there is a substitution in two of the sequences. Finally, the C-terminal part is missing, truncated or eroded in many sequences, for example in the illustrated structure (Fig. 6A,B).

Conclusions

Our main conclusions are as follows: (a) domain deletion events occur frequently at either of the termini, (b) the deletions occur domain-wise; that is, in most of the cases the whole domain is lost, (c) domain losses correlate with domain versatility (i.e. the number of different combinations in which a domain occurs), (d) versatile domains are more frequently found at the C terminus and (e) clear definitions can be given to distinguish misannotations from physical deletions. Eventually the question ‘What is the probability of a domain deletion?’ can only be answered using domain phylogenies. However, our study shows that the deletion events are quite frequent; in the collected protein clusters, the frequencies of proteins in a cluster with a domain deleted at either of the termini were about 9% (Table 3), which provides a rough estimate for the frequency of deletion events in protein–protein comparisons.

Table 3. Results of the analysis of protein clusters for the ProDom database. Numbers in the table correspond to the absolute numbers of events recorded (% of the events recorded is given in parentheses).

Event                      Average (%)    N-terminus     Middle         C-terminus
Total number of domains
Real events:
  Deletions                13925 (9.2)    2998 (20.6)    8077 (6.6)     2850 (19.6)
  Substitutions            3034 (2.0)     546 (3.8)      2000 (1.6)     488 (3.4)
  Shadow domains           8770 (5.8)     1811 (12.5)    5399 (4.4)     1560 (10.7)
Annotation artefacts:
  Camouflage               1557 (1.0)     110 (0.8)      1391 (1.1)     56 (0.4)
  Erosion                  1235 (0.8)     82 (0.6)       1001 (0.8)     152 (1.0)

The fact that the domain deletions are not uniformly distributed along a protein, but nonetheless follow a distinct pattern, is an important conclusion in the context of constructing algorithms for sequence alignments that take into account domain arrangements of proteins. It also provides a biological justification for choosing a lower end-gap penalty in sequence alignment algorithms, such as CLUSTALW [27].

In conclusion, by analysing the versatility of deleted domains and their ability to form single-domain proteins, we have found that, while gene fusion and fission indeed play a significant role in the deletion events at the termini, the introduction of new start and stop codons also plays a major role. The fraction of the deleted domains that can be found as single-domain proteins was twice as high at the C terminus (Table 2), as was the connectivity of the C-terminally deleted domains. This suggests that in a gene fusion or fission event, the versatile, single-domain protein is more likely to be found at the C terminus. This may be explained by the fact that in a gene fusion/fission event, or in the case of introduction of new start and stop codons, the N-terminal part of the coding sequence remains connected to its promoter region and regulatory sites. Thus, a versatile domain that is fused with the C terminus of a much larger protein will not have an effect on the regulation of the whole protein, because it will not modify the promoter region and regulatory sites. Our results suggest such a selective disequilibrium: the function (and regulation) of the protein is connected to its N-terminal part, and therefore the fusion/fission events involving smaller, versatile domains will occur more frequently at the C terminus.

Moreover, we have found that the event of domain deletion occurs mostly in a modular manner. This can have two explanations. First, the apparent domain deletion can be caused by gene fusion or fission. Second, a truncated domain fragment (e.g. truncated by a nonsense mutation) that is no longer functional may be rapidly eliminated by natural selection. Either way, the domain deletions effectively respect domain boundaries. These results have further supported the emerging view that, by and large, the modular evolution of proteins is dominated by two major types of events: fusion, on the one hand, and deletion and fission on the other [3,4,21,28]. Exon shuffling and recombination seem to be rare.

Fig. 6. Cluster of the bacterial formate dehydrogenases. (A,B) The structure of formate dehydrogenase H (FDHF) from Escherichia coli. (C) Phylogeny of the analysed proteins obtained by the parsimony method with 100 bootstraps. (D) The corresponding domain arrangements of the analysed proteins. Colour code: (A) is coloured according to the ProDom annotation, with one colour for every domain. Colours and arrows on (B) indicate events identified by analysis of a cluster of related proteins and correspond to the coloured arrows on (C) and (D). The symbols on (C) show a possible attribution of the events to tree nodes: sub, substitution; del, deletion/insertion; colours of the symbols correspond to the colours on (B). The coloured boxes on (D) correspond to different ProDom domains and are the same as on (A). The black thin boxes on position 6 correspond to ‘shadow domains’.


Materials and methods

For the analyses, ProDom [22] version 2004.1 was used. The main results were confirmed using Pfam, release 17 [29]. Each database contains a number of domain arrangements, that is, proteins annotated in terms of domains. All supplementary materials can be found on our web page (http://www.uni-muenster.de/Bioinformatics/services/domdel/).

Overall single deletion statistics

Proteins from the ProDom database and, separately, from

the Pfam database, were divided into sets according to the

number of domains Each set contained all proteins with a

fixed number of domains, for example ‘set6’ contained

pro-teins with six domains

Each protein from a given set containing proteins of length N domains was compared with each protein from the set containing proteins of length N - 1 domains. For example, a protein with six domains was compared with all proteins that have five domains. If the shorter arrangement was identical to the longer one, with the exception of a single, missing domain, a deletion was registered. The position of the deletion within the domain arrangement was recorded. For example, the five-domain arrangement ABDEF (where A to F are domains) is identical to the six-domain arrangement ABCDEF, with the exception of the deleted domain C.
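The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `single_deletion_position` is hypothetical, arrangements are assumed to be sequences of domain identifiers, and reporting the leftmost position when repeated domains make the deletion site ambiguous is an assumption of this sketch.

```python
from typing import Optional, Sequence


def single_deletion_position(longer: Sequence[str],
                             shorter: Sequence[str]) -> Optional[int]:
    """Return the 1-based position in `longer` of the single missing domain.

    Returns None when `shorter` is not `longer` with exactly one domain
    removed. With repeated domains (e.g. AAB vs AB) several positions fit;
    this sketch reports the leftmost one (an assumption).
    """
    if len(longer) != len(shorter) + 1:
        return None
    target = list(shorter)
    for i in range(len(longer)):
        # Remove the domain at index i and compare with the shorter arrangement.
        if list(longer[:i]) + list(longer[i + 1:]) == target:
            return i + 1
    return None


# The example from the text: ABDEF is ABCDEF without domain C (position 3).
print(single_deletion_position("ABCDEF", "ABDEF"))
```

Single letters stand in for domain identifiers here; with real data each arrangement would more likely be a list of ProDom or Pfam accession strings.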

The average deletion frequency was calculated as the number of all deletion events divided by the total number of domains in all the examined sequences. The relative domain deletion frequency at a given domain position in a set of proteins of a given length was defined as the number of deletions at this position, divided by the total number of deletions in this set.
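As a hedged illustration of the two quantities defined above, the following Python sketch computes them from a list of recorded deletion positions. The function names and the input layout (one 1-based position per deletion event, arrangements as sequences of domains) are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def relative_deletion_frequency(positions: Iterable[int]) -> Dict[int, float]:
    """Deletions at each position divided by all deletions in the set."""
    counts = Counter(positions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pos: n / total for pos, n in sorted(counts.items())}


def average_deletion_frequency(n_deletions: int,
                               arrangements: Iterable[Sequence[str]]) -> float:
    """All deletion events divided by the total number of domains examined."""
    total_domains = sum(len(a) for a in arrangements)
    return n_deletions / total_domains


# Four hypothetical deletion events recorded in a set of six-domain proteins.
print(relative_deletion_frequency([1, 1, 3, 6]))
```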

These investigations were repeated with a nonredundant data set, in which each arrangement was represented only once. That is, from a set of proteins which had the same domain arrangement, only one representative was kept.

Overall multiple deletion statistics

For each given domain arrangement, we considered all other arrangements that could be obtained from it by removing one or more domains. For example, if A to F are domains, and ABCDEF is the given arrangement, then we would consider the arrangements ABCDE, BCD, ABEF, etc.
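One way to enumerate the arrangements considered here is to list every proper, non-empty subsequence of the given arrangement. The sketch below does this with `itertools.combinations`; the function name `deletion_variants` is hypothetical, and excluding the empty arrangement is an assumption of the example.

```python
from itertools import combinations
from typing import Sequence, Set, Tuple


def deletion_variants(arrangement: Sequence[str]) -> Set[Tuple[str, ...]]:
    """All arrangements obtained by removing one or more domains.

    Domain order is preserved; the full arrangement and the empty
    arrangement are excluded (the latter is an assumption of this sketch).
    """
    n = len(arrangement)
    variants = set()
    for keep in range(1, n):  # number of domains that are kept
        for idx in combinations(range(n), keep):
            variants.add(tuple(arrangement[i] for i in idx))
    return variants


variants = deletion_variants("ABCDEF")
print(tuple("ABEF") in variants, len(variants))
```

The number of variants grows as 2^n, so for long arrangements it may be cheaper to test each candidate protein for being a subsequence of the given arrangement rather than enumerating all variants.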

Similarity of protein arrangements

For the purpose of constructing multiple domain arrangement alignments and domain arrangement-based phylogenies, we implemented the Needleman-Wunsch global alignment algorithm [25] for protein domains, with the parameters as defined previously [17]: match = 10, mismatch = -5, gap = -1.
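A minimal sketch of Needleman-Wunsch global alignment applied to domain arrangements is given below, using the scores quoted above. It is not the authors' code: treating the gap score as a linear per-position penalty, the tie-breaking order in the traceback, and the use of `None` to mark gaps are assumptions of this example.

```python
from typing import List, Optional, Sequence, Tuple


def needleman_wunsch(a: Sequence[str], b: Sequence[str],
                     match: int = 10, mismatch: int = -5, gap: int = -1
                     ) -> Tuple[int, List[Optional[str]], List[Optional[str]]]:
    """Globally align two domain arrangements; gaps are returned as None."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a: List[Optional[str]] = []
    out_b: List[Optional[str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(None); i -= 1
        else:
            out_a.append(None); out_b.append(b[j - 1]); j -= 1
    return score[n][m], out_a[::-1], out_b[::-1]


total, row_a, row_b = needleman_wunsch(list("ABCDEF"), list("ABDEF"))
print(total, row_a, row_b)
```

Under a linear gap cost with these values, a mismatched column (-5) always scores lower than two gap columns (-2), so in this sketch differing domains end up opposite gaps rather than paired with each other.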

Construction of protein clusters

We constructed clusters of proteins with similarity in their domain arrangement of > 80%. Only clusters that had at least six domains were considered. For each protein from the ProDom database, all proteins were considered that had one domain less than the given protein. If a given protein matched the examined arrangement by all but one domain, a deletion event was recorded. Starting with a single protein, a number of hits was recorded and added to the cluster; furthermore, these proteins were used to obtain the next set of hits (i.e. proteins that have one domain less than the protein that was used in the search). The procedure stopped for a given cluster when no further similar domain arrangements were found. Only clusters containing at least 10 proteins and 10 ProDom domains were used for further analysis. Additionally, the amino acid sequences of all the proteins in the cluster were collected. The resulting clusters were subsequently aligned with a simple multiple-domain arrangement alignment algorithm (progressive alignment). The length (in terms of domains) of a cluster was defined as the length of the multiple-domain arrangement alignment.
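The iterative expansion step of the clustering can be sketched as a breadth-first search in which each newly added protein is used to look for further hits with one domain less. The sketch below covers only that step: the names `grow_cluster` and `is_single_deletion` are hypothetical, and the 80% similarity criterion and the size thresholds (at least 10 proteins and 10 ProDom domains) are deliberately left out.

```python
from collections import deque
from typing import Dict, Sequence, Set


def is_single_deletion(longer: Sequence[str], shorter: Sequence[str]) -> bool:
    """True if `shorter` equals `longer` with exactly one domain removed."""
    if len(longer) != len(shorter) + 1:
        return False
    target = tuple(shorter)
    return any(tuple(longer[:i]) + tuple(longer[i + 1:]) == target
               for i in range(len(longer)))


def grow_cluster(seed: str, proteins: Dict[str, Sequence[str]]) -> Set[str]:
    """Collect proteins reachable from `seed` by successive single deletions."""
    cluster = {seed}
    queue = deque([seed])
    while queue:  # stops when no further hits are found
        current = queue.popleft()
        for name, arrangement in proteins.items():
            if name not in cluster and is_single_deletion(proteins[current],
                                                          arrangement):
                cluster.add(name)
                queue.append(name)  # hits seed the next round of searches
    return cluster


# Hypothetical toy data: protein name -> domain arrangement.
toy = {"p1": "ABCDEF", "p2": "ABDEF", "p3": "ABDE", "p4": "XYZ", "p5": "ABCDE"}
print(sorted(grow_cluster("p1", toy)))
```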

Calculation of the relative event frequency at different domain positions in protein clusters

For each of the events, e, and for each of the sets of clusters of a given length, l, the frequency of the event at a position, k, was defined as:

$$f_{e,k} = \frac{n_{e,k}}{\sum_{i=1}^{l} n_{e,i}},$$

where $n_{e,i}$ is the number of occurrences of the event e at the domain position i. The average frequency at the middle positions (that is, all domain positions except the N- and C-termini) was calculated as:

$$n_{e,\mathrm{middle}} = \frac{\sum_{i=2}^{l-1} f_{e,i}}{l - 2}.$$

Finally, the N-terminal, C-terminal and central position frequencies for each event were averaged for all sets of clusters.
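The two definitions above translate directly into code. The following Python sketch assumes the per-position event counts for one event type and one cluster length are given as a list ordered from the N-terminal to the C-terminal position; the function names are hypothetical.

```python
from typing import List, Sequence


def position_frequencies(counts: Sequence[int]) -> List[float]:
    """f_{e,k}: count at position k divided by the sum over all l positions."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [n / total for n in counts]


def middle_average(freqs: Sequence[float]) -> float:
    """Mean of f_{e,i} over positions 2..l-1 (both termini excluded)."""
    l = len(freqs)
    if l < 3:
        raise ValueError("at least three positions are needed")
    return sum(freqs[1:-1]) / (l - 2)


# Hypothetical deletion counts at the six positions of length-6 clusters.
freqs = position_frequencies([4, 1, 1, 1, 1, 2])
print(freqs[0], freqs[-1], middle_average(freqs))
```

Dividing by l - 2 puts the middle positions on a per-position basis, so the value can be compared directly with the single N-terminal and C-terminal frequencies.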

Distribution of amino acid sequence length of the termini

For each of the databases ProDom and Pfam, two sets of alignments were created: one for N-terminal deletions, and one for C-terminal deletions. In each set, an alignment contained sequences that had one of the two types of arrangements: either a complete arrangement, or one in which a terminal domain was missing from the ProDom description. Alignments were constructed from the whole ProDom database. Only alignments which contained at least one complete sequence and one sequence with a missing domain (depending on the set, either N- or C-terminal) were considered.

For each alignment in each set, the average size of the deleted domain was calculated for the proteins with the complete arrangement. To take into account the variability of the length of the complete domain, the length of the N-terminal fragment was defined as the length of the amino acid sequence preceding the next domain in the arrangements, expressed as the percentage of the calculated average length of the deleted domain in this alignment. Finally, the distribution of these values throughout all of the analysed alignments was calculated.
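The normalisation described above amounts to a simple ratio. As a hedged illustration for the N-terminal case, the sketch below expresses the length of a terminal fragment as a percentage of the average deleted-domain length in the same alignment; the function name and the example lengths are hypothetical.

```python
from typing import Sequence


def terminal_fragment_percent(fragment_length: int,
                              deleted_domain_lengths: Sequence[int]) -> float:
    """Fragment length as a percentage of the average deleted-domain length.

    `deleted_domain_lengths` holds the lengths (in residues) of the terminal
    domain in the proteins of the alignment with the complete arrangement.
    """
    average = sum(deleted_domain_lengths) / len(deleted_domain_lengths)
    return 100.0 * fragment_length / average


# A 30-residue fragment where the complete domain averages 120 residues.
print(terminal_fragment_percent(30, [100, 140]))
```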

References

1 Patthy L (1999) Protein Evolution. Blackwell Science, Oxford.

2 Liu J & Rost B (2004) CHOP: parsing proteins into structural domains. Nucleic Acids Res 32, W569–W571.

3 Bornberg-Bauer E, Beaussart F, Kummerfeld S, Teichmann S & Weiner J 3rd (2005) The evolution of domain arrangements in proteins and interaction networks. Cell Mol Life Sci 62, 435–445.

4 Vogel C, Teichmann S & Pereira-Leal J (2005) The relationship between domain duplication and recombination. J Mol Biol 346, 355–365.

5 Vogel C, Berzuini C, Bashton M, Gough J & Teichmann S (2004) Supra-domains: evolutionary units larger than single protein domains. J Mol Biol 336, 809–823.

6 Gough J (2005) Convergent evolution of domain architectures (is rare). Bioinformatics 21, 1464–1471.

7 Apic G, Gough J & Teichmann S (2001) An insight into domain combinations. Bioinformatics 17 (Suppl 1), S83–S89.

8 Wuchty S (2001) Scale-free behavior in protein domain networks. Mol Biol Evol 18, 1694–1702.

9 Bornberg-Bauer E (2002) Randomness, structural uniqueness, modularity, and neutral evolution in sequence space of model proteins. Z Phys Chem 216, 139–154.

10 Madera M, Vogel C, Kummerfeld S, Chothia C & Gough J (2004) The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res 32, D235–D239.

11 Doolittle R & Bork P (1993) Evolutionarily mobile modules in proteins. Sci Am 269, 50–56.

12 Apic G, Huber W & Teichmann S (2003) Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J Struct Funct Genomics 4, 67–78.

13 Ponting C & Russell R (1995) Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem Sci 20, 179–180.

14 Uliel S, Fliess A & Unger R (2001) Naturally occurring circular permutations in proteins. Prot Eng 14, 533–542.

15 Fliess A, Motro B & Unger R (2002) Swaps in protein sequences. Proteins 48, 377–387.

16 Bujnicki J (2002) Sequence permutations in the molecular evolution of DNA methyltransferases. BMC Evol Biol 2, 3.

17 Weiner J 3rd, Thomas G & Bornberg-Bauer E (2005) Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics 21, 932–937.

18 Apic G, Gough J & Teichmann S (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol 310, 311–325.

19 Bashton M & Chothia C (2002) The geometry of domain combination in proteins. J Mol Biol 315, 927–939.

20 Vogel C, Bashton M, Kerrison N, Chothia C & Teichmann S (2004) Structure, function and evolution of multidomain proteins. Curr Opin Struct Biol 14, 208–216.

21 Kummerfeld S & Teichmann S (2005) Relative rates of gene fusion and fission in multi-domain proteins. Trends Genet 21, 25–30.

22 Corpet F, Servant F, Gouzy J & Kahn D (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res 28, 267–269.

23 Zhang Y, Chandonia J, Ding C & Holbrook S (2005) Comparative mapping of sequence-based and structure-based protein domains. BMC Bioinformatics 6, 77.

24 Feng D & Doolittle R (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25, 351–360.

25 Needleman S & Wunsch C (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443–453.

26 Boyington J, Gladyshev V, Khangulov S, Stadtman T & Sun P (1997) Crystal structure of formate dehydrogenase H: catalysis involving Mo, molybdopterin, selenocysteine, and an Fe4S4 cluster. Science 275, 1305–1308.

27 Thompson J, Higgins D & Gibson T (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680.

28 Weiner J 3rd & Bornberg-Bauer E (2006) Evolution of circular permutations in multi-domain proteins. Mol Biol Evol 23, 734–743.

29 Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy S, Griffiths-Jones S, Howe K, Marshall M & Sonnhammer E (2002) The Pfam protein families database. Nucleic Acids Res 30, 276–280.
