Revisiting the prediction of protein function at CASP6



Marialuisa Pellegrini-Calace1,*, Simonetta Soro1,* and Anna Tramontano1,2

1 Department of Biochemical Sciences ‘A Rossi Fanelli’, University ‘La Sapienza’, Rome, Italy

2 Istituto Pasteur, Fondazione ‘Cenci-Bolognetti’, University ‘La Sapienza’, Rome, Italy

The ability to predict the function of a protein, given its sequence and/or 3D structure, is an essential requirement for exploiting the wealth of data made available by genomics and structural genomics projects and is therefore raising increasing interest in the computational biology community. To foster developments in the area as well as to establish the state of the art of present methods, a function prediction category was tentatively introduced in the 6th edition of the Critical Assessment of Techniques for Protein Structure Prediction (CASP) worldwide experiment. The assessment of the performance of the methods was made difficult by at least two factors: (a) the experimentally determined function of the targets was not available at the time of assessment; (b) the experiment is run blindly, preventing verification of whether the convergence of different predictions towards the same functional annotation was due to the similarity of the methods or to a genuine signal detectable by different methodologies. In this work, we collected information about the methods used by the various predictors and revisited the results of the experiment by verifying how often and in which cases a convergent prediction was obtained by methods based on different rationales. We propose a method for classifying the type and redundancy of the methods. We also analyzed the cases in which a function for the target protein has become available. Our results show that predictions derived from a consensus of different methods can reach an accuracy as high as 80%. It follows that some of the predictions submitted to CASP6, once reanalyzed taking into account the type of converging methods, can provide very useful information to researchers interested in the function of the target proteins.

Keywords

Critical Assessment of Techniques for Protein Structure Prediction (CASP); function prediction; protein function; structural genomics

Correspondence

A Tramontano, Department of Biochemical Sciences ‘A Rossi Fanelli’, University ‘La Sapienza’, P.le Aldo Moro, 5-00185 Rome, Italy

Fax: +39 06 4440062

Tel: +39 06 49910556

E-mail: anna.tramontano@uniroma1.it

*These authors contributed equally to this work.

(Received 21 February 2006, revised 11 April 2006, accepted 5 May 2006)

doi:10.1111/j.1742-4658.2006.05309.x

Abbreviations

BIND, binding feature number indicating the nature of putative interacting partners; BS, binding site residue identifier; CASP, Critical Assessment of Techniques for Protein Structure Prediction; GOC, GO cellular component number; GOF, GO molecular function number; GOP, GO biological process number; PT, post-translational modification number; RESIDUE-ROLE, free comment on residues with a putative peculiar role.

Modern biology strongly exploits the rapid progress in generation of experimental data and development of computational methods. The identification and characterization of proteins on a genome-wide scale is accomplished by proteomics projects, and computational biology plays a pivotal role in the determination of their structure–function relationships and in the prediction of their biological functions [1]. A number of computational and experimental methods have been developed in the last few years to try to address the function prediction issue using available information, from sequence homology to orthology, gene context and structural features [2,3]. Despite the complexity of the relationship between protein fold and protein function and the existence of proteins with multiple different functions [4,5], global structure similarity can often help in assigning a function to proteins [6,7]. This ability is important for the many ongoing structural genomics projects that will provide us with more and more structures of proteins with very low sequence identity with proteins of known structure.

The task of predicting the function of a protein is exceptionally interesting but very challenging. The existence of paralogous relationships implies that a common evolutionary origin does not guarantee common function [8,9]. Moreover, the discovery of moonlighting proteins, able to perform different functions in different conditions or environments, makes the problem even more complex [10,11].

The Critical Assessment of Techniques for Protein Structure Prediction (CASP) [12] community recognized the relevance of this issue and tried to foster novel developments by setting up a function prediction category in addition to the well-known structure prediction ones. The question addressed was whether and in which cases computational methods are able to provide useful information about the molecular or biological function of an unknown protein, with the aim of providing researchers with potentially useful information [12].

This category is intrinsically different from the other CASP categories because, at the end of the experiment, the function of the target protein is likely still to be unknown. However, the analysis of the submitted predictions made the assessors conclude that [13]:

(a) groups predicting the 3D structure of a protein only rarely used this information to predict its function as well, and vice versa;

(b) in a substantial fraction of cases, the same prediction was submitted by different groups and therefore a ‘prediction consensus’ could be derived for some targets.

CASP is run blindly. This implies that the assessor should not know the identity of the predicting groups and therefore cannot take into account the method used for deriving a prediction. Indeed it was suggested and accepted that, in subsequent editions of the experiment, a general description of the method used should be made available to the assessor.

After the experiment was concluded, we revisited the data collected and reassessed the results, taking into account the methods used. We took advantage of the knowledge of the identity of the predicting groups as well as of functional annotations that have become available in the meantime. Our results show that a basic knowledge of the methods is important for understanding the level at which predictions can be trusted and for assessing the reliability of the predicted functions. They also show that predictions derived by a consensus among groups using different, non-redundant methods can reach an accuracy of 80%.

Results

The protein set at the beginning of CASP6 contained 87 targets, 23 of which were discarded during the experiment because of practical issues, such as early or late release of the 3D protein structure. Therefore, the set that was considered contained 64 protein targets, 29 of which had no functional annotation in any database at the time of the experiment. A function prediction for at least one of the 64 targets was submitted by 23 of the total 172 predictors.

Within the CASP6 experiment, seven classes of function predictions were considered: GO molecular function numbers (GOFs), GO biological process numbers (GOPs), GO cellular component numbers (GOCs), binding feature numbers indicating the nature of putative interacting partners (BINDs), binding site residue identifiers (BSs), free comments on residues with a putative peculiar role (RESIDUE-ROLEs) and post-translational modification numbers (PTs). In the present study, the BS and RESIDUE-ROLE subsets were not analyzed because of the high variability in the type and format of the submitted predictions.

We considered 1590 total function predictions submitted by 18 groups for 64 protein targets, as GOFs (568), GOPs (445), GOCs (363), BINDs (150), and PTs (64).

Method classification

The first step was to recover the information about the method used by each predictor, by inspecting the abstracts submitted to CASP, performing literature searches and, in some cases, directly contacting the predictors.

Each method was assigned to one or more of five categories (F1 to F5), here called features, on the basis of the type of information used by predictors. The use of sequence information corresponded to the F1 category. F2 indicated the use of structural features. Methods using the GO database for any reason other than deriving GO numbers for submission were assigned to F3. Literature-based methods and manual methods were indicated as F4 and F5, respectively (Table 1). Therefore, each method could be classified by a five-bit binary code indicating the presence (1) or absence (0) of each of the five features. For instance, the 10000 code corresponds to completely automated (F5 = 0) methods based on sequence information (F1 = 1) which do not take advantage of structural (F2 = 0), GO (F3 = 0) and literature (F4 = 0) information.
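The five-bit classification can be illustrated with a minimal Python sketch. The function names `method_code` and `decode` are hypothetical and are not taken from the authors' scripts; only the feature order F1 to F5 and the meaning of the bits come from the text.

```python
# Feature order follows the text: F1 sequence, F2 structure, F3 GO database,
# F4 literature, F5 manual intervention.
FEATURES = ("F1", "F2", "F3", "F4", "F5")


def method_code(used):
    """Return the five-bit binary identifier for a set of feature labels."""
    unknown = set(used) - set(FEATURES)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return "".join("1" if feature in used else "0" for feature in FEATURES)


def decode(code):
    """Return the set of feature labels encoded by a five-bit identifier."""
    if len(code) != len(FEATURES) or set(code) - {"0", "1"}:
        raise ValueError(f"not a five-bit binary code: {code!r}")
    return {feature for feature, bit in zip(FEATURES, code) if bit == "1"}


# A fully automated, sequence-only method, as in the example in the text.
sequence_only = method_code({"F1"})
```

Keeping the identifier as a string rather than an integer preserves the leading position of F1 and makes the codes directly comparable with those quoted in the text.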

The distributions of single features and of binary combinations are shown in Fig 1A and Fig 1B, respectively.

Although the possible theoretical combinations of features are 25 (5²), the F1 feature was used by all the 18 predictor groups, reducing the possible binary combinations to 20. The observed binary codes were only 8, showing a lower than expected variability in the combination of used information. More than 50% of the possible combinations were not found, and the majority of predictors used canonical feature combinations. In fact, all groups used sequence information, nine took advantage of literature information, and six used structure information, but only four predictors exploited the GO database and only one of them combined it with structural features. Moreover, predictors using GO never took literature information into account and vice versa. It is worth highlighting here that 11 groups used an approach developed in-house for features F1, F2 or F3.

For each of the function prediction classes, we computed a consensus value and a redundancy value within the consensus [F(red)] (Table 2 and Fig 2). The consensus value is defined as the number of identical predictions submitted by at least two predictors. The redundancy value F(red) indicates the method variability in terms of feature combinations within a consensus and is calculated as follows:

$$F(\mathrm{red}) = \frac{N(\mathrm{red})}{N(\mathrm{tot})}$$

where N(red) and N(tot) are the number of methods with the same binary identifier, i.e. the number of redundant methods, and the total number of methods generating the consensus, respectively. Lower F(red) values reflect a higher variability in the type of methods generating the consensus.
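The consensus and redundancy definitions can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the text does not say exactly how N(red) is counted, so the sketch assumes that a method is redundant when its binary identifier is shared with at least one other method in the same consensus. All group names, targets and terms are hypothetical toy values.

```python
from collections import Counter, defaultdict


def find_consensus(predictions):
    """Group (group, target, term) triples; keep predictions made by >= 2 groups."""
    groups_by_prediction = defaultdict(set)
    for group, target, term in predictions:
        groups_by_prediction[(target, term)].add(group)
    return {
        key: sorted(groups)
        for key, groups in groups_by_prediction.items()
        if len(groups) >= 2
    }


def redundancy_fraction(method_codes):
    """F(red) = N(red) / N(tot) for the methods behind one consensus.

    Assumption: N(red) counts every method whose binary identifier occurs
    more than once among the contributing methods.
    """
    if not method_codes:
        raise ValueError("a consensus needs at least one method")
    counts = Counter(method_codes)
    n_red = sum(n for n in counts.values() if n > 1)
    return n_red / len(method_codes)


# Hypothetical toy data: four groups and their binary method identifiers.
group_code = {"G1": "10100", "G2": "10011", "G3": "10000", "G4": "10000"}
predictions = [
    ("G1", "target_A", "term_x"),
    ("G2", "target_A", "term_x"),
    ("G3", "target_A", "term_y"),
    ("G3", "target_B", "term_z"),
    ("G4", "target_B", "term_z"),
]
consensus = find_consensus(predictions)
f_red = {
    key: redundancy_fraction([group_code[g] for g in groups])
    for key, groups in consensus.items()
}
```

In this toy case the consensus on `target_A` comes from two different method types (F(red) of 0), whereas the one on `target_B` comes from two methods of the same type (F(red) of 1), which is the distinction the redundancy value is meant to capture.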

For the GOF, GOP and GOC classes, we also computed a consensus value among the GO parents of the submitted predictions for up to three levels of the GO ontology (P1, P2 and P3), to verify whether less specific consensus could be achieved by different methods. The results are shown in Table 2.
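A parent-level comparison of this kind can be sketched as below. The sketch assumes the ontology is available as a simple child-to-parents mapping; the mapping, the term labels and the function names are hypothetical and do not reproduce the real GO graph or the authors' scripts.

```python
def ancestors_up_to(term, parents, levels):
    """Return the term plus every ancestor reachable in at most `levels` steps."""
    seen = {term}
    frontier = {term}
    for _ in range(levels):
        frontier = {p for t in frontier for p in parents.get(t, ())} - seen
        if not frontier:
            break
        seen |= frontier
    return seen


def shared_parents(term_a, term_b, parents, levels):
    """Terms common to both predictions when each is expanded by `levels` parents."""
    return ancestors_up_to(term_a, parents, levels) & ancestors_up_to(
        term_b, parents, levels
    )


# Hypothetical toy ontology (a small directed acyclic graph).
toy_parents = {
    "c1": ["b1"],
    "c2": ["b1", "b2"],
    "b1": ["a"],
    "b2": ["a"],
    "a": [],
}
p1_overlap = shared_parents("c1", "c2", toy_parents, levels=1)
p2_overlap = shared_parents("c1", "c2", toy_parents, levels=2)
```

Two predictions that differ at the most specific level (`c1` and `c2`) share a node once one parent level is allowed, which corresponds to the less specific P1 to P3 consensus described in the text.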

A consensus was never found for prediction categories other than GOF, GOP and BIND, although the number of GOF and GOP consensus predictions was significantly lower than the number of submitted predictions per target (70 out of 450 and 19 out of 457 for GOF and GOP, respectively). The fraction of targets for which a consensus could be found is high for the BIND class, accounting for about one third of the total submitted predictions (31 out of 150).

Table 1 List of features used to classify methods exploited for the predictions.

F1 Sequence information: BLAST, PFAM, INTERPRO, CHOP, CHIEFC, PROSITE, SMART, PRINTS, PHYRE, others

F2 Structure information: 3D-JURY, HMAP, PROFUNC, COLUMBA, PHUNCTIONER

F4 Literature information

Fig 1 Distribution of method features. (A) Number of methods including F1 (sequence information), F2 (structure information), F3 (use of GO database for any other reason than deriving GO numbers for submission), F4 (literature information), F5 (manual intervention), number of methods developed in-house (own) and number of methods for which no description is available (unknown). (B) Distribution of binary method class identifiers among the 18 methods submitting predictions (in-house developed methods and methods for which no information is available are not included).

Table 2 Number of GO identifiers predicted by at least two groups. GOF, number of GO function predictions; GOP, number of GO process predictions; GOC, number of GO cellular component predictions; BIND, number of binding predictions; PT, number of post-translational modification predictions; NA, not applicable. The number of predictions corresponding to the ‘unknown’ annotation is shown in parentheses.

About half of the consensus predictions were generated by two predictions only, except for the GOP class, where about 40% of the consensus predictions were obtained by three independent methods. It should be noted that some of the consensus predictions also included annotations such as ‘unknown molecular function’ or ‘unknown biological process’, highlighted in parentheses in Table 2. The exclusion of ‘unknown biological process’ predictions left only two of the 19 GOP consensus predictions that were generated by three submissions corresponding to three different methods.

Figure 2B shows a histogram of the fraction of redundancy for the three functional classes. Interestingly, redundancy values between 0 and 0.2 were often observed, corresponding to a variability of at least 80% in the combinations of features generating the consensus.

When annotations were grouped according to parent levels of the respective GO terms (one level, P1; two levels, P2; three levels, P3), neither the number of consensus predictions nor the fraction of redundancy changed significantly (supplementary material, Fig S1). This is most likely due to the somewhat limited depth of the GO graph, so that the existence of a common node between two predictions does not necessarily provide additional information.

Target function annotation versus time

At the beginning of the CASP6 experiment, 42, 32 and 9 targets had a molecular function, a biological process and a cellular component annotation, respectively. In 23 cases, information about interaction partners was available, whereas no annotation about post-translational modifications was present in the databases. One year later (October 2005), the available annotations decreased by 5% to 10%, showing that the knowledge of the 3D structure of a protein allows its function annotation to be improved, even if this can just imply removing a previous annotation (supplementary material, Fig S2). In fact, for 11 targets at least one molecular function annotation was either modified or deleted between the end of 2004 and the end of 2005 (supplementary material, Table S1). In the same period, the number of non-annotated targets decreased by 8, 6, 2 and 1 for GOF, GOP, GOC and BIND, respectively. These data confirm that the process of assigning a function to proteins is still a very difficult task for both experimental and computational biologists and suggest that there is still a long way to go to fully exploit genome-scale data.
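Tracking annotations over time amounts to comparing two database snapshots. A minimal sketch, assuming each snapshot is a mapping from target identifier to a set of GO terms (the data layout, the function name and the toy values are all hypothetical):

```python
def annotation_changes(before, after):
    """Return {target: (removed_terms, added_terms)} for targets whose annotation changed."""
    changes = {}
    for target in sorted(set(before) | set(after)):
        old = before.get(target, set())
        new = after.get(target, set())
        if old != new:
            changes[target] = (old - new, new - old)
    return changes


# Hypothetical toy snapshots taken at two dates.
snapshot_2004 = {"target_A": {"term_x"}, "target_B": {"term_y"}, "target_C": set()}
snapshot_2005 = {"target_A": {"term_x"}, "target_B": set(), "target_C": {"term_z"}}
changed = annotation_changes(snapshot_2004, snapshot_2005)

# Targets that lost an annotation and targets that gained their first one.
lost = [t for t, (removed, _) in changed.items() if removed]
newly_annotated = [t for t in changed if not snapshot_2004.get(t) and snapshot_2005.get(t)]
```

Separating removed from added terms makes it possible to count, as the text does, both the annotations that were withdrawn and the targets that became annotated in the interval.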

Interestingly, only in about half of the cases did predictions agree with function assignments that were subsequently removed, suggesting that taking into account different functional predictions for a given protein might be helpful in avoiding errors in database annotation.

Predictions versus target function annotation

We can reliably assess the correctness of a prediction only for those cases where a subsequently released annotation is available for targets that had no annotation at the time of CASP6 (supplementary material, Fig S2). This subset was made of only 11 targets and included 24, 7, 2 and 1 GOF, GOP, GOC and BIND annotations, respectively. Because of the sparseness of the data, the analysis was limited to GOF functional assignments only and included the eight targets listed in Table 3, five of which belong to the comparative modeling (CM) and three to the fold recognition (FR) classes. Sixty-eight groups submitted 85 predictions, 32 of which converged into 12 consensus predictions. These predictions were compared with the current target function annotation (October 2005) present in the UniProt [13], Entrez Gene [14] and InterPro [15] databases (Tables 3 and 4).

Fig 2 (A) Fraction of consensus of prediction [F(preds)] as a function of the number of contributing predictions [N(c-preds)]. (B) Fraction of consensus of prediction [F(c-preds)] as a function of the redundancy of the contributing methods [F(red)], i.e. of the number of methods exploiting the same source of information for the prediction (see text for a detailed definition). Series in both panels: GOFs, GOPs, BINDs.

Among the predictions submitted (85 by 68 groups), about 20% (18) were correct, i.e. overlapped with the current annotation of the corresponding targets. Interestingly, 17 of them were consensus predictions. Moreover, although a significantly higher number of consensus predictions was observed for CM than for FR targets (11 against 1, respectively), the only FR target (T0243) consensus matched the corresponding function annotation.
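The proportions behind these statements can be recomputed from the counts quoted in the text. The short sketch below uses only those counts; the last two ratios rest on the assumption that the 17 correct consensus predictions are counted among the 32 submissions that converged into a consensus, which the text implies but does not state explicitly.

```python
def fraction(part, whole):
    """Return part / whole, rejecting an empty denominator."""
    if whole <= 0:
        raise ValueError("denominator must be positive")
    return part / whole


# Counts quoted in the text for the eight newly annotated targets.
submitted = 85             # GOF predictions submitted
correct = 18               # predictions matching the October 2005 annotation
correct_in_consensus = 17  # correct predictions that were part of a consensus
in_consensus = 32          # submissions that converged into 12 consensus predictions

overall_correct = fraction(correct, submitted)
correct_share_in_consensus = fraction(correct_in_consensus, correct)

# Assumption: the 17 correct consensus predictions are a subset of the 32.
consensus_correct_rate = fraction(correct_in_consensus, in_consensus)
non_consensus_correct_rate = fraction(
    correct - correct_in_consensus, submitted - in_consensus
)
```

Under that assumption, roughly half of the submissions that took part in a consensus were correct, against about 2% of those that did not, which is consistent with the authors' point that convergence is informative.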

Even if the size of the dataset is too small to derive general conclusions, we believe that the success rate for these predictions supports both the usefulness of the experiment and the validity of our method for deriving the consensus prediction. Moreover, it suggests that an ‘easy’ structure prediction does not necessarily correspond to an ‘easy’ function prediction. The definition of ‘difficulty’ of a target has been the subject of many debates in the structure prediction field. Defining the difficulty of a function prediction, although clearly needed as well, is equally complex. Here we took the view that a target not annotated in any database at the time of prediction is difficult to predict. Given the time constraints imposed by the CASP experiment, most of the methods used in the experiment are automatic ones and it is unlikely that any of the predictions include a large amount of human intervention, unlike the case for curated databases.

We feel it is inappropriate to try to derive a ranking of the various methods on the basis of such a limited dataset, but if, as we expect, participation in future experiments increases, it will be possible to derive conclusions about the quality of different methods. More importantly, the number of consensus predictions will also increase, and this will allow a substantial number of correct functional predictions to be produced.

The CASP6 assessor highlighted five cases where a consensus could be derived by comparing the different submitted predictions, although the design of the experiment did not allow the redundancy of the methods to be taken into account at the time. Three of these targets (T0226, T0243 and T0263) were annotated between the end of CASP6 and the time that this analysis was performed. For T0243 and T0263, the newly deposited annotations match the consensus prediction. T0243 was predicted and proved to bind DNA, and T0263 was predicted to have oxidoreductase activity and is indeed annotated as 3-isopropyl malate dehydrogenase. Both consensus predictions were achieved by predictions submitted by two different methods (type 10100 and 10011 for T0243 and type 11011 and 11100 for T0263). For T0226, there were three consensus predictions: isomerase, transferase and sugar binding; the current annotation suggests that the protein has a structural role, which may or may not be compatible with sugar binding.

Table 3 List of targets annotated after October 2004 and the corresponding submitted predictions. Target, CASP6 target identifier; Class, target classification (CM, comparative modeling; FR, fold recognition; NF, new fold); Ann DB, annotation database (EG, Entrez Gene; IP, InterPro; UP, UniProt); GOF, GO molecular function identifier, according to GO database definition; Pred(Sub), number of predictions submitted; NP, number of predictors; N(Cons), number of consensus predictions found; Pred(Cons), number of predictions generating the consensus. Bold, annotations correctly predicted by at least one group.

Table 4 GO function predictions by method class (October 2005 annotated targets only). Bin, class binary identifier; Sub Pred, number of predictions submitted by methods belonging to the class; Sub GOFN, number of GOF submitted by methods belonging to the class; Exact Pred, number of predictions (out of a total of 14) corresponding to annotations found in UniProt, Entrez Gene and InterPro databases; Exact GOFN, number of predicted GOF numbers (out of a total of 12) corresponding to annotations found in UniProt, Entrez Gene and InterPro databases. Numbers in parentheses indicate the predictions that can be clustered in terms of common GO parents (up to three levels).

Columns: Bin; Sub Pred (ConsP1-P3); Sub GOFN (ConsP1-P3); Exact Pred (ConsP1-P3); Exact GOFN (ConsP1-P3)

In the light of these findings, we can confidently conclude that consensus predictions, normalized on the basis of redundancy of the methods, could be useful to researchers in narrowing the number of biological assays needed for protein function assignments and in speeding up the difficult and challenging functional annotation process. As we anticipate that this will prove to be useful to our research colleagues, the consensus predictions for targets with no current function annotation are reported in Table 5.

Conclusions

The prediction of protein function is one of the major challenges of protein bioinformatics. The growing number of completely sequenced genomes has allowed the development of a number of new approaches complementary to the use of sequence analysis, which can be combined to elucidate complete functional networks and biochemical pathways.

The CASP community set up a new function prediction category aimed at understanding whether and in which cases computational methods are able to provide useful information about the molecular or biological function of an unknown protein and to provide useful information to researchers working on the target proteins.

Here we revisited the results of this CASP6 category with two aims: (a) to verify how relevant knowledge of the methods used by predictors is for assessing the results; (b) to see to what extent information made available after the end of the experiment could contribute to our understanding of which are the best strategies for providing useful information to researchers.

Our results show that consensus predictions generated by diverse methods, i.e. methods exploiting different sources of information, are more reliable than predictions obtained by a single method and can be used as indicators of the reliability of the prediction. A general knowledge of the methods used is therefore important for understanding the level at which predictions can be trusted and should be made available to the CASP assessor in the next round of the experiment.

The conclusions are based on a small number of cases, as we can only use the few cases for which annotation is available now and not at the time of the CASP experiment in order to properly assess the correctness of the predictions. However, it is interesting to note that, if we trust the submitted annotations, and therefore include predictions of already annotated targets in our analysis, a molecular function and biological process was correctly predicted for about half of the protein targets, and more than 80% of the exact predictions were within a consensus (data not shown). On the other hand, more than 30% of the 64 CASP protein targets still have no molecular function annotation in any database and more than half of them have no biological process annotation. Clearly, therefore, the assignment of functional data to proteins is still a very difficult task not only from a computational point of view, but also experimentally. We hope that the present analysis, and especially the observation of the high reliability of consensus predictions, will encourage predictors to participate in the next CASP functional prediction experiments as well as convince researchers to take into account the results in designing their experiments. It is our opinion that the CASP function prediction experiment can provide a significant contribution and promote important and useful development in the area of protein function prediction.

Table 5 Consensus of predictions for non-annotated targets (as in October 2005). Target, target identifier; GOF, predicted GO molecular function identifier; Function, molecular function description; NP, number of predictors; N(comb), number of different method binary identifiers, i.e. of nonredundant methods, that contribute to the consensus.

Target  GOF    Function                                                   NP  N(comb)
T0222   50825  Ice binding (antifreeze activity)                          4   1
T0232   4364   Glutathione transferase activity                           9   4
T0237   3793   Defense (immunity protein activity)                        2   1
        30528  Transcription regulator activity                           4   3
        3700   Transcription factor activity                              3   2
        3677   DNA binding (functional hypothesis: transcription factor)


Experimental procedures

Submitted predictions are available at the CASP web site (http://www.predictioncenter.org).

All analyses were performed using in-house built scripts in the PERL programming language.

References

1 Wolfson HJ, Shatsky M, Schneidman-Duhovny D, Dror O, Shulman-Peleg A, Ma B & Nussinov R (2005) From structure to function: methods and applications. Curr Protein Pept Sci 6, 171–183.

2 Jones S & Thornton JM (2004) Searching for functional sites in protein structures. Curr Opin Chem Biol 8, 3–7.

3 Gabaldon T & Huynen MA (2004) Prediction of protein function and pathways in the genome era. Cell Mol Life Sci 61, 930–944.

4 Todd AE, Orengo CA & Thornton JM (1999) Evolution of protein function from a structural perspective. Curr Opin Chem Biol 3, 548–556.

5 Thornton JM, Todd AE, Milburn D, Borkakoti N & Orengo CA (2000) From structure to function: approaches and limitations. Nat Struct Biol 7, 991–994.

6 Dietmann S & Holm L (2001) Identification of homology in protein structure classification. Nat Struct Biol 8, 953–957.

7 Orengo CA, Jones DT & Thornton JM (1994) Protein superfamilies and domain superfolds. Nature 372, 631–634.

8 Devos D & Valencia A (2000) Practical limits of function prediction. Proteins: Structure, Function, Bioinformatics 41, 98–107.

9 Rost B (2002) Enzyme function less conserved than anticipated. J Mol Biol 318, 595–608.

10 Jeffery CJ (2003) Moonlighting proteins: old proteins learning new tricks. Trends Genet 19, 415–417.

11 Jeffery CJ (2003) Multifunctional proteins: examples of gene sharing. Ann Med 35, 28–35.

12 Moult J, Fidelis K, Rost B, Hubbard T & Tramontano A (2005) Proteins: Structure, Function, Bioinformatics Supplement 7, 3–7.

13 Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res 33, D154–D159.

14 Maglott D, Ostell J, Pruitt KD & Tatusova T (2005) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 33, D54–D58.

15 Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, et al. (2005) InterPro, progress and status in 2005. Nucleic Acids Res 33, D201–D205.

Supplementary material

The following supplementary material is available online:

Fig S1 Percentage of consensus predictions as a function of the redundancy [F(red)] of the contributing methods. (A) GOF functional class; (B) GOP functional class; (C) GOP functional class, ‘unknown biological process’ prediction excluded; (D) GOC functional class.

Fig S2 (A) Number of annotated targets versus time: the dotted blue line indicates the number of annotated targets for which the annotation did not change between December 2004 and October 2005. (B) Number of non-annotated targets versus time.

Table S1 Submitted predictions for targets for which there was at least one GOF annotation in October 2004, subsequently removed.

This material is available as part of the online article from http://www.blackwell-synergy.com

Title	Revisiting the prediction of protein function at CASP6
Authors	Marialuisa Pellegrini-Calace, Simonetta Soro, Anna Tramontano
Supervisor	A. Tramontano
Institution	University ‘La Sapienza’
Field	Biochemical Sciences
Type	Scientific report
Year of publication	2006
City	Rome
