Title: A knowledge-based potential function predicts the specificity and relative binding energy of RNA-binding proteins
Authors: Suxin Zheng, Timothy A. Robertson, Gabriele Varani
Institution: University of Washington
Field: Chemistry, Biochemistry
Document type: Scientific report
Year of publication: 2007
City: Seattle
Pages: 14
File size: 1.2 MB



Suxin Zheng1,*, Timothy A Robertson2,* and Gabriele Varani1,2

1 Department of Chemistry, University of Washington, Seattle, WA, USA

2 Department of Biochemistry, University of Washington, Seattle, WA, USA

The sequence-specific recognition of RNA by proteins plays a fundamental role in gene expression by directing different cellular RNAs to specific processing pathways or subcellular locations. Many experimental studies have explored the molecular basis for the sequence dependence of protein–RNA recognition [1–4]; more recently, a few studies have explored this problem from a computational perspective as well [5–16]. However, these early studies have emphasized qualitative descriptions of the recognition process; relatively few attempts have been made to quantify the characteristics of protein–RNA interactions using computational approaches [17]. Here, we present a new approach for predicting the specificity of RNA-binding proteins and for evaluating the contribution of individual amino acids to the energetics of protein–RNA complexes.

Knowledge-based potential functions have been employed in protein structure prediction [18–27], as well as in the prediction of protein–protein [25,28–30] and protein–ligand interactions [30–33]. A few studies have explored the use of knowledge-based methods for the prediction of protein–DNA interactions from structure [30,34,35]. More recently, our group [36] and others [37] have independently demonstrated that knowledge-based potentials can provide quantitative descriptions of protein–DNA interfaces comparable to those provided using molecular mechanics force fields [37].

The relative scarcity of high-resolution structures of protein–RNA complexes has represented an understandable barrier to the quantitative application of computational approaches to the problem of protein–RNA recognition. However, we have previously demonstrated that a statistical hydrogen bonding potential can discriminate native structures of protein–RNA complexes from docking decoy sets [17]. As hydrogen

Keywords

distance-dependent potential; protein–RNA interaction; RRM recognition; statistical potential

Correspondence

G. Varani, Department of Chemistry and Department of Biochemistry, University of Washington, Seattle, WA 98195, USA

Fax: +1 206 685 8665

Tel: +1 206 543 7113

E-mail: varani@chem.washington.edu

*These authors contributed equally to this work

(Received 25 July 2007, revised 22 September 2007, accepted 19 October 2007)

doi:10.1111/j.1742-4658.2007.06155.x

RNA–protein interactions are fundamental to gene expression. Thus, the molecular basis for the sequence dependence of protein–RNA recognition has been extensively studied experimentally. However, there have been very few computational studies of this problem, and no sustained attempt has been made towards using computational methods to predict or alter the sequence-specificity of these proteins. In the present study, we provide a distance-dependent statistical potential function derived from our previous work on protein–DNA interactions. This potential function discriminates native structures from decoys, successfully predicts the native sequences recognized by sequence-specific RNA-binding proteins, and recapitulates experimentally determined relative changes in binding energy due to mutations of individual amino acids at protein–RNA interfaces. Thus, this work demonstrates that statistical models allow the quantitative analysis of protein–RNA recognition based on structure and can be applied to modeling protein–RNA interfaces for prediction and design purposes.

Abbreviations

KH, K homology; MD, molecular dynamics; PDB, Protein Data Bank; RRM, RNA recognition motif; SRP, signal recognition particle.

bonds represent only approximately 25% of contacts between protein and RNA [12], we reasoned that a more comprehensive approach would describe these interactions more effectively.

In the present study, we report the application of an all-atom, distance-dependent statistical potential to the prediction of sequence-specific recognition between proteins and RNA. We demonstrate that this approach can discriminate native structures of complexes from even close docking decoys, recapitulate experimentally determined relative binding energies (ΔΔGs) for several protein–RNA complexes, and predict the RNA sequences recognized by a number of different RNA recognition motif (RRM) and K homology (KH) domains. These results demonstrate that statistical models can be applied to problems requiring the high-resolution modeling of protein–RNA interactions. The anticipated future enrichment of the structural database will further improve the predictive performance of the potential.

Results

The all-atom distance potential is constructed from the distribution of interatomic distances observed in the high-resolution (< 2.5 Å) structures of protein–RNA complexes deposited in the Protein Data Bank (PDB). In this approach, the ‘correctness’ of a protein–RNA structure is assumed to be approximated by the sum of the probabilities of observing the set of intermolecular distances defined in the 3D structure, relative to the likelihood of encountering such distances in the dataset of all protein–RNA structures. This kind of method was proposed by Sippl [20], and has been applied to protein structure prediction, protein–protein and protein–ligand interactions [18–33], as well as to protein–DNA recognition [30,34–37]. The distance-dependent statistical potential used here for protein–RNA interfaces is essentially identical to the score recently described by us for protein–DNA complexes [36]. The primary difference is the introduction of a new pseudocount correction, where an optimized number of pseudocounts is added to the observed counts for each atom pair (for additional details, see Experimental procedures). As a control, we also tested a simple contact-counting method, wherein every contact between protein and RNA (within a given distance cut-off) was assigned the same score of −1.
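The scoring idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exact functional form, the bin width and the optimized pseudocount value are given in the Experimental procedures, which are not reproduced here. The sketch assumes a generic Sippl-style log-odds score with a reference distribution pooled over all atom pairs, a hypothetical 1 Å bin width and an arbitrary pseudocount weight; all function and variable names are illustrative.

```python
import math
from collections import defaultdict

CUTOFF = 6.0      # Å; the contact cut-off reported for the decoy tests
BIN_WIDTH = 1.0   # Å; assumed bin width, not taken from the paper
PSEUDO = 10.0     # assumed pseudocount weight; the paper optimizes this value


def bin_index(distance):
    """Map a distance in Å to a distance-bin index."""
    return int(distance // BIN_WIDTH)


def train(contacts):
    """Collect counts from (protein_atom_type, rna_atom_type, distance) tuples."""
    pair_bin = defaultdict(float)    # counts per (atom pair, distance bin)
    pair_total = defaultdict(float)  # counts per atom pair over all bins
    bin_total = defaultdict(float)   # counts per bin over all atom pairs
    n = 0.0
    for a, b, d in contacts:
        if d >= CUTOFF:
            continue
        k = bin_index(d)
        pair_bin[(a, b, k)] += 1
        pair_total[(a, b)] += 1
        bin_total[k] += 1
        n += 1
    # Reference state: fraction of all contacts that fall in each bin.
    ref = {k: c / n for k, c in bin_total.items()}
    return pair_bin, pair_total, ref


def pair_score(model, a, b, d):
    """Log-odds score for one contact; negative values are favorable."""
    pair_bin, pair_total, ref = model
    k = bin_index(d)
    if d >= CUTOFF or k not in ref:
        return 0.0
    # Pseudocounts pull sparsely observed pairs toward the reference state.
    observed = pair_bin.get((a, b, k), 0.0) + PSEUDO * ref[k]
    expected = (pair_total.get((a, b), 0.0) + PSEUDO) * ref[k]
    return -math.log(observed / expected)


def distance_score(model, contacts):
    """Sum the pair scores over all intermolecular contacts of a structure."""
    return sum(pair_score(model, a, b, d) for a, b, d in contacts)


def contact_count_score(contacts):
    """Control score: every contact within the cut-off contributes -1."""
    return -sum(1 for _, _, d in contacts if d < CUTOFF)
```

With this form, an atom pair never seen in training scores exactly zero, and pairs that are enriched at a given distance relative to the pooled reference receive negative (favorable) scores. The contact-counting control ignores atom types and distances entirely, which is why it serves as a baseline for interface size.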

Docking decoy discrimination

An important property of any potential function is its ability to discriminate cognate (native, crystallographically determined) structures from noncognate (decoy) structures [38]. As a preliminary test of our method, and as a direct comparison with previous work, we used our distance-dependent potential to evaluate five sets of docking decoys generated for the application of the ROSETTA physical potential function to protein–RNA interactions [17]. These decoys were created using a combination of rigid-body docking and protein side-chain repacking, and range in rmsd (relative to the native structure) from 0.2 Å to over 20 Å. Thus, they represent a solid basis for comparison to a much more complex scoring method (the multiterm, hybrid physical/statistical potential function used by ROSETTA).

When scored with the distance-dependent potential, the native complex can always be identified as the best structure in each of the five decoy sets (Fig. 1), even for decoys that are very close to the native structure. The native structure Z-scores for these decoy sets are shown in Table 1. These values indicate a strong discriminatory ability, comparable to that reported by Chen et al. [17] using their significantly more complex scoring method. Overall, the distance potential (using a 6 Å cut-off) results in a mean native Z-score of −5.45, versus the value of −6.37 obtained by Chen et al. [17] (Table 1); this difference is statistically insignificant (P = 0.53, Welch’s two-sided t-test), indicating that the two methods perform comparably on this test.
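As a rough check on the statistics quoted above, the following Python sketch computes a native Z-score and the Welch t statistic from summary values. It rests on assumptions not stated in this section: the Z-score is taken as the native score minus the decoy mean, divided by the sample standard deviation of the decoys, and each mean ± SD in Table 1 is taken to summarize n = 5 decoy sets. Function names are illustrative.

```python
import math
import statistics


def native_z_score(native_score, decoy_scores):
    """Distance of the native score from the decoy mean, in decoy SD units."""
    mean = statistics.mean(decoy_scores)
    sd = statistics.stdev(decoy_scores)  # sample standard deviation (assumed)
    return (native_score - mean) / sd


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Summary values from Table 1: distance-dependent versus ROSETTA + HB.
t, df = welch_t(-5.45, 1.76, 5, -6.37, 2.58, 5)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Under these assumptions the statistic is small (t below 1 with about seven degrees of freedom), which is in line with the non-significant difference reported in the text.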

When we investigated protein–DNA complexes using the same approach, we demonstrated that the all-atom potential outperformed a reduced-atom description, where relevant atoms were grouped according to their chemical similarity (as described in the Experimental procedures) [36]. Given the relative sparsity of the structural database, we investigated whether a reduced-atom representation would lead to improved performance in the protein–RNA case. The all-atom potential performs better than the reduced-atom potential (mean Z-score −5.45 versus −4.66; see also supplementary Table S4), although the difference is not as striking as for protein–DNA complexes. We believe this is due to less favorable statistics (fewer structures of protein–RNA complexes). We anticipate that the increasing availability of protein–RNA structures, together with the availability of data on specificity, will further improve the performance of the knowledge-based predictive method presented here. We retained the all-atom representation because it is already slightly better than the reduced-atom approach.

The protein–RNA score has distinctive properties compared to the protein–DNA potential. When we scored the protein–RNA decoy set using the protein–DNA potential, the average Z-score was approximately half that obtained with the protein–RNA potential (−2.84 versus −5.45; see also supplementary Table S4). Thus, although the chemistry of RNA and DNA is very similar, the structure of RNA allows for different interactions between proteins and the two nucleic acids, and these differences are reflected in this result.

To investigate whether the statistical potential simply reflects the size of an interface or the number of intermolecular contacts, we also used a very simple contact-counting potential to evaluate the same decoys; in this method, the fitness of an interface is evaluated by counting the number of close approaches between the protein and RNA. Reassuringly, this method was much less effective, providing an average Z-score of −2.64, less than half of the average native Z-score found using the distance potential (Table 1).

Fig. 1. Score–rmsd plots for the five docking decoy sets generated by Chen et al. [17]; the score generated by the distance-dependent potential (in arbitrary units) is plotted versus the deviation from the native structure (open circle at rmsd = 0). (A) Poly A-binding protein in complex with polyadenylate RNA (PDB code: 1CVJ). (B) Nova-2 KH RNA-binding domain 3 (PDB code: 1EC6). (C) HuD protein in complex with AU-rich RNA (PDB code: 1FXL). (D) Human SRP19 in complex with human SRP RNA (PDB code: 1JID). (E) Human U1A protein in complex with U1 snRNA hairpin (PDB code: 1URN). Close-up views of near-native decoys (0–3 Å rmsd) are shown in the insets.

Interestingly, the magnitude of the observed Z-scores declines significantly as the contact cut-off is increased from 6 Å to 10 Å and then to 12 Å (see supplementary Table S5), suggesting that short-range contacts provide the bulk of the discriminatory power in this test. This result suggests that protein–RNA recognition specificity is primarily determined by short-range intermolecular contacts. Long-range effects (e.g. nonlocal electrostatics) appear to play a more limited role, at least in decoy discrimination.

To test the discrimination ability of the potential for near-native decoys, we next compared its ability to discriminate near-native protein–RNA structures with that of the force field implemented in the AMBER 8 molecular simulation package. We generated near-native protein–RNA decoys for 21 protein–RNA complexes by conducting molecular dynamics (MD) simulations of the native complexes, and by selecting multiple time-steps from the resulting trajectories for each structure. We then scored these structures using the distance-dependent potential function, and examined the correlations between distance scores and AMBER energies for each decoy set.

This is a difficult test of score performance because the structures are very close to native. Indeed, neither the distance-dependent score nor the AMBER potential appears to be able to discriminate native structures from these very near-native, MD-generated decoys (average Z-score of −0.69 versus −0.59; Table 2). Although there is no correlation of either score with rmsd, the distance-dependent statistical potential is somewhat correlated (average R² = 0.41) with the energy values predicted by the AMBER force field. Thus, it remains very difficult for either approach to discriminate the native structure from structures that are close to it in energy.

Identifying RNA-binding sequences from structure

Having established the performance of the statistical potential function in decoy discrimination, we investigated the ability of the potential to perform tasks relevant to its intended application. First, we sought to evaluate whether the potential could predict the cognate recognition sequences of RNA-binding proteins. This is a particularly important problem because sequence specificity is known for only a fraction of all RNA-binding proteins. The ability to predict (or at least narrow down) the cognate sequence for ‘orphan’ RNA-binding proteins would greatly facilitate the design of biological experiments aimed at dissecting the function of these proteins. It is also a problem that is not well suited for MD approaches because of the demanding computational requirements.

Table 1. Native Z-scores and score–rmsd correlation coefficients for the protein–RNA docking decoy sets prepared by Chen et al. [17].

Z-scores     Distance-dependent (a)   Coulomb (b)      ROSETTA + HB (c)   Contact count (a)
Mean ± SD    −5.45 ± 1.76             −1.31 ± 0.18     −6.37 ± 2.58       −2.64 ± 0.84

(a) Using a 6 Å contact cut-off. (b) From Chen et al. [17] and referring to a potential lacking the directional component of hydrogen bonding (HB) interactions. (c) From Chen et al. [17] and referring to the complete potential function.

Table 2. Z-scores and correlations for near-native decoys generated by MD simulation.

             Largest rmsd (Å)   Z-score, distance-dependent   Z-score, AMBER   Distance-dependent versus AMBER (R²)
Mean ± SD                       −0.69 ± 1.28                  −0.59 ± 1.94     0.41 ± 0.15

This application relies on a specific structural model of RNA recognition by RRM and KH domains involving four nucleotides, as detailed in the Experimental procedures. This model is strongly supported by previous research on the mechanism of RNA recognition for RRM proteins [6,39,40] and by the structure of existing KH domains bound to RNA [6,41–44]. As a consequence of the assumptions of the model, complexes containing two RNA-binding domains were divided into independent structures (e.g. 1CVJ_1 and 1CVJ_2 represent the first and second Poly A binding protein domains of structure 1CVJ, respectively), and the two domains were considered structurally and thermodynamically unrelated. Because the model assumes that each RRM and KH domain binds to each of four nucleotides independently, we generated a set of 4^4 (256) different structures for each protein–RNA complex by computationally ‘threading’ all possible four-nucleotide combinations onto the RNA bases nearest the center of the β-sheet structure of the RRM. We then scored these sequence-variant structures with the distance-dependent potential function.

Figure 2 shows the results of this analysis. If the potential and model of recognition were perfect, and if each structure was sequence-specific and corresponded to the most favorable sequence recognized by a given domain, the cognate sequences of the tested structures would be expected to rank as number 1. Because it is unlikely that the cognate recognition sequences for all domains will be consistently assigned the best score, we expressed sequence-discrimination performance in terms of percentiles (where perfect discrimination of the cognate recognition sequence would result in a percentile score of 100). Remarkably, we found that 18 of the 29 tested RRM and KH domain complexes had their cognate recognition sequence ranked above the 90th percentile (i.e. had better than ten-fold enrichment for the correct sequence). Furthermore, the distance-dependent potential ranks the cognate recognition sequences of the protein–RNA complexes in our test set above the 90th percentile, on average. By contrast, when we performed the same test using a simple counting potential as a control (Fig. 2), the average rank was the 41st percentile.
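The enumeration and ranking step can be sketched in Python as follows. This is only an illustration under stated assumptions: the structural threading and the real distance-dependent score are replaced by a hypothetical placeholder scorer (`toy_score`), lower scores are taken to be better, and the percentile is assumed to be 100 × (N − rank) / (N − 1), which maps rank 1 to 100. The percentile formula actually used in the study is not given in this section.

```python
from itertools import product

BASES = "ACGU"


def all_tetramers():
    """Enumerate all 4^4 = 256 four-nucleotide RNA sequences."""
    return ["".join(p) for p in product(BASES, repeat=4)]


def rank_and_percentile(cognate, score_fn):
    """Rank the cognate sequence among all tetramers (lower score is better)."""
    seqs = all_tetramers()
    scores = {s: score_fn(s) for s in seqs}
    # Rank 1 means that no other sequence scores strictly better.
    rank = 1 + sum(1 for s in seqs if scores[s] < scores[cognate])
    # Assumed percentile definition: rank 1 -> 100, last rank -> 0.
    percentile = 100.0 * (len(seqs) - rank) / (len(seqs) - 1)
    return rank, percentile


def toy_score(seq, preferred="UGCA"):
    """Hypothetical stand-in for scoring a threaded structure.

    Each position that matches an arbitrary preferred sequence contributes -1.
    """
    return -float(sum(x == y for x, y in zip(seq, preferred)))


rank, pct = rank_and_percentile("UGCA", toy_score)
print(rank, pct)
```

In a real application, `score_fn` would build the threaded structure for each sequence and return its distance-dependent score; the ranking logic itself would be unchanged.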

Among successful examples of binding-sequence discrimination, the native sequences of the RRM1 of Sex-lethal protein (1B7F_1) and the KH1 domain of Poly C-binding protein-2 were both ranked first out of 256 sequences, whereas KH domain 3 of hnRNP K (1ZZI), the RRM of U2B′ protein (1A9N) and RRM 4 of Polypyrimidine Tract Binding protein (2ADC_1) each had their cognate recognition sequences ranked in the top 3 (Supplemental Table S2). However, prediction was less successful for other RRM domains, such as the U1A complex (the cognate recognition sequence of U1A protein was ranked 30th). This result is nonetheless not too surprising given the noncanonical, seven-nucleotide recognition sequence (AUUGCAC) recognized by U1A, which makes an unusually specific and strong interaction with RNA, unparalleled in other known RRMs [45]. Relatively poor results were also obtained for the Poly A binding protein (1CVJ_1, rank 19), and for RRM1 of the HuD protein (1FXL_1, rank 32). Both Pab and HuD utilize two domains to achieve sequence-specific recognition in a cooperative manner and do not discriminate well between sequences that are related to their cognate recognition motif (A-rich and AU-rich sequences, respectively) [46]. Notably, the non-sequence-specific RNA helicase protein (PDB code: 2DB3, included as a negative control) had, as expected, a poor cognate sequence rank of 226/256.

Estimating experimentally determined relative RNA-binding affinities

A second very important property of any potential function is the ability to recapitulate the sequence dependence of experimental binding energies; this is a prerequisite if the potential is to be applied to problems of protein–RNA interface prediction or design. Fortunately, a few structures have a relatively dense set of experimentally determined binding constants for interface mutations. We used these experimentally characterized mutants to create a set of computationally ‘mutated’ structures of the complexes (Table 3), and have scored these structures using the distance-dependent statistical potential.

Fig. 2. Structure-based identification of RRM recognition sequences. The cognate sequence is ranked by the distance potential (cut-off = 6 Å) for RRM/KH domain proteins. The red line represents the rank of cognate recognition sequences using the contact-counting score; the blue line represents the rank of these sequences using the distance-dependent potential. The points in each colored line are sorted independently by rank; the x-axis is the sort order. The dashed line represents the 10th percentile.

A first very instructive example is provided by mutants of bacteriophage MS2 coat protein [47,48]. Starting with the crystal structure of the complex between MS2 coat protein and the cognate RNA hairpin (PDB code: 1ZDI), a series of structures was generated, representing the RNA and protein mutants for which binding constants are reported in the literature. Then the distance-dependent potential scores for these structures were compared with the known binding constants for each mutation. Unfortunately, when all of the MS2 mutations were considered together, a poor correlation was observed between distance score and experimental binding affinities (data not shown). However, excellent correlations were obtained between these values when the binding-affinity data were divided into two subsets (Table 3, Fig. 3). A first set corresponds to complexes where the bound RNA hairpin contained an adenine, guanine or uridine base at position −5; the second set instead contains protein mutants where the bound RNA contained a cytosine at this position. Within each set of mutants, the correlation between distance score and experimental binding affinity is strong (R² = 0.65, Fig. 3A; R² = 0.97, Fig. 3B), and statistically significant at the 95% confidence level. Figure 3C shows a likely explanation for this result: an intramolecular hydrogen bond formed by the cytosine at position −5 [47]. When this nucleotide is mutated to any other base, the intramolecular hydrogen bond is lost, leading to a reorganization of the RNA structure.

Table 3. Correlations between the distance-dependent score and the experimental free energy of binding for several mutant protein–RNA complexes.

                                        Distance-dependent         Contact counting
                                        6 Å     10 Å    12 Å       6 Å     10 Å    12 Å
Protein mutations
  MS2 (no cytosine at position −5)      0.43    0.50    0.65       0.19    0.10    0.08
  MS2 mutations (with cytosine at −5)   0.81    0.81    0.97       0.43    0.14    0.09
RNA mutations
  SRP; 2′-OH mutations                  0.87    0.56    0.52       0.36    0.30    0.29
  SRP; base mutations                   −0.07   −0.03   −0.07      0.01    0.07    0.05

(a) The native U1A complex was included in the training set for this experiment. (b) The U2B′ complex (U1A homolog) was included in the training set for this experiment.

Fig. 3. Correlation between scores generated by the distance-dependent statistical potential and experimental binding free energies (logKd) for mutants of the MS2 coat protein. (A) Complexes between protein mutants and RNA containing nucleotides other than cytosine at position −5. (B) Complexes between protein mutants and RNA containing cytosine at position −5. (C) The characteristic intramolecular hydrogen bond between the amino group of C5 and the O1P atom of U6 observed in the structure of the MS2–RNA complex containing a cytosine at position −5 that helps organize the RNA structure for protein binding [47].

This result does not provide direct information on the relative contribution of that hydrogen bond to the overall binding energy; it simply implies that mutations must be segregated into two groups to obtain a clear correlation between experimental and predicted relative affinities. The most likely explanation for this result is that, at present, the statistical potential does not consider RNA intramolecular contacts; therefore, contributions to binding energy due to changes in RNA structure (i.e. those that occur when that hydrogen bond is lost) cannot be captured by our current approach.
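The effect of an energy contribution that the score does not model can be illustrated with a short Python sketch. The numbers below are synthetic toy values chosen for illustration only; they are not MS2 data. Each group follows the same score–affinity trend, but one group carries a constant offset that the score cannot see, standing in for the RNA intramolecular contribution discussed above.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Synthetic toy data (not from the paper): predicted scores and "measured"
# affinities for two groups of mutants with the same underlying trend.
score_a = [1.0, 2.0, 3.0, 4.0]
affinity_a = [1.0, 2.1, 2.9, 4.2]
score_b = [5.0, 6.0, 7.0, 8.0]
affinity_b = [-1.0, -0.1, 1.1, 2.1]  # shifted by an unmodeled constant term

print("group A:", round(r_squared(score_a, affinity_a), 2))
print("group B:", round(r_squared(score_b, affinity_b), 2))
print("pooled: ", round(r_squared(score_a + score_b, affinity_a + affinity_b), 2))
```

In this toy case each group correlates strongly on its own, while the pooled correlation nearly vanishes, which mirrors the qualitative behavior described for the two MS2 subsets.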

A second example that reinforces our interpretation of the results obtained with MS2 is provided by Fox-1 protein, which regulates alternative splicing of tissue-specific exons by binding to the GCAUG sequence [49]. The structure of the complex (PDB code: 2ERR) and the experimental binding constants for two sets of related mutations have been reported [49]: one set for mutations on the Fox-1 protein and a second set for mutations to its target RNA molecule. A moderately strong correlation was observed between the distance score and the protein mutation data (R² = 0.46, Fig. 4), but an anticorrelation was observed for the set of RNA mutations (R² = −0.57; Table 3). As in the previous case, this result reflects the failure of the current statistical potential to capture the energetic contribution associated with the disruption of RNA intramolecular interactions that are a characteristic of this complex [49].

A third example is human U1A protein (PDB code: 1URN), a great model for the RRM superfamily because of the availability of NMR and crystallographic structures [50,51], as well as binding data. In this case, we observed poor correlations between the distance-dependent score and the experimentally determined dissociation constants (Kd) [52] when we conducted a test using a training set of strictly nonhomologous protein–RNA structures. Initially, we assumed that this observation would reflect the very large and energetically significant conformational changes that have been observed in the RNA and protein upon complex formation [53]. However, when the U1A complex itself was included in the training set, we obtained moderate to strong correlations (R² values between 0.27 and 0.65, depending on the choice of distance cut-off). This suggests that U1A binds to RNA by forming intermolecular interactions that are not commonly observed in the database of training structures. This hypothesis is supported by the observation that the inclusion of a close U1A homolog (the U2B′–U2A′ complex) in the training set improves the results of this test as well (R² increases from 0.04 to 0.39; Table 3). Thus, it appears that the structure of U1A or of its homologous complex contains a set of protein–RNA atomic contacts (i.e. interatomic distances) that are not well represented in the 71 other protein–RNA complexes in our training set.

Figure 5 shows the final example, a universally conserved component of the core of the signal recognition particle (SRP). The structure of the complex (PDB code: 1HQ1) and the binding affinity of a series of RNA mutants have been determined [54]. The distance potential results in scores that correlate significantly (R² = 0.52, P ≤ 0.05) with experimental binding affinities for mutations involving substitutions of deoxynucleotides for their corresponding ribonucleotides. However, as observed for Fox-1, no significant correlation was found for mutations of nucleotides that disrupt critical RNA intramolecular interactions. In this final case, these mutations involve the disruption of base pairs near the binding interface that define the secondary structure of the RNA, which is obviously important for recognition, but do not contribute directly to the formation of intermolecular contacts [54].

Fig. 4. (A) Correlation between scores generated by the distance-dependent statistical potential and experimental binding free energies (logKd) for mutants of the Fox-1 protein. (B) The intramolecular hydrogen bond between uracil 1 and cytosine 3, and the non-Watson–Crick base pair between guanine 2 and adenine 4 for the RNA in complex with Fox-1 protein (PDB code: 2ERR). The protein is represented in yellow; the RNA structure is colored by atom type.

Discussion

The central role of protein–RNA interactions in the regulation of gene expression has led to considerable interest in the biochemical processes underlying these interactions [55–57]. However, much of this research has been devoted to the study of the structure/function relationship for individual protein–RNA complexes, and little effort has been made to develop quantitative models that might describe these interactions more comprehensively. Thus, our understanding of the mechanisms driving protein–RNA recognition is still largely descriptive [11]. Recent work on protein–DNA interactions has shown that quantitative models of protein–nucleic acid recognition can provide insight into the mechanisms of gene regulation [58,59], and, in the not too distant future, promise to allow the rational design of DNA-binding proteins with altered specificity [60]. The development of computational tools capable of predicting the specificity of RNA-binding proteins across entire families (such as the RRM superfamily), or of redesigning the specificity of these proteins, would be of equal importance in dissecting post-transcriptional regulatory mechanisms, and in providing new tools to interrogate gene expression pathways.

In a previous study, our group demonstrated that a statistical potential function could be surprisingly accurate when used to predict protein–DNA interactions from structure [36]; this result was corroborated by a similar study published concurrently by another group [37]. Given these results, we hypothesized that the same approach would be equally successful with protein–RNA interfaces. Indeed, although various statistical techniques have been used by a number of groups for the prediction of protein structures, protein–DNA and protein–ligand interactions [18–35], such an approach has never been applied to protein–RNA interactions.

In the present study, we describe the successful application of the distance-dependent, all-atom statistical potential function to the prediction of the energetics and recognition specificity of protein–RNA interactions. We demonstrate that the statistical potential can recapitulate experimentally determined relative binding constants for a number of protein–RNA complexes (with the caveat that it cannot yet capture the effect of mutations on RNA–RNA interactions). We also demonstrate that this simple technique is remarkably successful at predicting the cognate recognition sequences of a wide variety of RNA-binding proteins.

The challenge of near native decoy discrimination

The statistical potential performs very well in classical decoy discrimination tests. It is quite remarkable that similar Z-scores in tests of decoy discrimination are obtained for the statistical score and the ROSETTA-derived score because this second method contains many more adjustable parameters that are optimized to reproduce the average composition of these interfaces as observed in nature. By contrast, the current statistical potential was generated ‘as is’ from the observed frequency of intermolecular contacts in the database of protein–RNA structures. Thus, it appears that the distance-dependent statistical potential implicitly captures at least some of the complexities of these intermolecular interactions that are explicitly enumerated in physical energy functions.

The question of how to generate and discriminate near-native decoys is still an open challenge for many areas of computational structural biology [61,62]. The docking decoy set used here contains many near-native decoys (e.g. < 1 Å rmsd) that can be discriminated by the distance-dependent potential (Fig. 1). However, when testing against the exceptionally near-native decoys generated by extracting snapshots from MD simulations (Table 2), we found that these near-native decoys could not be reliably discriminated from native structures, not even by AMBER, which was used to conduct the MD simulations. Thus, the question of how to create a potential that is sensitive to the extremely subtle structural variations present in very near-native decoys remains a challenging and important area of research. We are hopeful that the incorporation of terms describing the higher-order geometric preferences of protein–RNA interfaces (e.g. the incorporation of a directional hydrogen-bonding potential) [17] may enhance the discriminatory power of our method, as will the inevitable increase in high-resolution structural data available for training. Nevertheless, the distance-dependent potential function already performs on par with the AMBER and ROSETTA force fields in decoy discrimination tests.

Fig. 5. Correlation between scores generated by the distance-dependent statistical potential and experimental binding free energies (logKd) for ribose-to-deoxyribose mutants of a universally conserved protein component of the SRP.

The impact of contact distance cut-off on discriminatory power

The contact distance cut-offs used in the present study were varied to determine the value that maximizes decoy discrimination performance for protein–RNA complexes. Previously, Robertson et al. [36] showed that shorter contact cut-offs result in optimal discrimination ability in protein–DNA complexes, whereas Samudrala et al. [21] found that a longer cut-off (> 10 Å) was better able to discriminate correct structures during protein structure prediction experiments. Finally, Lu et al. [23] demonstrated that the first coordination shell (i.e. a cut-off between 3.5 Å and 6.5 Å) achieves the greatest selectivity for protein decoys created using gapless threading procedures; thus, the question remains as to the best choice of contact cut-off.

To evaluate the influence of different cut-off values in our study, replicate experiments were conducted using 6 Å, 10 Å and 12 Å distance cut-offs. In nearly all of our tests, the use of a shorter contact cut-off (6 Å) results in greater selectivity for structural details of the interface (Table 1). For the prediction of mutation energies, however, a longer cut-off appears to outperform shorter cut-off values for some sets of mutation data (Table 3). Some of these mutations are not near the protein–RNA interface (e.g. one of the U1A mutations, D79V, is 9 Å from the RNA molecule), and only the use of a longer cut-off value can capture these effects. In light of the differing conclusions of previous research [21,23,36], these results imply that a 'one size fits all' approach to energy function design may be limiting. In other words, it may be possible to significantly improve potential functions by customizing their parameterization to particular problems.
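
The effect of the cut-off on which interactions are seen at all can be illustrated with a minimal sketch. The function name and the coordinates below are hypothetical; the example only shows that a residue roughly 9 Å from the RNA contributes no contacts at a 6 Å cut-off but does at 10 Å or 12 Å.

```python
import math


def contacts_within(protein_atoms, rna_atoms, cutoff):
    """Return (i, j, distance) for protein/RNA atom pairs closer than cutoff (in Å)."""
    pairs = []
    for i, p in enumerate(protein_atoms):
        for j, r in enumerate(rna_atoms):
            d = math.dist(p, r)  # Euclidean distance between two 3D points
            if d < cutoff:
                pairs.append((i, j, d))
    return pairs


# Hypothetical coordinates: one protein atom 4 Å and one 9 Å from a single RNA atom
protein = [(4.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
rna = [(0.0, 0.0, 0.0)]

for cutoff in (6.0, 10.0, 12.0):
    n = len(contacts_within(protein, rna, cutoff))
    print(f"cut-off {cutoff:4.1f} Å: {n} contact(s)")
```

With these toy coordinates, the distant atom is invisible to the 6 Å cut-off, which is the situation described above for mutations that lie away from the interface.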

Prediction of RNA recognition sequences from protein–RNA complex structures

An obvious but not yet attempted application of any potential function for protein–RNA interactions is the prediction of cognate binding sequences. In a test of sequence recognition for 29 unique KH and RRM domains, we found that the potential is able to identify (within the 10th percentile) the cognate RNA recognition motifs of these domains approximately 70% of the time. Not all RRM/KH domains (for example, U1A) obey the simple four-nucleotide recognition model that we have introduced (where each nucleotide interacts independently with the protein) [6], and the specificity of some proteins is limited (i.e. they bind nearly equally well to a set of related sequences); given these caveats, this is a remarkably strong result. Despite the simple form of the statistical potential, and the over-simplifications of the four-nucleotide recognition model, this method is surprisingly robust over the diverse set of RNA-binding domains that we have considered.
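
The ranking step implied by the four-nucleotide recognition model can be sketched as follows. This is only an illustration under stated assumptions: the per-position scores are invented, the function names are hypothetical, and lower scores are taken to be more favorable. How the per-position scores are obtained from a structure is not shown here.

```python
from itertools import product


def rank_sequences(position_scores):
    """Rank all 4**n sequences by the sum of independent per-position scores.

    position_scores: list of dicts mapping 'A', 'C', 'G', 'U' to a score
    (lower = more favorable, an assumption). Returns sequences best-first.
    """
    scored = []
    for bases in product("ACGU", repeat=len(position_scores)):
        total = sum(position_scores[i][b] for i, b in enumerate(bases))
        scored.append(("".join(bases), total))
    scored.sort(key=lambda item: item[1])
    return [seq for seq, _ in scored]


def percentile_rank(ranked, cognate):
    """Fraction of all sequences ranked at or above the cognate sequence."""
    return (ranked.index(cognate) + 1) / len(ranked)


# Hypothetical per-position scores for a four-nucleotide site
scores = [
    {"A": 0.0, "C": 0.5, "G": -1.0, "U": 0.2},
    {"A": -0.8, "C": 0.1, "G": 0.3, "U": 0.0},
    {"A": 0.4, "C": -0.6, "G": 0.0, "U": 0.1},
    {"A": 0.2, "C": 0.0, "G": 0.3, "U": -0.9},
]
ranked = rank_sequences(scores)
hit = percentile_rank(ranked, "GACU") <= 0.10  # within the top 10% of 256 sequences
print(ranked[0], hit)
```

Because each position is scored independently, the model cannot represent coupling between neighboring nucleotides, which is one reason some domains are expected to deviate from it.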

Prediction of relative protein–RNA binding energies

When we evaluated the relative free energy of a set of mutations for several protein–RNA complexes of known structure, the distance-dependent potential was successful within defined structural classes. We observed strong, statistically significant (P ≤ 0.05) score–energy correlations for several sets of mutations that we tested; however, to achieve these results, it was necessary to subdivide several of the mutation data sets. For example, for the MS2 complex, the mutation data had to be divided into two classes based on the presence or absence of a cytosine at position −5 in the RNA. A likely explanation for the importance of the −5 cytosine mutation is offered by the observation that the amino group of the cytosine at position −5 makes an intramolecular hydrogen bond that increases the propensity of the free RNA to adopt the structure seen in the complex [48] (Fig. 3C). Because the distance potential currently measures only intermolecular interactions, it is unable to capture the thermodynamic effect of interactions within the RNA or protein, and of mutation-induced changes in RNA (or protein) structure. The good correlations of the distance potential with experimental binding energies (i.e. when sequence mutations are grouped according to the base identity at position −5) strongly suggest that the potential captures the energetic contributions of intermolecular interactions well.
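
The effect of subdividing a mutation data set can be illustrated with a small sketch that computes the Pearson correlation between score and binding energy, pooled and per class. The data, class labels, and function names are invented for illustration. Assessing significance (e.g. P ≤ 0.05) would additionally require a statistical test, such as the one provided by `scipy.stats.pearsonr`, which is not used here.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def correlation_by_class(records):
    """records: iterable of (class_label, score, energy); returns {label: r}."""
    groups = defaultdict(list)
    for label, score, energy in records:
        groups[label].append((score, energy))
    return {
        label: pearson_r([s for s, _ in pts], [e for _, e in pts])
        for label, pts in groups.items()
    }


# Made-up data: score tracks energy within each class, but the classes are offset
records = [
    ("C", -3.0, -1.0), ("C", -2.0, 0.0), ("C", -1.0, 1.0),
    ("nonC", -3.0, 2.0), ("nonC", -2.0, 3.0), ("nonC", -1.0, 4.0),
]
by_class = correlation_by_class(records)
pooled_r = pearson_r([r[1] for r in records], [r[2] for r in records])
print(by_class, round(pooled_r, 3))
```

In this toy case, a constant energy offset between the two classes (standing in for an effect the score cannot see) weakens the pooled correlation even though the score tracks the energy perfectly within each class.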

The same limitations observed in the MS2 mutation data led to the failures in prediction for RNA mutations in the Fox-1 and SRP complexes. In the structure of the Fox-1 complex, nucleotide U1 interacts with C3 by forming an intramolecular hydrogen bond, whereas G2 and A4 form a non-Watson–Crick base pair [49] (Fig. 4). Four out of seven Fox-1 RNA mutations that were tested directly affect these intramolecular interactions, which are not evaluated by the statistical potential used in the present study. In the case of the RNA mutations to the SRP complex, the mutated RNA residues are located in a double-stranded region of RNA, and do not interact with the protein [54], yet the disruption of the helix clearly affects the binding energy. The effect of these changes in RNA conformation cannot be captured by the intermolecular potential function used here.

Given these observations, it is reasonable to conclude that the omission of protein intramolecular contacts might also limit the predictive power of the method. However, additional examples will need to be examined before definite conclusions can be made concerning the application of statistical potentials to the prediction of relative binding energies.

The effect of training set composition on potential function performance

All knowledge-based potentials face the possibility of unintentional bias or over-training because their training depends upon the selection of a representative sample of structures. If great care is not exercised to ensure that this training set is unbiased (i.e. structurally heterogeneous), it is possible to create a statistical potential that unfairly scores certain structures more favorably than others simply because they are over-represented in the training set.

The challenge of over-fitting is particularly acute for protein–RNA interactions because there are relatively few high-resolution structures of protein–RNA complexes. Because of this limitation, a combined training/test set was used in the present study. To avoid bias, a 'leave one out' cross-validation strategy was employed: the tested structure was always excluded from the training set. Thus, every test in the present study was conducted with a different score, trained using only those structures that were not homologous to the tested protein–RNA complex.
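
The cross-validation scheme described above can be sketched as a generator of training sets, one per tested complex. The complex identifiers and family labels are hypothetical placeholders, and treating "homologous" as "same family label" is an assumption made for this illustration.

```python
def leave_one_out_splits(complexes, family_of):
    """Yield (test_id, training_ids), excluding the test complex and its homologs.

    family_of maps each complex id to a family label; complexes that share a
    label are treated as homologous (an assumption for this sketch).
    """
    for test_id in complexes:
        training = [
            c for c in complexes
            if c != test_id and family_of[c] != family_of[test_id]
        ]
        yield test_id, training


# Hypothetical identifiers and family assignments
complexes = ["cplx_A1", "cplx_A2", "cplx_B1", "cplx_C1"]
family_of = {"cplx_A1": "A", "cplx_A2": "A", "cplx_B1": "B", "cplx_C1": "C"}

splits = dict(leave_one_out_splits(complexes, family_of))
for test_id, training in splits.items():
    print(test_id, "->", training)
```

A separate potential would then be derived from each training list, which is why every test is effectively scored with a different function.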

This strategy cannot be avoided at the present time, yet it leads to situations where the training data do not contain enough information to capture particular structural phenomena. For example, we observed virtually no correlation between the distance-dependent score and the experimental binding affinity for mutations of U1A protein until the U1A complex structure was added to the training set (Table 3). Addition of the homologous U2B″ complex structure (PDB code: 1A9N) to the training set improved these results considerably, indicating that the training set was missing critical structural information that would help to discriminate native-like contacts unique to the U1A complex (an unusually high-affinity RRM, with a long, seven-nucleotide recognition sequence) [52]. We anticipate that the performance of the method will improve with the size of the structural database, as more high-resolution protein–RNA structures become available.

Conclusions

We have introduced a statistical potential function that discriminates the structures of native protein–RNA complexes from decoys, reproduces experimentally determined relative binding affinities for a number of RNA-binding proteins, and predicts cognate binding sequences for a large set of protein–RNA complexes. The statistical potential performs as well as highly optimized physical potential functions in tests of docking decoy discrimination. We anticipate that the performance of the potential will only increase with the size of the structural database and as terms are added to the model to account for protein and RNA intramolecular interactions that are currently ignored. Nevertheless, even in its current implementation, this statistical model achieves a high degree of sensitivity to subtle changes in protein–RNA interface structure. We are optimistic that this knowledge-based potential function will find broad application to problems requiring the high-resolution modeling of protein–RNA interfaces, such as structure-based genome annotation, or the rational design of novel RNA-binding proteins.

Experimental procedures

All-atom distance potential

The potential function used here is identical to a previously described method [36] (a more complete description of the method is provided in supplementary Doc S1), with the exception of a modified low-count correction. In the present study, the correction described by Sippl [20] is replaced with a weighted pseudocount method, where a constant number of pseudocounts (P) are added to the

