A hybrid clustering of protein binding sites



Gábor Iván1,2, Zoltán Szabadka1,2 and Vince Grolmusz1,2

1 Protein Information Technology Group, Department of Computer Science, Eötvös University, Budapest, Hungary

2 Uratim Ltd., Budapest, Hungary

Introduction

In recent years, the exploration of the human genome has received wide publicity. Although somewhat less emphasized, another important bioinformatics resource is the exponentially growing, publicly available Protein Data Bank (PDB) [1], containing more than 55 000 biological structures at the present time.

The three-dimensional structures of small molecules, e.g. drug molecules, can usually be calculated from their chemical composition. Several databases exist that contain millions of ligands. An example of this is the freely available ZINC database [2], created from catalogues of compound manufacturers. Contrary to ligands, the three-dimensional structures of proteins cannot be calculated easily; therefore, the importance of the rapid growth of the PDB cannot be overestimated.

Most antimicrobial drug molecules act as enzyme inhibitors. Inhibitors need to bind more strongly to the enzyme than the substrate of the enzyme does; consequently, the chemical and geometrical properties of the binding sites are of utmost importance in drug discovery and design.

The PDB contains the three-dimensional structures of more than 55 000 entries. In a separate study [3], we collected, verified and cleaned the list of approximately 27 000 binding sites found in the PDB. During the identification of these binding sites, we filtered out crystallization artifacts and covalently bound small molecules, and also considered broken peptide chains, modified amino acids and incorrectly labeled HET groups. The resulting cleaned, strictly structured RS-PDB database [3] can serve as an input for different data mining algorithms. One such technique of classification is clustering. By clustering the binding sites, it is possible to create binding site similarity classes. These classes can be useful for the classification of protein–ligand interactions.

Keywords

binding sites; clustering; distance; OPTICS;

PDB; sequence

Correspondence

V. Grolmusz, Protein Information Technology Group, Department of Computer Science, Eötvös University, Pázmány Péter stny. 1/C, H-1117 Budapest, Hungary and Uratim Ltd., H-1118 Budapest, Hungary

Fax: +36 1 381 2231

Tel: +36 1 381 2226

E-mail: grolmusz@cs.elte.hu

(Received 6 August 2009, revised 7 January 2010, accepted 12 January 2010)

doi:10.1111/j.1742-4658.2010.07578.x

The Protein Data Bank contains the description of approximately 27 000 protein–ligand binding sites. Most of the ligands at these sites are biologically active small molecules, affecting the biological function of the protein. The classification of their binding sites may lead to relevant results in drug discovery and design. Clusters of similar binding sites were created here by a hybrid, sequence and spatial structure-based approach, using the OPTICS clustering algorithm. A dissimilarity measure was defined: a distance function on the amino acid sequences of the binding sites. All the binding sites in the Protein Data Bank were clustered according to this distance function, and it was found that the clusters characterized well the Enzyme Commission numbers of the entries. The results for the 20 967 clustered binding sites, color coded by the Enzyme Commission numbers of the proteins, are available as HTML files in three parts at http://pitgroup.org/seqclust/

Abbreviations

EC, Enzyme Commission; gp, gap penalty; OPTICS, Ordering Points to Identify the Clustering Structure; PDB, Protein Data Bank.


In this article, we present a fast, sequence-based method for binding site clustering that takes into account amino acid sequences in the close neighborhood of binding sites. Our method is a hybrid, in the sense that it uses the sequence information together with steric data from the PDB in a clearly structured manner.

Previous work

There is a very rich literature describing techniques for identifying biological functions from structural protein information by the application of highly nontrivial mathematical tools [4,5]. Some of these tools have been applied to determine or analyze protein–protein interaction network topology [6–10] or binding sites [6,11]. A considerable amount of work has also been performed to devise structural properties that are independent of the polypeptide sequence order [12–14]. Unlike other binding site clustering solutions in the literature [15–18], we used a hybrid of order-independent methods that analyzes the three-dimensional structure of the binding site together with an order-analysis method; one of its main features is that our order-analysis method is capable of handling multiple polypeptide chains in the same binding site (Fig. 1).

Results and Discussion

Our main result was the OPTICS (Ordering Points to Identify the Clustering Structure)-based clustering of the 20 967 binding sites found. In order to verify the capabilities of the clustering method, we need to compare the clusters found with verified biological functions.

Verification of results: biological relevance

Ideally, proteins of the same or closely related functions ought to be assigned to the same cluster. We considered the Enzyme Commission (EC) number classification of enzymes [19], and color coded the EC numbers such that closely related functions were given similar colors, as provided in http://pitgroup.org/seqclust/bsites_AAcodes/EC_colour.html

The color-coded clusters, together with the ordinal number of the binding site, the PDB ID, the cluster ID and the EC number, can be found in three large HTML files (Page1, Page2, Page3) under http://pitgroup.org/seqclust/. The clusters correspond to concave regions in the figure.

The deviations of the EC numbers in all the clusters were also computed, and are given in the online table http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt. In most of the clusters, the deviation is zero; the average deviation is 1.71%.

We believe that the validation of the enzymatic functions through EC numbers shows that our clustering method is an adequate solution for binding site clustering and classification.

Parameter settings and examples

We present here, as examples, four binding sites from the largest cluster (element count: 448) (see Fig. 2). All four proteins are blood clotting factors. The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. It should be noted that the whole cluster is colored blue, and all the members of the cluster (between line numbers 702 and 1149; cluster ID: 28) have EC numbers of the form 3.4.21.X (serine proteases).

From the second largest cluster (element count: 188), three binding sites were visualized (Fig. 3). The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. It should be noted that the whole cluster is colored deep violet, and almost all members of the cluster (between line numbers 1224 and 1411) have EC numbers 3.4.23.16 (HIV-1 retropepsins). A more detailed analysis of the homogeneity of the clusters is given in http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt

Clustering quality measurement

The quality of clustering depends on several parameters. These include the distance function used to determine the similarity or distance of the objects and the parameters of the clustering algorithm. In order to obtain appropriate feedback about the quality of clustering with a given parameter setting, quality metrics need to be defined. For this purpose, we used the 'silhouette coefficient' [20]. The advantage of the silhouette coefficient is that it is completely independent of the type of data being clustered; it uses only object distances and cluster membership assignments for its determination. Basically, the silhouette coefficient measures how distinct the clusters are: the 'silhouette value' of a cluster is the smallest possible distance between an element of this cluster and an element of the neighboring clusters. The silhouette coefficient of the overall clustering is the average of the silhouette values for the individual clusters.

More exactly, the silhouette coefficient is defined as the average of the silhouettes taken over all the objects. The silhouette of object i is defined as $s_i = (b_i - a_i) / \max(a_i, b_i)$, where $a_i$ is the average distance of object i to the points of its own cluster, and $b_i$ is the minimum of the average distances of object i to the other clusters. It should be noted that, typically, $a_i < b_i$, and so the silhouette is equal to $1 - a_i / b_i$. Clearly, for good clustering, the typical $a_i$ value is much less than $b_i$; therefore, the silhouettes of the objects and the silhouette coefficient are close to unity.

Fig. 1. A binding site with four protein chains (PDB ID: 1CT8). Each chain is colored differently.

Fig. 2. Four binding sites (PDB IDs: 1ZPB, 1RXP, 1C5Z, 2BZ6) from the same cluster. The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. Note that the whole cluster is colored blue, and all the members of the cluster (between line numbers 702 and 1149; cluster ID: 28) have EC numbers of the form 3.4.21.X (serine proteases). More analysis on the homogeneity of the clusters is given in http://pitgroup.org/seqclust/EC_deviation.txt.

Fig. 3. Three binding sites from the same cluster (one site from PDB ID 1BDL and two sites from PDB ID 1W5V); these are HIV-1 proteases. The whole cluster is given in the online figure http://www.pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. Note that the whole cluster is colored deep violet, and almost all the members of the cluster (between line numbers 1210 and 1435) have EC numbers of the form 3.4.23.16 (HIV-1 retropepsins). More analysis on the homogeneity of the clusters is given in http://www.pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt.
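The per-object definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the use of -1 as the 'noise' label and the convention of assigning a silhouette of 0 to a single-member cluster are all assumptions made here. Objects labeled as noise are skipped.

```python
def silhouette_coefficient(dist, labels, noise=-1):
    """Average silhouette over all clustered (non-noise) objects.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- cluster ID per object; `noise` (an assumed marker) means unclustered
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        if lab != noise:
            clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    values = []
    for lab, members in clusters.items():
        for i in members:
            others = [j for j in members if j != i]
            if not others:
                values.append(0.0)  # assumed convention for single-member clusters
                continue
            # a_i: average distance to the other members of the own cluster
            a = sum(dist[i][j] for j in others) / len(others)
            # b_i: smallest average distance to any other cluster
            b = min(
                sum(dist[i][j] for j in m) / len(m)
                for other_lab, m in clusters.items()
                if other_lab != lab
            )
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
    return sum(values) / len(values)


# Toy data: points on a line, two tight groups plus one noise point.
points = [0, 1, 10, 11, 5]
dist = [[abs(p - q) for q in points] for p in points]
print(silhouette_coefficient(dist, [0, 0, 1, 1, -1]))
```

For two well-separated groups, as in the toy data, the value is close to unity.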

The data contained in Table 1 are based on empirical measurements. The values of the silhouette coefficient are strongly dependent on the applied distance function. Therefore, it is questionable whether clusters can be classified into rigid quality categories on the basis of the silhouette coefficient value. However, it is undoubtedly useful for comparing the quality of the clusters.

By definition, the silhouette coefficient requires the clustering algorithm to assign each binding site to a cluster. Thus, the silhouette coefficient value also shows the amount of noise contained in the database. The clustering algorithm used in this study is the OPTICS algorithm (see later). This algorithm allows some binding sites to be marked as 'noise' (thus not assigning them to any cluster). It does not seem reasonable for binding sites that are 'noise' to be taken into account twice (once, as the OPTICS algorithm marks them, and once during the calculation of the silhouette coefficient). Therefore, binding sites marked as 'noise' were not taken into account when calculating the silhouette coefficient. Nevertheless, for completeness, we show (Fig. 4) how the value of the silhouette coefficient would change if binding sites marked as 'noise' were taken into consideration with a silhouette = 0 value.

Effects of parameters on the quality of clustering and cluster size distribution

Within our binding site model, the distance function and clustering algorithm, three main parameters affected the properties of clustering: OPTICS MINPTS, OPTICS cut-off level and gap penalty (gp) of the distance function. We examined how these parameters affected the quality of clustering measured by the silhouette coefficient. The results are given in Figs 4 and 5.

• Effect of gp. Increasing gp improved slightly the quality of clustering. This is understandable if we consider that the introduction of a less strict gp function automatically decreases the average distance between the clusters.

• Effect of MINPTS. On increasing MINPTS, two main effects were observed. An increase in MINPTS yields better quality clustering. However, it also yields many more binding sites classified as 'noise'. The main cause of the latter effect is that the clusters that exist in the database, but contain fewer points than MINPTS, are not recognized; they are marked as 'noise'. On the basis of this observation, it can be stated that our binding site database contains numerous small clusters.

• Effect of OPTICS cut-off level. Increasing the cut-off level decreases the quality of clustering, and also the number of binding sites marked as 'noise'. The application of an extremely high cut-off level places almost all binding sites into the same cluster; the quality of such clustering can by no means be considered as high.

In conclusion, low MINPTS and low cut-off levels yield the best clustering quality (whilst covering 70–80% of the binding sites found in the PDB). In Figs 4 and 5, we represent the dependence of clustering quality on these parameters.

Methods

Binding site representation

As a first step, an exact definition of a binding site must be provided. For easy algorithmic handling, we stored the binding sites found in the PDB in a compact data structure.

The definition of binding sites

A binding site is defined as a set of atom pairs; the first atom of the pair belongs to the protein, and the second atom to the bound ligand, such that their distance is equal to the sum of the van der Waals' radii, calculated differently for different atom types. That is, only pairs within noncovalent binding distances are included in the list. Binding sites containing covalently bound ligands are not considered in this work, as our main motivation was to review pharmacologically significant binding sites.

A 'binding amino acid (or residue)' is an amino acid with at least one of its atoms in the binding atom pair. A 'binding amino acid sequence' is an amino acid sequence that contains at least one binding amino acid. Basically, binding sites are represented by storing all the binding amino acid sequences of all the protein chains that are present at the particular binding site.

Binding sites were extracted from the RS-PDB database described in [21] and [3]. By using this definition for binding sites, all amino acids from a given amino acid sequence that have at least one atom contained in an atom pair set (describing a binding site) can be identified.

Table 1. Cluster quality descriptions based on silhouette coefficient values in [20].

Silhouette coefficient   Clustering quality
0.00–0.25                Clusters cannot be adequately identified; cluster borders are not obvious
0.25–0.50                Clusters can be identified, but there are numerous unclassifiable points ('noise')
0.50–0.70                Most of the data/points can be classified
0.70–1.00                Excellent distinguishable clusters

Residue sequence representation

An amino acid sequence refers to sequences consisting of amino acids connected by peptide bonds that are of maximal length (i.e. they cannot be continued with further amino acids on either end).

It should be noted that multiple amino acid sequences might occur in the immediate vicinity of a single binding site, making binding site distance/similarity determination fairly complicated. An example of a binding site with four neighboring polypeptide chains can be seen in Fig. 1.

Binding amino acid sequences were first extracted from the binding sites of the RS-PDB database [3,21] and then simplified as follows.

A string was assigned to each amino acid sequence in a binding site. In this string, residues participating in the bond were indicated by their one-character code; nonbinding amino acids were indicated by '-'. As our purpose was to deal with only the binding sections, the prefixes and postfixes consisting of purely nonbinding amino acids (or, in our notation, '-') were deleted. Hence, all the strings constructed in this way start and end with a binding amino acid.

A binding amino acid sequence constructed and transformed in this way (from PDB entry 2BZ6) is as follows:

H

TT–D

P .DSCK S VSWGQGC .G
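The string construction described above can be illustrated with a short Python sketch. The function name and the input format (a one-letter sequence plus a parallel list of booleans marking the binding residues) are assumptions made for this example; the sequence used is invented and is not taken from a PDB entry.

```python
def binding_string(residues, binding_flags):
    """Build the simplified string for one amino acid sequence.

    residues      -- one-letter codes of the full sequence
    binding_flags -- booleans, True where the residue binds the ligand
    """
    if len(residues) != len(binding_flags):
        raise ValueError("one flag per residue is required")
    # Binding residues keep their letter; nonbinding residues become '-'.
    chars = [aa if flag else "-" for aa, flag in zip(residues, binding_flags)]
    # Drop the purely nonbinding prefix and postfix.
    return "".join(chars).strip("-")


# Invented example sequence with four binding residues (H, T, T, D).
flags = [False, True, False, True, True, False, False, True, False]
print(binding_string("AHGTTKLDA", flags))  # H-TT--D
```

Because the standard one-letter amino acid codes do not include '-', stripping that character from both ends cannot remove a binding residue.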

Distance function

In order to use a clustering algorithm, we need to define a distance function. The binding sites are represented by all amino acid sequences that participate in the bond with the ligand. Consequently, we need to define the distance of the sequence sets situated in the binding sites. This is accomplished first by defining the distance of two sequences (described in the next section), and then by defining the distance of the sequence sets. The reason for this complexity is the fact that more than one binding sequence can be present in a binding site (see Fig. 1).

Sequence comparison algorithm

To measure the distances of the binding sections of amino acid sequences constructed in this way, we used a modified version of the algorithm employed to calculate the Levenshtein distance (denoted as L). The modifications involved the assignment of different costs to gaps depending on where they were inserted, whereas amino acid mismatches were simply penalized by the value unity.

Fig. 4. Silhouette coefficient dependence on parameter MINPTS when unclustered binding sites are also taken into account at silhouette coefficient determination (gp = 1/10). The color coding is given in Table 2.

Fig. 5. Number of binding sites contained in clusters as a function of the number of clusters allowed to be used (gp = 1/10). The color coding is given in Table 2.

The costs of aligned binding and nonbinding amino acids were as follows:

• The cost of two aligned, different amino acids is unity.

• The cost of aligned, matching amino acids is zero.

Gaps were penalized as follows:

• The insertion of a gap with a length of one unit (one amino acid) costs gp if the gap is aligned with a nonbinding amino acid in the other sequence. If a gap is aligned with a binding amino acid, its cost is unity.

• The insertion of gaps at the end of sequences is only penalized if they are aligned with binding amino acids. Gaps inserted at either end of a sequence have a zero cost if they are aligned with nonbinding amino acids.
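The cost rules above fit the usual dynamic programming scheme for edit distances. The Python sketch below is one possible reading of those rules, not the authors' code: the function name, the default gp = 0.1 and, in particular, the interpretation of an 'end gap' as a gap placed before the first or after the last character of the other sequence are assumptions.

```python
GAP = "-"  # symbol for a nonbinding amino acid in the binding strings


def sequence_distance(s, t, gp=0.1):
    """Modified Levenshtein distance between two binding strings."""
    n, m = len(s), len(t)

    def gap_cost(ch, at_end):
        # A gap aligned with a binding amino acid always costs 1.
        if ch != GAP:
            return 1.0
        # A gap aligned with a nonbinding amino acid is free at either end
        # of the sequence receiving the gap, and costs gp inside it.
        return 0.0 if at_end else gp

    # dp[i][j]: minimum cost of aligning s[:i] with t[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap_cost(s[i - 1], True)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap_cost(t[j - 1], True)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0.0 if s[i - 1] == t[j - 1] else 1.0
            dp[i][j] = min(
                dp[i - 1][j - 1] + mismatch,                # aligned pair
                dp[i - 1][j] + gap_cost(s[i - 1], j == m),  # gap placed in t
                dp[i][j - 1] + gap_cost(t[j - 1], i == n),  # gap placed in s
            )
    return dp[n][m]


print(sequence_distance("HD", "H-D"))   # one internal gap against '-': gp
print(sequence_distance("H", "H--D"))   # end gaps against '-' are free
print(sequence_distance("H-TT", ""))    # number of binding amino acids
```

Under this reading, the distance of a binding string to the empty string equals its number of binding amino acids, because every '-' is then matched by a free end gap.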

It can be shown that the Levenshtein distance (and also our modified version) fulfills the required properties of a metric. Non-negativity and symmetry can be seen directly from the definition (assuming non-negative costs). It is also obvious that a zero distance can only be achieved by comparing the same objects: L(x,y) = 0 if, and only if, x = y (assuming that every compared sequence starts and ends with a binding amino acid). What is left to prove is the triangle inequality: for all strings s, t, r (binding amino acid sequences), L(s,t) ≤ L(s,r) + L(r,t).

In other words, the triangle inequality asserts that changing s to t via r cannot cost less than changing s to t directly. As the Levenshtein distance is, by definition, the minimum possible total cost of operations transforming s into t, and the sequence of operations that transforms s into r and then r into t is also an allowed sequence of operations, it cannot have a lower total cost than L(s,t), as this would contradict the optimality of L(s,t). (What remains to be shown at this point is that the algorithm used indeed calculates the defined distance L.) This reasoning is also applicable to our modified version of the Levenshtein distance; the only difference is that we have a somewhat more sophisticated set of costs for the insertion, deletion and changing of the characters.

We assume that the costs are non-negative, and that any binding amino acid sequence compared with our distance function starts and ends with a binding amino acid. We can now reformulate the costs defined above in terms of 'insert', 'delete' and 'change' operations.

Costs for insertion

• Insertion of '-' at the end of the sequence: 0

• Insertion of '-' between the first and last binding amino acids of the sequence: gp

• Insertion of a one-letter code of a binding amino acid: 1

Costs for deletion

• Deletion of '-' from the end of the sequence: 0

• Deletion of '-' between the first and last unchanged binding amino acids of the sequence: gp

• Deletion of a one-letter code of a binding amino acid: 1

Costs for character change

• For matching characters: 0

• For nonmatching characters: 1

If we want to transform a binding amino acid sequence s into t using the above operations, we cannot expect to obtain a lower total cost by first transforming s to an arbitrary r and then r to t (compared with the direct transformation of s to t). This means that the triangle inequality holds.

Binding site comparison method

The input of the distance function described above is two strings that represent amino acid sequences extracted from binding sites. However, our aim is to measure the distance of the binding sites, not just single amino acid sequences. We have seen in Fig. 1 (section 'Previous work') that multiple amino acid sequences might occur in the immediate vicinity of a binding site. Therefore, we also need to define the distance of the sequence sets representing binding sites.

For this purpose, a complete bipartite graph is defined. This is a graph in which the set of vertices can be divided into two disjoint sets, A and B, such that no edge has both of its endpoints in the same set, |A| = |B| and the number of edges is always |A|·|B|.

• Points of the vertex sets A and B correspond to the amino acid sequences of the first and second binding sites, respectively. If the numbers of amino acid sequences are not equal in the two binding sites, amino acid sequences with zero length are added to the smaller set.

• Weights are assigned to all edges of this graph that correspond to the distance of the two amino acid sequences connected by the edge. By 'distance', we mean the distance defined in the previous section.

The distance of the sequence sets A and B is then defined as the weight of the minimum weight perfect matching [22] in the graph defined above.

It should be noted that, by the definition of the previous section, the distance of an arbitrary residue sequence A to a zero-length sequence B is the binding amino acid count of sequence A.
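The sequence-set distance can be sketched as follows. This is an illustration only: it finds the minimum weight perfect matching by trying every permutation, which is practical only for the handful of chains present at a binding site, and the stand-in `toy_dist` is a hypothetical positional mismatch count used to keep the example self-contained, not the modified Levenshtein distance of the previous section.

```python
from itertools import permutations


def set_distance(seqs_a, seqs_b, seq_dist):
    """Weight of the minimum weight perfect matching between two sequence sets."""
    a, b = list(seqs_a), list(seqs_b)
    size = max(len(a), len(b))
    # Pad the smaller set with zero-length sequences.
    a += [""] * (size - len(a))
    b += [""] * (size - len(b))
    if size == 0:
        return 0.0
    weights = [[seq_dist(x, y) for y in b] for x in a]
    # Brute force over all perfect matchings of the complete bipartite graph.
    return min(
        sum(weights[i][perm[i]] for i in range(size))
        for perm in permutations(range(size))
    )


def toy_dist(x, y):
    """Hypothetical stand-in: positional mismatches after padding with '-'."""
    width = max(len(x), len(y))
    return sum(1 for c, d in zip(x.ljust(width, "-"), y.ljust(width, "-")) if c != d)


print(set_distance(["HD", "KW"], ["KW", "HD"], toy_dist))  # 0: order is irrelevant
print(set_distance(["HD", "KW"], ["HD"], toy_dist))        # 2: KW is matched to ""
```

For larger sets, a polynomial-time assignment solver (for example the Hungarian algorithm) would be the natural replacement for the permutation loop.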

Binding site distance normalization

The expected distance of two randomly generated binding sites will be proportional to the sum of the binding amino acids occurring at the binding sites. The maximum achievable distance is always less than the sum of the binding amino acids.

The distance of two binding sites calculated using the function described in the previous section does not describe the binding site dissimilarity alone. If the distance of two binding sites is three, it may be that they have three binding amino acids each, and hence they may be completely different. However, a distance of three between two binding sites with 30 binding residues each is approximately a 10% difference, and so these binding sites might be almost the same.

Therefore, it is necessary to 'normalize' the distances. We did this by dividing all distances by the sum of the binding amino acids of the two binding sites being compared. The result of this operation yields a value between zero and unity that can also be interpreted as a percentage of the absolute maximum possible distance of the two binding sites.
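As a small worked example of the normalization step, assuming a hypothetical helper named `normalized_distance`:

```python
def normalized_distance(raw_distance, binding_count_a, binding_count_b):
    """Divide a raw binding site distance by the total number of binding amino acids."""
    total = binding_count_a + binding_count_b
    if total == 0:
        raise ValueError("at least one binding amino acid is required")
    return raw_distance / total


# The two situations discussed in the text, both with a raw distance of three.
print(normalized_distance(3, 3, 3))    # three binding amino acids each
print(normalized_distance(3, 30, 30))  # thirty binding residues each
```

With division by the sum of the two counts, the first case gives 0.5 and the second 0.05, so the same raw distance is ten times smaller in relative terms for the larger sites.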

Clustering algorithm

For data clustering, we wanted to use an algorithm that was not biased towards even-sized and regular-shaped clusters.

One algorithm with these properties is DBSCAN [23], which is a density-based algorithm. The density of objects is defined with a radius-like ε parameter and an object-count lower limit (MINPTS): a neighborhood of a certain object 'o' is considered to be dense if there exist at least MINPTS objects within a distance of less than ε. Therefore, MINPTS and ε are input parameters of the algorithm.

Unfortunately, the clustering structure of many real datasets cannot be characterized by global density parameters, as quite different local densities may exist in different areas of the data space. The OPTICS algorithm [24] overcomes these difficulties by ordering the objects contained in the database, creating a so-called 'reachability plot'. The reachability plot is a very clever visualization of high-dimensional clusters. It is basically generated by assigning a value, called the 'reachability distance', to all the objects of the database, whilst going through the database points in a specific order. The reachability distance is given on the y axis, and the objects (i.e. binding site representations) are numbered on the x axis. Clusters correspond to concave regions in the plot. After the creation of the reachability plot, cluster membership assignments can be created by cutting the reachability plot with a horizontal line referred to as the 'cut-off level'.

The reachability plot of a small database consisting of binding sites that contain NAD as the ligand is shown in Fig. 6.
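Cutting a reachability plot with a horizontal line can be sketched as below. This is a simplified illustration rather than the extraction procedure of the OPTICS paper [24]: it uses only the reachability values (not the core distances), treats every value above the cut-off as the start of a new candidate group, and marks groups smaller than `min_pts` as noise. The function name, the -1 noise label and the example values are assumptions.

```python
def cut_reachability_plot(reachability, cutoff, min_pts=2):
    """Assign cluster IDs by cutting a reachability plot at `cutoff`.

    reachability -- reachability distances listed in OPTICS order
                    (float('inf') where the value is undefined)
    Returns cluster IDs in the same order; -1 marks noise.
    """
    groups, current = [], []
    for pos, value in enumerate(reachability):
        if value > cutoff:
            # A bar above the line closes the previous valley
            # and may open a new one.
            groups.append(current)
            current = [pos]
        else:
            current.append(pos)
    groups.append(current)

    labels = [-1] * len(reachability)
    next_id = 0
    for group in groups:
        if len(group) >= min_pts:
            for pos in group:
                labels[pos] = next_id
            next_id += 1
    return labels


# Invented reachability values forming two valleys followed by two isolated points.
plot = [float("inf"), 0.05, 0.1, 0.5, 0.08, 0.07, 0.9, 0.6]
print(cut_reachability_plot(plot, cutoff=0.2))  # [0, 0, 0, 1, 1, 1, -1, -1]
```

Raising `cutoff` merges neighboring valleys into larger clusters and leaves fewer points as noise, which matches the qualitative effect of the cut-off level described in the Results.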

Database parameters and further settings used in the OPTICS algorithm

The parameters used for clustering were as follows: OPTICS MINPTS, 2; OPTICS cut-off level, 20%; gp, 1/10. The OPTICS algorithm was run on a database consisting of 20 967 binding sites. Indistinguishable binding sites, which were assigned exactly the same binding amino acid sequence sets and ligand identifiers, were included only once. (The original database without this kind of redundancy filtering consisted of 27 208 binding sites.) The distance of the binding sites was measured with the distance function described above.

Table 2. Colors assigned to different OPTICS cut-off levels.

Fig. 6. OPTICS reachability plot of a database consisting of 800 binding sites.

Using labeling encoding binding types

Following the suggestion of an anonymous referee, we modified the labeling of the binding residues as follows: using the approach first described in [25], we replaced each amino acid's one-letter abbreviation with one of the following five characters ('A', 'D', 'H', 'C', 'P'), depending on the assumed type of interaction between the given amino acid and the ligand. As several atoms of an amino acid can be located within the 'binding distance' (defined to be more than 1.25 times the sum of the covalent radii belonging to the protein and ligand atoms, respectively, but less than 1.05 times the sum of the van der Waals' radii belonging to these atoms) for a given amino acid, we only considered its closest atom to the ligand. Five types of interaction were used: 'hydrogen-bond acceptor' (denoted by 'A'); 'hydrogen-bond donor' (denoted by 'D'); 'mixed hydrogen-bond donor/acceptor' (denoted by 'H', e.g. hydroxyl groups or side-chain nitrogen atoms in histidine); hydrophobic aliphatic interaction (denoted by 'C'); and aromatic (denoted by 'P'); all are described in [25].
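The 'binding distance' window and the closest-atom rule can be written down directly. The sketch below is illustrative: the function names and the tuple layout are assumptions, and the numeric radii sums in the example are invented values chosen to exercise the rule, not tabulated covalent or van der Waals' radii.

```python
def within_binding_distance(distance, covalent_sum, vdw_sum):
    """True if the distance lies inside the noncovalent binding window."""
    return 1.25 * covalent_sum < distance < 1.05 * vdw_sum


def closest_binding_atom(atoms):
    """Pick the atom of a residue that is closest to the ligand.

    atoms -- list of (atom_name, distance, covalent_sum, vdw_sum) tuples
    Returns the closest tuple inside the binding window, or None.
    """
    candidates = [a for a in atoms if within_binding_distance(a[1], a[2], a[3])]
    return min(candidates, key=lambda a: a[1], default=None)


# Invented distances and radii sums (in angstroms) for three atoms of one residue.
residue_atoms = [
    ("N", 3.0, 1.5, 3.2),  # inside the window
    ("O", 2.5, 1.4, 3.0),  # inside the window and closest
    ("C", 1.6, 1.5, 3.4),  # too close: within covalent range
]
print(closest_binding_atom(residue_atoms))
```

The interaction type of the selected atom would then decide which of the five characters replaces the residue's one-letter code; that mapping is described in [25] and is not reproduced here.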

Using this labeling, we applied the OPTICS algorithm, exactly as described above. The resulting clusters are given in the second set of online supporting figures at http://pitgroup.org/seqclust, in four HTML files, together with a statistical analysis.

It is easy to see that, for the large clusters, the amino acid labeling gives better results.

Conclusions

In this article, we have presented a fast, sequence-based method capable of classifying the binding sites contained in the publicly available PDB. We determined the parameter settings yielding a classification with the best quality (measured by the silhouette coefficient). Our main result was a sequence-based approach, derived from three-dimensional structures, used for binding site clustering (rather than three-dimensional binding site structure), that allows multiple sequences to occur at each binding site. We also evaluated our clustering results with a large, colored diagram (given at the URL http://pitgroup.org/seqclust), where the colors correspond to the EC numbers of the proteins containing the binding sites. As witnessed by the colored diagram, and also by the numerical deviations given in http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt, our method has a clear-cut biological significance. The method presented in this work may help to reveal evolutionarily related binding sites, and may also be used to filter redundancies (i.e. multiply occurring binding sites) from the PDB. A possible step for further research could be the creation of aggregate sequence set profiles for each binding site cluster, generating binding site families similar to the Protein Families Database [26,27].

Acknowledgements

This work was supported by the Hungarian Scientific Research Fund (NK-67867, CNK-77780), and by the Hungarian National Office for Research and Technology (OMFB-01295/2006 and OM-00219/2007).

References

1 Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN & Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28, 235–242.

2 Irwin JJ & Shoichet BK (2005) A free database of commercially available compounds for virtual screening. J Chem Inf Comput Sci 45, 177–182.

3 Szabadka Z & Grolmusz V (2006) Building a structured PDB: the RS-PDB database. In: Proceedings of the 28th IEEE EMBS Annual International Conference, New York, NY, August 30–September 3, 2006, pp. 5755–5758. IEEE Press, New York, NY.

4 Artamonova II, Frishman G, Gelfand MS & Frishman D (2005) Mining sequence annotation databanks for association patterns. Bioinformatics 21, iii49–iii57.

5 Gunasekaran K, Ma B & Nussinov R (2004) Is allostery an intrinsic property of all dynamic proteins? Proteins 57, 433–443.

Fig. 7. A representative of cluster 85 in the online table http://www.pitgroup.org/seqclust/bsites_pseudocenters/bsites_optics_M04_No001.html. Cluster 85 contains PDB entries 3B9J, 1FFU, 1JRP, 1T3Q, 2E3T, 1JRO, 1RM6, 1WY6, 1N5X; all of these contain a bound Fe2/S2 cluster (FeS).

6 Halperin I, Wolfson H & Nussinov R (2003) SiteLight: binding-site prediction using phage display libraries. Protein Sci 12, 1344–1359.

7 Inbar Y, Benyamini H, Nussinov R & Wolfson HJ (2005) Prediction of multimolecular assemblies by multiple docking. J Mol Biol 349, 435–447.

8 Inbar Y, Benyamini H, Nussinov R & Wolfson HJ (2003) Protein structure prediction via combinatorial assembly of sub-structural units. Bioinformatics 19 (Suppl 1), i158–i168.

9 Keskin O, Gursoy A, Ma B & Nussinov R (2007) Towards drugs targeting multiple proteins in a systems biology approach. Curr Top Med Chem 7, 943–951.

10 Keskin O, Nussinov R & Gursoy A (2008) PRISM: protein–protein interaction prediction by structural matching. Methods Mol Biol 484, 505–521.

11 Keskin O & Nussinov R (2007) Similar binding sites and different partners: implications to shared proteins in cellular pathways. Structure 15, 341–354.

12 Tsai CJ, Lin SL, Wolfson HJ & Nussinov R (1996) A dataset of protein–protein interfaces generated with a sequence-order-independent comparison technique. J Mol Biol 260, 604–620.

13 Alesker V, Nussinov R & Wolfson HJ (1996) Detection of non-topological motifs in protein structures. Protein Eng 9, 1103–1119.

14 Azarya-Sprinzak E, Naor D, Wolfson HJ & Nussinov R (1997) Interchanges of spatially neighbouring residues in structurally conserved environments. Protein Eng 10, 1109–1122.

15 Gold ND & Jackson RM (2006) SitesBase: a database for structure-based protein–ligand binding site comparisons. Nucleic Acids Res 34 (Database issue), D231–D234.

16 Kinnings SL & Jackson RM (2009) Binding site similarity analysis for the functional classification of the protein kinase family. J Chem Inf Model 49, 318–329.

17 Kuhn D, Weskamp N, Hüllermeier E & Klebe G (2007) Functional classification of protein kinase binding sites using Cavbase. ChemMedChem 2, 1432–1447.

18 Kinjo AR & Nakamura H (2009) Comprehensive structural classification of ligand-binding motifs in proteins. Structure 17, 234–246.

19 Webb EC (1989) Enzyme nomenclature recommendations 1984. Supplement 2: corrections and additions. Eur J Biochem 179, 489–533.

20 Kaufman L & Rousseeuw P (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, NY.

21 Szabadka Z & Grolmusz V (2007) High throughput processing of the structural information in the protein data bank. J Mol Graph Model 25, 831–836.

22 Lovász L & Plummer MD (1986) Matching Theory, Vol. 121 of North-Holland Mathematics Studies. North-Holland Publishing Co., Amsterdam. Annals of Discrete Mathematics 29.

23 Ester M, Kriegel H-P, Sander J & Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, 1996, pp. 226–231. AAAI Press.

24 Ankerst M, Breunig MM, Kriegel H-P & Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of ACM SIGMOD '99 International Conference on Management of Data, Philadelphia, PA, 1999, pp. 49–60. ACM Press.

25 Schmitt S, Kuhn D & Klebe G (2002) A new method to detect related function among proteins independent of sequence and fold homology. J Mol Biol 323, 387–406.

26 Sonnhammer EL, Eddy SR, Birney E, Bateman A & Durbin R (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 26, 320–322.

27 Sonnhammer EL, Eddy SR & Durbin R (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28, 405–420.
