Gábor Iván1,2, Zoltán Szabadka1,2 and Vince Grolmusz1,2
1 Protein Information Technology Group, Department of Computer Science, Eötvös University, Budapest, Hungary
2 Uratim Ltd., Budapest, Hungary
Introduction
In recent years, the exploration of the human genome has received wide publicity. Although somewhat less emphasized, another important bioinformatics resource is the exponentially growing, publicly available Protein Data Bank (PDB) [1], containing more than 55 000 biological structures at the present time.
The three-dimensional structures of small molecules, e.g. drug molecules, can usually be calculated from their chemical composition. Several databases exist that contain millions of ligands. An example of this is the freely available ZINC database [2], created from catalogues of compound manufacturers. In contrast to ligands, the three-dimensional structures of proteins cannot be calculated easily; therefore, the importance of the rapid growth of the PDB cannot be overestimated.
Most antimicrobial drug molecules act as enzyme inhibitors. Inhibitors need to bind more strongly to the enzyme than the substrate of the enzyme does; consequently, the chemical and geometrical properties of the binding sites are of utmost importance in drug discovery and design.
The PDB contains the three-dimensional structures of more than 55 000 entries. In a separate study [3], we collected, verified and cleaned the list of approximately 27 000 binding sites found in the PDB. During the process of the identification of these binding sites, we filtered out crystallization artifacts and covalently bound small molecules, and also considered broken peptide chains, modified amino acids and incorrectly labeled HET groups. The resulting cleaned, strictly structured RS-PDB database [3] can serve as an input for different data mining algorithms. One such technique of classification is clustering. By the clustering of binding sites it is possible to create binding site similarity classes. These classes can be useful for the classification of protein–ligand interaction.
Keywords
binding sites; clustering; distance; OPTICS;
PDB; sequence
Correspondence
V Grolmusz, Protein Information
Technology Group, Department of
Computer Science, Eötvös University,
Pázmány Péter stny 1/C, H-1117 Budapest,
Hungary and Uratim Ltd., H-1118 Budapest,
Hungary
Fax: +36 1 381 2231
Tel: +36 1 381 2226
E-mail: grolmusz@cs.elte.hu
(Received 6 August 2009, revised 7 January
2010, accepted 12 January 2010)
doi:10.1111/j.1742-4658.2010.07578.x
The Protein Data Bank contains the description of approximately 27 000 protein–ligand binding sites. Most of the ligands at these sites are biologically active small molecules, affecting the biological function of the protein. The classification of their binding sites may lead to relevant results in drug discovery and design. Clusters of similar binding sites were created here by a hybrid, sequence and spatial structure-based approach, using the OPTICS clustering algorithm. A dissimilarity measure was defined: a distance function on the amino acid sequences of the binding sites. All the binding sites in the Protein Data Bank were clustered according to this distance function, and it was found that the clusters characterized well the Enzyme Commission numbers of the entries. The results, carefully color coded by the Enzyme Commission numbers of the proteins, containing the 20 967 binding sites clustered, are available as html files in three parts at http://pitgroup.org/seqclust/
Abbreviations
EC, Enzyme Commission; gp, gap penalty; OPTICS, Ordering Points to Identify the Clustering Structure; PDB, Protein Data Bank.
In this article, we present a fast, sequence-based method for binding site clustering that takes into account amino acid sequences in the close neighborhood of binding sites. Our method is a hybrid, in the sense that it uses the sequence information together with steric data from the PDB in a clearly structured manner.
Previous work
There is a very rich literature describing the identification techniques for biological functions from structural protein information by the application of highly nontrivial mathematical tools [4,5]. Some of these tools have been applied to determine or analyze protein–protein interaction network topology [6–10] or binding sites [6,11]. A considerable amount of work has also been performed to devise polypeptide sequence-order independent structural properties [12–14]. Unlike other binding site clustering solutions in the literature [15–18], we used a hybrid of order-independent methods that analyzes the three-dimensional structure of the binding site together with an order-analysis method; one of its main features is that our order-analysis method is capable of handling multiple polypeptide chains in the same binding site (Fig. 1).
Results and Discussion
Our main result was the OPTICS (Ordering Points to Identify the Clustering Structure)-based clustering of the 20 967 binding sites found. In order to verify the capabilities of the clustering method, we need to compare the clusters found with verified biological functions.
Verification of results: biological relevance
Ideally, proteins of the same or closely related functions ought to be assigned to the same cluster. We considered the Enzyme Commission (EC) number classification of enzymes [19], and color coded the EC numbers such that closely related functions were given similar colors, as provided in http://pitgroup.org/seqclust/bsites_AAcodes/EC_colour.html
The color-coded clusters, together with the ordinal number of the binding site, the PDB ID, the cluster ID and the EC number, can be found in three large html files (Page1, Page2, Page3) under http://pitgroup.org/seqclust/. The clusters correspond to concave regions in the figure.
The deviations of the EC numbers in all the clusters were also computed, and are given in the online table http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt. In most of the clusters, the deviation is zero; the average deviation is 1.71%.
We believe that the validation of the enzymatic functions through EC numbers shows that our clustering method is an adequate solution for binding site clustering and classification.
Parameter settings and examples
We present here, as examples, four binding sites from the largest cluster (element count: 448) (see Fig. 2). All four proteins are blood clotting factors. The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html
It should be noted that the whole cluster is colored blue, and all the members of the cluster (between line numbers 702 and 1149; cluster ID: 28) have EC numbers of the form 3.4.21.X (serine proteases).
From the second largest cluster (element count: 188), three binding sites were visualized (Fig. 3). The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html
It should be noted that the whole cluster is colored deep violet, and almost all members of the cluster (between line numbers 1224 and 1411) have EC numbers 3.4.23.16 (HIV-1 retropepsins). More detailed analysis of the homogeneity of the clusters is given in http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt
Fig. 1. A binding site with four protein chains (PDB ID: 1CT8). Each chain is colored differently.
Fig. 2. Four binding sites (PDB IDs: 1ZPB, 1RXP, 1C5Z, 2BZ6) from the same cluster. The whole cluster is given in the online figure http://pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. Note that the whole cluster is colored blue, and all the members of the cluster (between line numbers 702 and 1149; cluster ID: 28) have EC numbers of the form 3.4.21.X (serine proteases). More analysis on the homogeneity of the clusters is given in http://pitgroup.org/seqclust/EC_deviation.txt.
Fig. 3. Three binding sites from the same cluster (one site from PDB ID 1BDL and two sites from PDB ID 1W5V); these are HIV-1 proteases. The whole cluster is given in the online figure http://www.pitgroup.org/seqclust/bsites_AAcodes/bsites_optics_M02_No001.html. Note that the whole cluster is colored deep violet, and almost all the members of the cluster (between line numbers 1210 and 1435) have EC numbers of the form 3.4.23.16 (HIV-1 retropepsins). More analysis on the homogeneity of the clusters is given in http://www.pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt.
Clustering quality measurement
The quality of clustering depends on several parameters. These include the distance function used to determine the similarity or distance of the objects and parameters of the clustering algorithm. In order to obtain appropriate feedback about the quality of clustering with a given parameter setting, quality metrics need to be defined. For this purpose, we used the ‘silhouette coefficient’ [20]. The advantage of the silhouette coefficient is that it is completely independent of the type of data being clustered; it uses only object distances and cluster membership assignments for its determination. Basically, the silhouette coefficient measures how distinct the clusters are: the ‘silhouette value’ of a cluster is the smallest possible distance between an element of this cluster and an element of the neighboring clusters. The silhouette coefficient of the overall clustering is the average of the silhouette values for the individual clusters. More exactly, the silhouette coefficient is defined as the average of the silhouettes taken for all the objects; for example, the silhouette of object i is defined as $(b_i - a_i)/\max(a_i, b_i)$, where $a_i$ is the average distance of object i to the points of its cluster, and $b_i$ is the minimum of the average distances of object i to other clusters. It should be noted that, typically, $a_i < b_i$, and so the silhouette is equal to $1 - (a_i/b_i)$. Clearly, for good clustering, the typical $a_i$ value is much less than $b_i$; therefore, the silhouettes of the objects and the silhouette coefficient are close to unity.
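The definition above can be illustrated with a minimal Python sketch. It is not the code used in this study; the function name `silhouette_coefficient`, the list-of-lists distance matrix and the convention of assigning silhouette 0 to an object that is alone in its cluster are assumptions made for illustration. Here $a_i$ is averaged over the other members of the object's own cluster.

```python
def silhouette_coefficient(dist, labels):
    """Average silhouette over all objects.

    dist   -- square, symmetric matrix of pairwise distances (list of lists)
    labels -- cluster id of each object, in the same order as dist
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    def mean_distance(i, members):
        return sum(dist[i][j] for j in members) / len(members)

    silhouettes = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        others = [m for lab, m in clusters.items() if lab != label]
        if not own or not others:
            # Assumption: a singleton, or an object in the only cluster, gets 0.
            silhouettes.append(0.0)
            continue
        a_i = mean_distance(i, own)                          # within-cluster average
        b_i = min(mean_distance(i, m) for m in others)       # nearest other cluster
        denom = max(a_i, b_i)
        silhouettes.append((b_i - a_i) / denom if denom > 0 else 0.0)
    return sum(silhouettes) / len(silhouettes)


# Hypothetical toy data: four points on a line forming two tight groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
print(silhouette_coefficient(dist, [0, 0, 1, 1]))
```

For this toy input, the two well-separated groups give a coefficient of roughly 0.90, in line with the remark that good clustering yields values close to unity.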
The data contained in Table 1 are based on empirical measurements. The values of the silhouette coefficient are strongly dependent on the applied distance function. Therefore, it is questionable whether clusters can be classified into rigid quality categories on the basis of the silhouette coefficient value. However, it is undoubtedly useful for comparing the quality of the clusters.
The silhouette coefficient requires the clustering algorithm to assign each binding site to a cluster by definition. Thus, the silhouette coefficient value also shows the amount of noise contained in the database. The clustering algorithm used in this study is the OPTICS algorithm (see later). This algorithm allows some binding sites to be marked as ‘noise’ (thus not assigning them to any cluster). It does not seem reasonable for binding sites that are ‘noise’ to be taken into account twice (once, as the OPTICS algorithm marks them, and once during the calculation of the silhouette coefficient). Therefore, binding sites marked as ‘noise’ were not taken into account when calculating the silhouette coefficient. Nevertheless, for completeness, we show (Fig. 4) how the value of the silhouette coefficient would change if binding sites marked as ‘noise’ were taken into consideration with a silhouette = 0 value.
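The two ways of treating ‘noise’ described above differ only in the denominator of the average. The following Python sketch makes this explicit; the function name and the example values are hypothetical and serve only to illustrate the arithmetic.

```python
def silhouette_with_and_without_noise(clustered_silhouettes, noise_count):
    """Return (coefficient ignoring noise, coefficient with noise counted as 0).

    clustered_silhouettes -- silhouettes of the objects assigned to clusters
    noise_count           -- number of objects marked as 'noise'
    """
    n = len(clustered_silhouettes)
    total = sum(clustered_silhouettes)
    ignoring_noise = total / n if n else 0.0
    # Each noise object contributes a silhouette of 0, so only the divisor grows.
    everything = n + noise_count
    counting_noise = total / everything if everything else 0.0
    return ignoring_noise, counting_noise


# Hypothetical values: two clustered objects and two noise objects.
print(silhouette_with_and_without_noise([0.5, 1.0], 2))  # (0.75, 0.375)
```

Counting noise objects can only lower (or leave unchanged) a non-negative coefficient, which is why the two curves are reported separately.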
Effects of parameters on the quality of clustering and cluster size distribution
Within our binding site model, the distance function and clustering algorithm, three main parameters affected the properties of clustering: OPTICS MINPTS, OPTICS cut-off level and gap penalty (gp) of the distance function. We examined how these parameters affected the quality of clustering measured by the silhouette coefficient. The results are given in Figs 4 and 5.
• Effect of gp. Increasing gp slightly improved the quality of clustering. This is understandable if we consider that the introduction of a less strict gp function automatically decreases the average distance between the clusters.
• Effect of MINPTS. On increasing MINPTS, two main effects were observed. An increase in MINPTS yields better quality clustering. However, it also yields a lot more binding sites classified as ‘noise’. The main cause of the latter effect is that the clusters that exist in the database, but contain fewer points than MINPTS, are not recognized; they are marked as ‘noise’. On the basis of this observation, it can be stated that our binding site database contains numerous small clusters.
• Effect of OPTICS cut-off level. Increasing the cut-off level decreases the quality of clustering, and also the number of binding sites marked as ‘noise’. The application of an extremely high cut-off level places almost all binding sites into the same cluster; the quality of such clustering can by no means be considered as high.
In conclusion, low MINPTS and low cut-off levels yield the best clustering quality (whilst covering 70–80% of the binding sites found in the PDB). In Figs 4 and 5, we represent the dependence of clustering quality on these parameters.
Methods
Binding site representation
As a first step, an exact definition of a binding site must be provided. For easy algorithmic handling, we stored the binding sites found in the PDB in a compact data structure.
The definition of binding sites
A binding site is defined as a set of atom pairs; the first atom of the pair belongs to the protein, and the second atom to the bound ligand, such that their distance is equal to the sum of the van der Waals’ radii, calculated differently for different atom types. That is, only pairs within noncovalent binding distances are included in the list. Binding sites containing covalently bound ligands are not considered in this work, as our main motivation was to review pharmacologically significant binding sites.
A ‘binding amino acid (or residue)’ is an amino acid with at least one of its atoms in the binding atom pair. A ‘binding amino acid sequence’ is an amino acid sequence that contains at least one binding amino acid. Basically, binding sites are represented by storing all the binding amino acid sequences of all the protein chains that are present at the particular binding site.
Table 1. Cluster quality descriptions based on silhouette coefficient values in [20].
Silhouette coefficient    Clustering quality
0.00–0.25                 Clusters cannot be adequately identified; cluster borders are not obvious
0.25–0.50                 Clusters can be identified, but there are numerous unclassifiable points (‘noise’)
0.50–0.70                 Most of the data/points can be classified
0.70–1.00                 Excellent distinguishable clusters
Binding sites were extracted from the RS-PDB database described in [21] and [3]. By using this definition for binding sites, all amino acids from a given amino acid sequence that have at least one atom contained in an atom pair set (describing a binding site) can be identified.
Residue sequence representation
An amino acid sequence refers to sequences consisting of amino acids connected by peptide bonds that are of maximal length (i.e. they cannot be continued with further amino acids on either end).
It should be noted that multiple amino acid sequences might occur in the immediate vicinity of a single binding site, making binding site distance/similarity determination fairly complicated. An example of a binding site with four neighboring polypeptide chains can be seen in Fig. 1.
Binding amino acid sequences were first extracted from the binding sites of the RS-PDB database [3,21] and then simplified as follows.
A string was assigned to each amino acid sequence in a binding site. In this string, residues participating in the bond were indicated by their one-character code; nonbinding amino acids were indicated by ‘-’. As our purpose was to deal with only the binding sections, the prefixes and suffixes consisting of purely nonbinding amino acids (or, in our notation, ‘-’) were deleted. Hence, all the strings constructed in this way start and end with a binding amino acid.
A binding amino acid sequence constructed and transformed in this way (from PDB entry 2BZ6) is as follows:
H
TT–D
P .DSCK S VSWGQGC .G
Distance function
In order to use a clustering algorithm, we need to define a distance function The binding sites are represented by all amino acid sequences that participate in the bond with the ligand Consequently, we need to define the distance of the sequence sets situated in the binding sites This is accom-plished first by defining the distance of two sequences (described in the next section), and then by defining the distance of the sequence sets The reason for this comp-lexity is the fact that more than one binding sequence can
be present in a binding site (see Fig 1)
Sequence comparison algorithm
To measure the distances of the binding sections of amino acid sequences constructed in this way, we used a modified version of the algorithm employed to calculate the Levensh-tein distance (denoted as L) The modifications involved the assignment of different costs to gaps depending on where they were inserted, whereas amino acid mismatches were simply penalized by the value unity
Fig. 4. Silhouette coefficient dependence on parameter MINPTS when unclustered binding sites are also taken into account at silhouette coefficient determination (gp = 1/10). The color coding is given in Table 2.
Fig. 5. Number of binding sites contained in clusters as a function of the number of clusters allowed to be used (gp = 1/10). The color coding is given in Table 2.
The costs of aligned binding and nonbinding amino acids were as follows:
• The cost of two aligned, different amino acids is unity.
• The cost of aligned, matching amino acids is zero.
Gaps were penalized as follows:
• The insertion of a gap with a length of one unit (one amino acid) costs gp if the gap is aligned with a nonbinding amino acid in the other sequence. If a gap is aligned with a binding amino acid, its cost is unity.
• The insertion of gaps at the end of sequences is only penalized if they are aligned with binding amino acids. Gaps inserted at either end of a sequence have a zero cost if they are aligned with nonbinding amino acids.
It can be shown that the Levenshtein distance (and also our modified version) fulfills the required properties for being a metric. Non-negativity and symmetry can be seen directly from the definition (assuming non-negative costs). It is also obvious that a zero distance can only be achieved by comparing the same objects: $L(x,y) = 0$ if, and only if, $x = y$ (assuming that every compared sequence starts and ends with a binding amino acid). What is left to prove is the triangle inequality: for every s, t, r strings (binding amino acid sequences), $L(s,t) \le L(s,r) + L(r,t)$.
In other words, the triangle inequality asserts that changing s to t via r cannot cost less than changing s to t directly. As the Levenshtein distance (by definition) is the minimum possible total cost of operations transforming s into t, and the sequence of operations that transform s into r and then r into t is also an allowed sequence of operations, it cannot have a lower total cost than L(s,t), as this would contradict the optimality of L(s,t). (What we may need to prove at this point is that the algorithm used indeed calculates the defined distance, L.) This reasoning is also applicable to our modified version of the Levenshtein distance; the only difference is that we have a somewhat more sophisticated set of costs for the insertion, deletion and changing of the characters. We assume that the costs are non-negative, and any binding amino acid sequence compared with our distance function starts and ends with a binding amino acid. We can now reformulate the above defined costs to be used with ‘insert’, ‘delete’, ‘change’ operations.
Costs for insertion
• Insertion of ‘-’ to the end of the sequence: 0
• Insertion of ‘-’ between the first and last binding amino acids of the sequence: gp
• Insertion of a one-letter code of a binding amino acid: 1
Costs for deletion
• Deletion of ‘-’ from the end of the sequence: 0
• Deletion of ‘-’ between the first and last unchanged binding amino acids of the sequence: gp
• Deletion of a one-letter code of a binding amino acid: 1
Costs for character change
• For matching characters: 0
• For nonmatching characters: 1
If we want to transform a binding amino acid sequence s into t using the above operations, we cannot expect to obtain a lower total cost by first transforming s to an arbitrary r and then r to t (compared with the direct transformation of s to t). This means that the triangle inequality holds.
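The cost scheme above can be turned into a dynamic programming routine in the style of the classical Levenshtein algorithm. The Python sketch below is one possible reading of these costs, not the implementation used in this study; the function names are hypothetical, and the rule that a gap is ‘at the end’ when it lies before the first or after the last character of the other sequence is an assumption.

```python
GAP = "-"  # symbol for a nonbinding amino acid


def sequence_distance(s, t, gp=0.1):
    """Modified Levenshtein distance between two binding strings."""
    n, m = len(s), len(t)

    def gap_cost(char, position, other_length):
        # Cost of aligning `char` with a gap inserted into the other sequence
        # after `position` of its characters.
        if char != GAP:
            return 1.0            # gap against a binding amino acid
        if position == 0 or position == other_length:
            return 0.0            # end gap against a nonbinding amino acid
        return gp                 # internal gap against a nonbinding amino acid

    # d[i][j] = minimum cost of aligning s[:i] with t[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + gap_cost(s[i - 1], 0, m)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + gap_cost(t[j - 1], 0, n)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            change = d[i - 1][j - 1] + (0.0 if s[i - 1] == t[j - 1] else 1.0)
            delete = d[i - 1][j] + gap_cost(s[i - 1], j, m)
            insert = d[i][j - 1] + gap_cost(t[j - 1], i, n)
            d[i][j] = min(change, delete, insert)
    return d[n][m]


print(sequence_distance("HT", "HT"))    # 0.0
print(sequence_distance("H-T", "HT"))   # 0.1 (one internal nonbinding residue)
print(sequence_distance("H-T", ""))     # 2.0 (two binding amino acids)
```

The last call illustrates that the distance of a sequence to a zero-length sequence equals its binding amino acid count under this reading of the costs.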
Binding site comparison method
The input of the distance function described above is two strings that represent amino acid sequences extracted from binding sites. However, our aim is to measure the distance of the binding sites, not just single amino acid sequences. We have seen in the section ‘Previous work’ (Fig. 1) that multiple amino acid sequences might occur in the immediate vicinity of a binding site. Therefore, we also need to define the distance of the sequence sets representing binding sites.
For this purpose, a complete bipartite graph is defined. This is a graph in which the set of vertices can be divided into two disjoint sets, A and B, such that no edge has both of its endpoints in the same set, |A| = |B| and the number of edges is always |A|·|B|.
• Points of the vertex sets A and B correspond to the amino acid sequences of the first and second binding sites, respectively. If the numbers of amino acid sequences are not equal in the two binding sites, amino acid sequences with zero length are added to the smaller set.
• Weights are assigned to all edges of this graph that correspond to the distance of the two amino acid sequences connected by the edge. By ‘distance’, we mean the distance defined in the previous section.
The distance of the sequence sets A and B is then defined as the weight of the minimum weight perfect matching [22] in the graph defined above.
It should be noted that, by the definition of the previous section, the distance of an arbitrary residue sequence A to a zero-length sequence B is the binding amino acid count of sequence A.
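The matching step can be sketched as follows. For clarity, this Python illustration enumerates all perfect matchings by brute force, which is only practical because a binding site contains few chains; it is not the matching algorithm of [22]. The names `binding_site_distance` and `toy_distance` are hypothetical, and `toy_distance` is a placeholder that merely stands in for the sequence distance of the previous section.

```python
from itertools import permutations, zip_longest


def binding_site_distance(site_a, site_b, seq_dist):
    """Weight of the minimum weight perfect matching between two sequence sets.

    site_a, site_b -- lists of binding strings, one per chain at the site
    seq_dist       -- function returning the distance of two binding strings
    """
    a, b = list(site_a), list(site_b)
    size = max(len(a), len(b))
    # Pad the smaller set with zero-length sequences so that |A| = |B|.
    a += [""] * (size - len(a))
    b += [""] * (size - len(b))
    # Every permutation of b is one perfect matching of the bipartite graph.
    return min(
        sum(seq_dist(x, y) for x, y in zip(a, matched))
        for matched in permutations(b)
    )


def toy_distance(x, y):
    # Placeholder only: counts differing positions, padding the shorter string.
    return sum(1 for p, q in zip_longest(x, y) if p != q)


print(binding_site_distance(["HT", "G"], ["G", "HT", "S"], toy_distance))  # 1
```

For larger sets, a polynomial-time assignment solver would be the natural replacement for the enumeration.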
Binding site distance normalization
The expected distance of two randomly generated binding sites will be proportional to the sum of the numbers of binding amino acids occurring at the binding sites. The maximum achievable distance is always less than this sum.
The distance of two binding sites calculated using the function described in the previous section does not describe the binding site dissimilarity alone. If the distance of two binding sites is three, it may be that they have three binding amino acids each, and hence they may be completely different. However, a distance of three between two binding sites with 30 binding residues each is approximately a 10% difference, and so these binding sites might be almost the same.
Therefore, it is necessary to ‘normalize’ the distances. We did this by dividing all distances by the sum of the numbers of binding amino acids of the two binding sites being compared. The result of this operation yields a value between zero and unity that can also be interpreted as a percentage of the absolute maximum possible distance of the two binding sites.
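As a small worked example of the normalization, the Python sketch below applies it to the two cases discussed above. The function name is hypothetical, and returning 0 for two empty sites is an assumption. Note that, because the divisor is the sum over both sites, the 30-residue case evaluates to 0.05, i.e. half of the per-site figure of about 10% mentioned in the text.

```python
def normalized_distance(raw_distance, binding_count_a, binding_count_b):
    """Divide a binding site distance by the total number of binding residues."""
    total = binding_count_a + binding_count_b
    if total == 0:
        return 0.0  # assumption: two empty sites are treated as identical
    return raw_distance / total


print(normalized_distance(3, 3, 3))    # 0.5
print(normalized_distance(3, 30, 30))  # 0.05
```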
Clustering algorithm
For data clustering, we wanted to use an algorithm that was not biased towards even-sized and regular-shaped clusters.
One algorithm with this property is DBSCAN [23], which is a density-based algorithm. The density of objects is defined with a radius-like ε parameter and an object-count lower limit (MINPTS): a neighborhood of a certain object ‘o’ is considered to be dense if there exist at least MINPTS objects within a distance of less than ε. Therefore, MINPTS and ε are input parameters of the algorithm.
Unfortunately, the clustering structure of many real datasets cannot be characterized by global density parameters, as quite different local densities may exist in different areas of the data space. The OPTICS algorithm [24] overcomes these difficulties by ordering the objects contained in the database, creating a so-called ‘reachability plot’. The reachability plot is a very clever visualization of high-dimensional clusters. It is basically generated by assigning a value, called the ‘reachability distance’, to all the objects of the database, whilst going through the database points in a specific order. The reachability distance is given on the y axis, and the objects (i.e. binding site representations) are numbered on the x axis. Clusters correspond to concave regions in the plot. After the creation of the reachability plot, cluster membership assignments can be created by cutting the reachability plot with a horizontal line referred to as the ‘cut-off level’.
The reachability plot of a small database consisting of binding sites that contain NAD as the ligand is shown in Fig. 6.
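The cutting of a reachability plot with a horizontal line can be illustrated with the following simplified Python sketch. It assumes that the reachability distances are already available in OPTICS order, and it ignores core distances, which the extraction procedure of [24] also takes into account; the function name and the example values are hypothetical.

```python
import math


def clusters_from_reachability(reachability, cutoff, minpts=2):
    """Assign cluster ids by cutting a reachability plot at `cutoff`.

    reachability -- reachability distances in OPTICS order (math.inf = undefined)
    Returns a list of cluster ids, with None for objects treated as noise.
    """
    groups, current = [], []
    for index, value in enumerate(reachability):
        if value > cutoff:
            # A bar above the line ends the current valley; the object itself
            # may be the first member of the next valley.
            if current:
                groups.append(current)
            current = [index]
        else:
            current.append(index)
    if current:
        groups.append(current)

    labels = [None] * len(reachability)
    cluster_id = 0
    for group in groups:
        if len(group) >= minpts:      # smaller groups stay marked as noise
            for index in group:
                labels[index] = cluster_id
            cluster_id += 1
    return labels


plot = [math.inf, 0.1, 0.1, 0.9, 0.8, 0.1, 0.1, 0.7]
print(clusters_from_reachability(plot, cutoff=0.2))
# [0, 0, 0, None, 1, 1, 1, None]
```

Raising `cutoff` merges neighboring valleys and reduces the number of noise objects, which mirrors the effect of the cut-off level reported in the Results.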
Database parameters and further settings used in the OPTICS algorithm
The parameters used for clustering were as follows: OPTICS MINPTS, 2; OPTICS cut-off level, 20%; gp, 1/10. The OPTICS algorithm was run on a database consisting of 20 967 binding sites. Indistinguishable binding sites, which were assigned exactly to the same binding amino acid sequence sets and ligand identifiers, were contained only once. (The original database without this kind of redundancy filtering consisted of 27 208 binding sites.) The distance of the binding sites was measured with the distance function described above.
Table 2. Colors assigned to different OPTICS cut-off levels.
Fig. 6. OPTICS reachability plot of a database consisting of 800 binding sites.
Using labeling encoding binding types
Following the suggestion of an anonymous referee, we modified the labeling of the binding residues as follows: using the approach first described in [25], we replaced each amino acid’s one-letter abbreviation with one of the following five characters (‘A’, ‘D’, ‘H’, ‘C’, ‘P’) depending on the assumed type of interaction between the given amino acid and the ligand. As several atoms of an amino acid can be located within the ‘binding distance’ (defined to be more than 1.25 times the sum of covalent radii belonging to the protein and ligand atoms, respectively, but < 1.05 times the sum of the van der Waals’ radii belonging to these atoms) for a given amino acid, we only considered its closest atom to the ligand. Five types of interaction were used: ‘hydrogen-bond acceptor’ (denoted by ‘A’); ‘hydrogen-bond donor’ (denoted by ‘D’); ‘mixed hydrogen-bond donor/acceptor’ (denoted by ‘H’, e.g. hydroxyl groups or side-chain nitrogen atoms in histidine); hydrophobic aliphatic interaction (denoted by ‘C’); and aromatic (denoted by ‘P’); all are described in [25].
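The ‘binding distance’ window defined above can be checked with a short Python sketch. The radii in the two tables are commonly quoted illustrative values and are an assumption here; the radii actually used for the RS-PDB database are not stated in this article. The function name and the coordinates are also hypothetical.

```python
import math

# Illustrative radii in angstroms (assumed values, not taken from this study).
VDW_RADIUS = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}
COVALENT_RADIUS = {"C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}


def is_binding_pair(protein_element, protein_xyz, ligand_element, ligand_xyz):
    """True if the two atoms lie within the noncovalent binding distance."""
    distance = math.dist(protein_xyz, ligand_xyz)
    lower = 1.25 * (COVALENT_RADIUS[protein_element] + COVALENT_RADIUS[ligand_element])
    upper = 1.05 * (VDW_RADIUS[protein_element] + VDW_RADIUS[ligand_element])
    # Closer than `lower` suggests a covalent bond; farther than `upper` is no contact.
    return lower < distance < upper


# A protein nitrogen 2.9 angstroms from a ligand oxygen (typical hydrogen-bond range).
print(is_binding_pair("N", (0.0, 0.0, 0.0), "O", (2.9, 0.0, 0.0)))  # True
# Two carbons 1.5 angstroms apart are treated as covalently bound.
print(is_binding_pair("C", (0.0, 0.0, 0.0), "C", (1.5, 0.0, 0.0)))  # False
```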
Using this labeling, we applied the OPTICS algorithm, exactly as described above. The resulting clusters are given in the second set of online supporting figures at http://pitgroup.org/seqclust, in four html files, together with a statistical analysis.
It is easy to see that, for the large clusters, the amino acid labeling gives better results.
Conclusions
In this article, we have presented a fast, sequence-based method capable of classifying the binding sites contained in the publicly available PDB. We determined the parameter settings yielding a classification with the best quality (measured by the silhouette coefficient). Our main result was a sequence-based approach, derived from three-dimensional structures, used for binding site clustering (rather than three-dimensional binding site structure), that allows multiple sequences to occur at each binding site. We also evaluated our clustering results with a large, colored diagram (given at the URL http://pitgroup.org/seqclust), where the colors correspond to the EC numbers of the proteins containing the binding sites. As witnessed by the colored diagram, and also by the numerical deviations given in http://pitgroup.org/seqclust/bsites_AAcodes/EC_deviation.txt, our method has a clear-cut biological significance. The method presented in this work may help to reveal evolutionarily related binding sites, and may also be used to filter redundancies (i.e. multiply occurring binding sites) from the PDB. A possible step for further research could be the creation of aggregate sequence set profiles for each binding site cluster, generating binding site families similar to the Protein Families Database [26,27].
Acknowledgements
This work was supported by the Hungarian Scientific Research Fund (NK-67867, CNK-77780), and by the Hungarian National Office for Research and Technology (OMFB-01295/2006 and OM-00219/2007).
Fig. 7. A representative of cluster 85 in the online table http://www.pitgroup.org/seqclust/bsites_pseudocenters/bsites_optics_M04_No001.html. Cluster 85 contains PDB entries 3B9J, 1FFU, 1JRP, 1T3Q, 2E3T, 1JRO, 1RM6, 1WY6, 1N5X; all of these contain an Fe2/S2 cluster (FeS) bond.
References
1 Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN & Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28, 235–242.
2 Irwin JJ & Shoichet BK (2005) A free database of commercially available compounds for virtual screening. J Chem Inf Comput Sci 45, 177–182.
3 Szabadka Z & Grolmusz V (2006) Building a structured PDB: the RS-PDB database. In: Proceedings of the 28th IEEE EMBS Annual International Conference, New York, NY, August 30–September 3, 2006, pp. 5755–5758. IEEE Press, New York, NY.
4 Artamonova II, Frishman G, Gelfand MS & Frishman D (2005) Mining sequence annotation databanks for association patterns. Bioinformatics 21, iii49–iii57.
5 Gunasekaran K, Ma B & Nussinov R (2004) Is allostery an intrinsic property of all dynamic proteins? Proteins 57, 433–443.
6 Halperin I, Wolfson H & Nussinov R (2003) Sitelight: binding-site prediction using phage display libraries. Protein Sci 12, 1344–1359.
7 Inbar Y, Benyamini H, Nussinov R & Wolfson HJ (2005) Prediction of multimolecular assemblies by multiple docking. J Mol Biol 349, 435–447.
8 Inbar Y, Benyamini H, Nussinov R & Wolfson HJ (2003) Protein structure prediction via combinatorial assembly of sub-structural units. Bioinformatics 19 (Suppl 1), i158–i168.
9 Keskin O, Gursoy A, Ma B & Nussinov R (2007) Towards drugs targeting multiple proteins in a systems biology approach. Curr Top Med Chem 7, 943–951.
10 Keskin O, Nussinov R & Gursoy A (2008) Prism: protein–protein interaction prediction by structural matching. Methods Mol Biol 484, 505–521.
11 Keskin O & Nussinov R (2007) Similar binding sites and different partners: implications to shared proteins in cellular pathways. Structure 15, 341–354.
12 Tsai CJ, Lin SL, Wolfson HJ & Nussinov R (1996) A dataset of protein–protein interfaces generated with a sequence-order-independent comparison technique. J Mol Biol 260, 604–620.
13 Alesker V, Nussinov R & Wolfson HJ (1996) Detection of non-topological motifs in protein structures. Protein Eng 9, 1103–1119.
14 Azarya-Sprinzak E, Naor D, Wolfson HJ & Nussinov R (1997) Interchanges of spatially neighbouring residues in structurally conserved environments. Protein Eng 10, 1109–1122.
15 Gold ND & Jackson RM (2006) Sitesbase: a database for structure-based protein-ligand binding site comparisons. Nucleic Acids Res 34 (Database issue), D231–D234.
16 Kinnings SL & Jackson RM (2009) Binding site similarity analysis for the functional classification of the protein kinase family. J Chem Inf Model 49, 318–329.
17 Kuhn D, Weskamp N, Hüllermeier E & Klebe G (2007) Functional classification of protein kinase binding sites using cavbase. ChemMedChem 2, 1432–1447.
18 Kinjo AR & Nakamura H (2009) Comprehensive structural classification of ligand-binding motifs in proteins. Structure 17, 234–246.
19 Webb EC (1989) Enzyme nomenclature. Recommendations 1984. Supplement 2: corrections and additions. Eur J Biochem 179, 489–533.
20 Kaufman L & Rousseeuw P (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, NY.
21 Szabadka Z & Grolmusz V (2007) High throughput processing of the structural information in the protein data bank. J Mol Graph Model 25, 831–836.
22 Lovász L & Plummer MD (1986) Matching Theory, Vol. 121 of North-Holland Mathematics Studies. North-Holland Publishing Co., Amsterdam. Annals of Discrete Mathematics 29.
23 Ester M, Kriegel H-P, Sander J & Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, 1996, pp. 226–231. AAAI Press.
24 Ankerst M, Breunig MM, Kriegel H-P & Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of ACM SIGMOD ’99 International Conference on Management of Data, Philadelphia, PA, 1999, pp. 49–60. ACM Press.
25 Schmitt S, Kuhn D & Klebe G (2002) A new method to detect related function among proteins independent of sequence and fold homology. J Mol Biol 323, 387–406.
26 Sonnhammer EL, Eddy SR, Birney E, Bateman A & Durbin R (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 26, 320–322.
27 Sonnhammer EL, Eddy SR & Durbin R (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28, 405–420.