* Department of Biomedical Engineering and
Bioinformatics Graduate Program, Boston University, Boston, Massachusetts; and
Department of Computational Biology, University of Pittsburgh, Pittsburgh, Pennsylvania
Correspondence: Address reprint requests to C. J. Camacho, E-mail: ccamacho{at}pitt.edu.
| ABSTRACT |
|---|
2 Å). Without any a priori information, a simple analysis of the histogram of distance separations between the set of docked conformations can evaluate the clustering properties of the data set. Clustering is observed when the histogram is bimodal. Data clustering is optimal if one chooses the clustering radius to be the minimum after the first peak of the bimodal distribution. We show that using this optimal radius further improves the discrimination of near-native complex structures.
| INTRODUCTION |
|---|
Most macromolecular interactions require a rapid and highly specific association process. A successful reaction between proteins requires the appropriate encounter of a reactive patch. This is often achieved by long-range electrostatic and/or desolvation forces that bias the approach of the molecules to favor reactive conditions. This steering leads to the clustering of ligands near their binding region, thus speeding up the reactions. Quantitative analyses of the protein binding free energy (6–11) have confirmed this rationale by establishing a direct relationship between clustering and the prediction of protein interactions.
Clustering of bound conformations near the native state has also been observed in protein-small molecule interactions, both experimentally and computationally. X-ray and NMR structures of proteins, determined in aqueous solutions of organic solvents, show that the organic molecules cluster in locations near the active site of enzymes, delineating the binding pockets (12–16; see also Ref. 17 for a cluster analysis of bound water molecules). All other bound molecules are either in crystal contacts, occur only at high ligand concentrations, or are in small pockets that can only accommodate a single molecule rather than an entire cluster. This evidence strongly suggests that clustering low free-energy docked conformations should again be beneficial in identifying the active site in proteins, particularly when considering "consensus sites", i.e., the surface regions in which six or seven different small compounds cluster.
In this article we discuss the application of simple clustering strategies to the above two problems. Considering a free-energy surface with multiple minima, it is obvious that conformations with free energies below a certain threshold will form a number of clusters (see Fig. 1) and that most of these clusters will remain largely invariant for threshold values within a certain free-energy range. Accordingly, many docking and conformational search algorithms use clustering simply to reduce the number of conformations. We emphasize that clustering is much more central to the strategies we describe here, because the search for large clusters is our main tool for finding near-native conformations. We show that clustering provides significant improvements for the prediction of protein complex structures over the traditional re-scoring and ranking of the conformations using some type of potential. More interestingly, we find that the clustering radius is not arbitrary but reflects the dominant terms of the interaction free energy and the size of the main attractors in the binding free-energy landscape. Without any a priori knowledge of the complex structure, we develop a methodology to predict an optimal clustering radius and show that this radius further improves the discrimination of the native state. A rigorous clustering analysis should differentiate between anecdotal (or artificial) clustering and clustering that arises from the biophysical mechanism of the problem at hand.
| METHODS |
|---|
Clustering method
The clustering algorithm, used for ranking and discrimination of protein-protein complex structures, clusters the 4N (default 2000) receptor-ligand filtered structures according to the root-mean-squared deviations (RMSDs) of the ligand atoms that are within 10 Å of any atom on the fixed receptor. We use a simple greedy algorithm to find the structures with the largest number of neighbors within a certain clustering radius RC (the default value is RC = 9 Å). The structure with the highest number of neighbors within the selected cluster radius is considered as the center of the first ranked cluster. The members of this cluster are removed, and we select the next structure with the highest number of neighbors from the remaining ligands, usually generating and analyzing the top 30 clusters. The clustering and docking method has been implemented as a public server named ClusPro at http://structure.bu.edu (18), and the algorithm has been used with success in the first Critical Assessment of PRedicted Interactions (CAPRI) experiment (22,23).
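The greedy procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ClusPro implementation: the function name `greedy_cluster`, the use of a precomputed distance matrix, the inclusive `<=` test for "within the radius", and the lowest-index tie-break are all choices made here for the example.

```python
def greedy_cluster(dist, radius, max_clusters=30):
    """Greedy neighbor-count clustering on a symmetric distance matrix.

    dist[i][j] is the pairwise distance between structures i and j
    (e.g. a ligand RMSD in Å). Returns a list of (center, members)
    tuples in the order the clusters were extracted.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining and len(clusters) < max_clusters:
        best_center, best_members = None, []
        # Scan in index order so ties go to the lowest index (an assumption).
        for i in sorted(remaining):
            # Neighbors are counted only among still-unassigned structures;
            # i is its own neighbor because dist[i][i] == 0.
            members = [j for j in remaining if dist[i][j] <= radius]
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, sorted(best_members)))
        # Remove the whole cluster before looking for the next center.
        remaining -= set(best_members)
    return clusters


# Toy data: six "structures" on a line, with distance = absolute difference.
coords = [0.0, 1.0, 2.0, 10.0, 11.0, 30.0]
dist = [[abs(a - b) for b in coords] for a in coords]
clusters = greedy_cluster(dist, radius=2.5)
```

In this toy case the three clusters come out ordered by size, which mirrors the ranking by cluster population described in the text. The scan is quadratic per extracted cluster, which is adequate for a few thousand structures but is not tuned for speed.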
Pairwise RMSD distribution of docked conformations
To analyze the clustering properties of free-energy filtered docked conformations, we compute the pairwise RMSD histogram of all docked conformations. To understand this simple analysis, consider a set of points in the plane, and construct the histogram of pairwise distances, i.e., plot the number of points that are within a distance r to any other point as a function of r. If the points are randomly distributed, the plot is smooth with no characteristic length scale. However, if the points cluster within a radius R (see, e.g., Fig. 2 A), then the distribution will have a peak, followed by a minimum, at r = R. Fig. 2 B shows the distributions both for a set of random points, and the set of points that cluster with a radius of five units.
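The planar thought experiment can be reproduced with a short script. The sketch below is illustrative only: the cluster centers, point counts, bin width, and the choice of uniform discs are assumptions made for the example and are not taken from Fig. 2.

```python
import math
import random


def pairwise_histogram(points, bin_width, n_bins):
    """Histogram of all pairwise Euclidean distances between 2-D points."""
    counts = [0] * n_bins
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            d = math.dist(points[a], points[b])
            k = int(d // bin_width)
            if k < n_bins:  # distances beyond the last bin are ignored
                counts[k] += 1
    return counts


def clustered_points(centers, per_cluster, radius, rng):
    """Scatter points uniformly inside a disc of the given radius per center."""
    pts = []
    for cx, cy in centers:
        for _ in range(per_cluster):
            r = radius * math.sqrt(rng.random())  # sqrt gives uniform area density
            t = 2 * math.pi * rng.random()
            pts.append((cx + r * math.cos(t), cy + r * math.sin(t)))
    return pts


rng = random.Random(0)
centers = [(20, 20), (80, 20), (50, 80)]  # hypothetical, well separated
clustered = clustered_points(centers, 50, 5.0, rng)
uniform = [(100 * rng.random(), 100 * rng.random()) for _ in range(150)]

hist_clustered = pairwise_histogram(clustered, 2.0, 60)
hist_uniform = pairwise_histogram(uniform, 2.0, 60)
```

For the clustered set, the short-distance bins hold all within-cluster pairs and are followed by a run of empty bins before the between-cluster distances begin, so the histogram is bimodal. The uniformly scattered set fills the same intermediate bins and shows no such gap.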
Clustering parameter 
For protein docking, a typical distribution of the pairwise RMSD of free-energy filtered data sets of docked conformations is shown in Fig. 3. The optimal clustering radius can readily be computed from the distribution as the minimum after the peak at 7 Å. To quantify the quality of the clustering in Fig. 3, we define a clustering parameter that measures the depth of the separation between the two peaks of the distribution. If the parameter equals 0, there is no separation of length scales between clusters; if it equals 1, the separation of length scales and clustering are optimal.
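Locating the minimum after the first peak is easy to automate once the histogram has been smoothed. The following sketch shows one way to do it. The function names and the made-up counts are hypothetical, and `illustrative_depth` is explicitly not the clustering parameter defined in this article: it is a stand-in that merely shares the two limiting values described above (0 for no separation, 1 when the dip between the peaks reaches zero).

```python
def first_peak_and_minimum(hist):
    """Return (peak, minimum): the bin index of the first local maximum and
    of the local minimum that follows it. Assumes hist is already smoothed."""
    i = 0
    while i + 1 < len(hist) and hist[i + 1] >= hist[i]:
        i += 1  # climb to the first peak
    peak = i
    while i + 1 < len(hist) and hist[i + 1] < hist[i]:
        i += 1  # descend to the following minimum
    return peak, i


def illustrative_depth(hist, peak, minimum):
    """Hypothetical normalized depth in [0, 1]; NOT the article's definition."""
    second_peak = max(hist[minimum:])  # highest bin at or beyond the minimum
    reference = min(hist[peak], second_peak)
    if reference <= 0:
        return 0.0
    return 1.0 - hist[minimum] / reference


bin_width = 1.0  # Å, assumed bin size
hist = [2, 8, 15, 9, 4, 3, 6, 12, 20, 14]  # made-up smoothed counts
peak, minimum = first_peak_and_minimum(hist)
radius = (minimum + 0.5) * bin_width  # center of the minimum bin
depth = illustrative_depth(hist, peak, minimum)
```

On raw, noisy counts the simple climb-and-descend scan would stop at the first small fluctuation, so some smoothing or interpolation of the histogram is assumed to have been applied first.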
Clustering of small molecules
As in protein-protein docking, we filter the generated structures, but in the case of small molecules this step also involves clustering. Initially, the two most distant of the minimized probe conformations are designated as hubs for clustering the remaining conformations. A new hub, the most distant probe from the current hubs, is designated when necessary until all of the probes are clustered such that the maximum distance between a cluster's hub and any of its members (the cluster radius) is less than half of the average distance between all existing hubs. Clusters with <20 entries are removed. The clusters are ranked on the basis of their average free energies $\langle \Delta G \rangle_i = \sum_j p_{ij} \Delta G_j$, where $p_{ij} = \exp(-\Delta G_j/RT)/Q_i$ and $Q_i = \sum_j \exp(-\Delta G_j/RT)$ is a partition function obtained by summing the Boltzmann factors over the conformations in the $i$th cluster only. For each probe we retain a number (usually five) of the lowest free-energy clusters. We note that the goal of clustering in this filtering step is simply to reduce the number of isolated minima among the low free-energy conformations retained for further analysis. Clusters of the retained clusters (called consensus sites) are defined as the positions at which the clusters overlap for a number of different probes. The position at which the maximum number of different probes overlap will be referred to as consensus-site number 1, the position with the next highest number of probes consensus-site number 2, and so on (24).
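The Boltzmann-weighted average used to rank the probe clusters can be written out as a small function. This is a minimal sketch under assumptions: free energies are taken to be in kcal/mol, the temperature is set to 298.15 K, and the cluster labels and energy values below are invented for illustration.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol K)
TEMPERATURE = 298.15  # K, an assumed temperature
RT = R_KCAL * TEMPERATURE


def boltzmann_average(energies, rt=RT):
    """Boltzmann-weighted mean of the free energies within one cluster."""
    g_min = min(energies)
    # Shifting by the minimum keeps the exponentials well scaled; the shift
    # cancels between the weights and the partition function.
    weights = [math.exp(-(g - g_min) / rt) for g in energies]
    q = sum(weights)  # partition function restricted to this cluster
    return sum(w * g for w, g in zip(weights, energies)) / q


# Hypothetical clusters of probe free energies (kcal/mol).
probe_clusters = {
    "cluster_a": [-5.0, -4.8, -4.5],
    "cluster_b": [-8.0, -1.0, -0.5],
    "cluster_c": [-3.0, -3.0],
}
ranked = sorted(probe_clusters, key=lambda name: boltzmann_average(probe_clusters[name]))
```

Because the weights decay exponentially with free energy, the average is dominated by the lowest-energy members of each cluster, so a cluster with one very favorable conformation can rank ahead of a cluster with a lower arithmetic mean.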
| RESULTS AND DISCUSSION |
|---|
Comparing free energy versus clustering ranking of docked conformations
To gauge the benefits of clustering alone on the prediction of protein-protein complexes, we have clustered a benchmark set of docked conformations from Weng's lab (28). The conformations, which are publicly available at http://zlab.bu.edu/~rong/dock/benchmark.shtml, were generated using the software ZDOCK (29), which includes a scoring function similar to that developed in Camacho et al. (21). Namely, the data consist of a set of 2000 conformations ranked according to surface complementarity, Coulombic electrostatics, and a desolvation term based on the atomic contact potential.
Table 1 shows a direct comparison of the free-energy-based ranking used by the program ZDOCK and the clustering results of the same set of 2000 docked conformations using ClusPro. The results (published in detail in Ref. 18) show that clustering alone improves the discrimination of near-native structures by a factor of 3 or more. We note that these results consider protein structures to be rigid bodies. Table 1 includes 42 different protein complexes for which there was a relevant number of hits within 10 Å RMSD from the native complex structure. The clustering radius was set to RC = 9 Å. Strikingly, as long as there are at least 10 hits in the set of 2000 structures, ClusPro is always able to rank a near-native structure within the top 50 predictions.
Typical cluster size is 9 Å RMSD for protein-protein interactions
The size of the attractor at the binding site is 9 Å, a distance consistent with the range of the desolvation and electrostatic interactions. The half-value of the desolvation potential is reached at 6 Å atomic separation, and the potential vanishes at distances larger than 7 Å. Similarly, the half-value of long-range Coulombic interactions (distance-dependent dielectric 4r) is reached at 5 Å, slowly decaying to near-zero at 10 Å (9). Fig. 4 A shows that the size of the desolvation free-energy clusters is 6–10 Å, suggesting the presence of relatively broad hydrophobic patches. In all likelihood, desolvation forces will dominate the binding process of these complexes, as in the case of the protease-inhibitor complex 5cha-2ovo. The clustering peak for the electrostatically filtered data in Fig. 4 B has a range between 5 and 7 Å, somewhat smaller than the range for desolvation interactions. This is due to the rapid decay of the distance-dependent electrostatic field, and also to the fact that, for unbound structures, the electrostatic field is noisy. From the analysis of Fig. 4, we conclude that, on average, the optimal clustering distance of desolvation and electrostatic filtered complexes is 9 Å. We note that this is the default clustering radius that we set for the automated docked predictions in the ClusPro server.
Optimal clustering radius improves discrimination of near-native docked conformations
The recurrent bimodal distribution observed in the clustering of the pairwise RMSD of filtered low free-energy docked conformations (Fig. 4) confirms that these conformations indeed aggregate around local minima. Namely, they distribute around the free-energy landscape as in the sketch in Fig. 2 A. Although we have already shown that clustering alone significantly improves the discrimination of near-native structures, we now proceed to demonstrate that one could do even better by extracting from the data set the optimal clustering radius that characterizes the free-energy landscape.
As in the analysis presented in Table 1, we use Weng's benchmark of 2000 docked conformations of 40 independently crystallized receptor and ligand structures to showcase how the optimal clustering radius can improve discrimination of near-native structures. Fig. 5 A shows the pairwise RMSD distribution for five complexes, sampled every 1 Å (see Methods), with the data points interpolated using a cubic spline function. The pairwise RMSD is calculated on 1200 conformations corresponding to the top 300 desolvation complexes and three times as many (900) electrostatic complexes. As suggested by Fig. 1, clustering too many structures (high free energies) would only add noise to the procedure. On the other hand, too few conformations might lead to many small clusters. We have already established that keeping 2000 low free-energy conformations led to a reasonable sampling of the binding pocket (21). In Fig. 5 B, we show that, indeed, the clustering property is maintained by keeping between 1000 and 2000 docked conformations.
In Table 2, we show both the ranking of the best predictions (<10 Å RMSD from the crystal) using the default clustering radius of 9 Å (see details in Ref. 18) and the ranking based on the optimal clustering radius as defined by the minimum of the bimodal distribution. Note that from plots like those in Fig. 5, it is straightforward to compute the radius and the clustering parameter. Clustering using the optimal radius (ranging between 4 and 10 Å) yields better predictions overall than a fixed radius (default 9 Å); the average ranking is 7 and 8.5, respectively (excluding the outlier 2PCC). Moreover, the deeper the separation between the peaks of the bimodal distribution, the better the predictions. In particular, for a clustering parameter larger than 0.4, the ranking of near-native predictions is much better for the optimal than for the default clustering radius, with an average ranking in this case of 4.3 and 8.8, respectively. As the peaks start to overlap and the clustering parameter decreases below 0.4, we observe only a partial improvement.
Clustering of small molecular probes
Table 3 shows the top three consensus sites for 11 enzymes that we have recently mapped. We list the total number of different probes used for the mapping of each protein, the number of clusters at the consensus sites, and the distance of the center of the consensus site from the substrate-binding site of the enzyme. According to this table, the largest consensus site is located at the active site for all enzymes but haloalkane dehalogenase (26). The latter binds very small ligands, such as ethylene dichloride, and the binding site is in the middle of a long and narrow channel. Since some of the probes are bigger than the substrate, they are unable to enter the channel, and we find the largest consensus sites at the two ends of the deep internal channel through which the substrate must travel to reach the active site.
| CONCLUSIONS |
|---|
The most novel aspect of this article is that we show that clustering is not a tool of last resort but in fact it is an intrinsic property of a well sampled free-energy landscape. This is quite evident from the recurrent bimodal distribution observed in the histograms of the pairwise RMSD of docked conformations generated by ClusPro/ZDOCK and computational mapping for protein-protein and protein-small molecule docking, respectively. We show that this distribution, which does not involve any biochemical information, is an important property of a data set that clusters. The clustering radius is consistent with the range of the interactions dominating the binding process, and is well approximated by the minimum between the two peaks of the bimodal distribution. This radius leads to an optimal discrimination of native-like complex structures when the normalized depth between the two peaks of the distribution is larger than 0.4.
Our analysis strongly suggests the existence of many structural neighbors around the native state and other local free-energy minima. This clustering is not the result of the particular computational method employed to sample the landscape, but in fact it is due to the biophysics of protein association.
| ACKNOWLEDGEMENTS |
|---|
The research has been partially supported by grant No. DBI-0213832 from the National Science Foundation, and grants No. GM64700 and No. GM61867 from the National Institutes of Health. C.J.C. has also received support from National Science Foundation grant No. MCB-0444291.
| FOOTNOTES |
|---|
Submitted on December 28, 2004; accepted for publication May 6, 2005.
| REFERENCES |
|---|
2. Shortle, D., K. T. Simons, and D. Baker. 1998. Clustering of low-energy conformations near the native structures of small proteins. Proc. Natl. Acad. Sci. USA. 95:11158–11162.
3. Bystroff, C., V. Thorsson, and D. Baker. 2000. HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J. Mol. Biol. 301:173–190.
4. Karplus, K., C. Barrett, and R. Hughey. 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics. 14:846–856.
5. Prasad, J. C., S. Comeau, S. Vajda, and C. J. Camacho. 2003. Consensus alignment for reliable framework prediction in homology modeling. Bioinformatics. 19:1682–1691.
6. Schreiber, G., and A. R. Fersht. 1996. Rapid, electrostatically assisted association of proteins. Nat. Struct. Biol. 3:427–431.
7. Gabdoulline, R. R., and R. C. Wade. 1997. Simulation of the diffusional association of barnase and barstar. Biophys. J. 72:1917–1929.
8. Camacho, C. J., Z. P. Weng, S. Vajda, and C. DeLisi. 1999. Free energy landscapes of encounter complexes in protein-protein association. Biophys. J. 76:1166–1178.
9. Camacho, C. J., S. R. Kimura, C. DeLisi, and S. Vajda. 2000. Kinetics of desolvation-mediated protein-protein binding. Biophys. J. 78:1094–1105.
10. Camacho, C. J., and S. Vajda. 2001. Protein docking along smooth association pathways. Proc. Natl. Acad. Sci. USA. 98:10636–10641.
11. Fernandez-Recio, J., M. Totrov, and R. Abagyan. 2004. Identification of protein-protein interaction sites from docking energy landscapes. J. Mol. Biol. 43:629–640.
12. Mattos, C., and D. Ringe. 1996. Locating and characterizing binding sites on proteins. Nat. Biotechnol. 14:595–599.
13. Allen, K. N., C. R. Bellamacina, X. Ding, C. J. Jeffery, C. Mattos, G. A. Petsko, and D. Ringe. 1996. An experimental approach to mapping the binding surfaces of crystalline proteins. J. Phys. Chem. 100:2605–2611.
14. English, A. C., S. H. Done, L. S. Caves, C. R. Groom, and R. E. Hubbard. 1999. Locating interaction sites on proteins: the crystal structure of thermolysin soaked in 2% to 100% isopropanol. Proteins. 37:628–640.
15. English, A. C., C. R. Groom, and R. E. Hubbard. 2001. Experimental and computational mapping of the binding surface of a crystalline protein. Protein Eng. 14:47–59.
16. Liepinsh, E., and G. Otting. 1997. Organic solvents identify specific ligand binding sites on protein surfaces. Nat. Biotechnol. 15:264–268.
17. Sanschagrin, P. C., and L. A. Kuhn. 1998. Cluster analysis of consensus water sites in thrombin and trypsin shows conservation between serine proteases and contributions to ligand specificity. Protein Sci. 7:2054–2064.
18. Comeau, S. R., D. Gatchell, S. Vajda, and C. J. Camacho. 2004. ClusPro: an automated docking and discrimination method for the prediction of protein complexes. Bioinformatics. 20:45–50.
19. Ten Eyck, L. F., J. Mandell, V. A. Roberts, and M. E. Pique. 1995. Surveying molecular interactions with DOT. In Proceedings of the 1995 ACM/IEEE Supercomputing Conference. A. Hayes and M. Simmons, editors. ACM Press, New York.
20. Zhang, C., G. Vasmatzis, and J. L. Cornette. 1997. Determination of atomic desolvation energies from the structures of crystallized proteins. J. Mol. Biol. 267:707–726.
21. Camacho, C. J., D. W. Gatchell, S. R. Kimura, and S. Vajda. 2000. Scoring docked conformations generated by rigid-body protein-protein docking. Proteins. 40:525–537.
22. Camacho, C. J., and D. Gatchell. 2003. Successful discrimination of protein interactions. Proteins. 52:92–97.
23. Mendez, R., R. Leplae, L. De Maria, and S. J. Wodak. 2003. Assessment of blind predictions of protein-protein interactions: current status of docking methods. Proteins. 52:51–67.
24. Dennis, S., T. Kortvelyesi, and S. Vajda. 2002. Computational mapping identifies the binding sites of organic solvents on proteins. Proc. Natl. Acad. Sci. USA. 99:4290–4295.
25. Kortvelyesi, T., S. Dennis, M. Silberstein, L. Brown III, and S. Vajda. 2003. Algorithms for computational solvent mapping of proteins. Proteins. 51:340–351.
26. Silberstein, M., S. Dennis, L. Brown III, T. Kortvelyesi, K. Clodfelter, and S. Vajda. 2003. Identification of substrate binding sites in enzymes by computational solvent mapping. J. Mol. Biol. 332:1095–1113.
27. Vakser, I. A. 1995. Protein docking for low-resolution structures. Protein Eng. 8:371–377.
28. Chen, R., J. Mintseris, J. Janin, and Z. Weng. 2003. A protein-protein docking benchmark. Proteins. 52:88–91.
29. Chen, R., and Z. Weng. 2003. A novel shape complementarity scoring function for protein-protein docking. Proteins. 51:397–408.
30. Graille, M., C. Z. Zhou, V. Receveur, B. Collinet, N. Declerck, and H. van Tilbeurgh. 2005. Activation of the LicT transcriptional antiterminator involves a domain swing/lock mechanism provoking massive structural changes. J. Biol. Chem. 280:14780–14789.
31. Takagi, J., Y. Yang, J. H. Liu, J. H. Wang, and T. A. Springer. 2003. Complex between nidogen and laminin fragments reveals a paradigmatic β-propeller interface. Nature. 424:969–974.
32. Rajamani, D., S. Thiel, S. Vajda, and C. J. Camacho. 2004. Anchor residues in protein-protein interactions. Proc. Natl. Acad. Sci. USA. 101:11287–11292.
33. Sheu, S.-H., T. Kaya, D. J. Waxman, and S. Vajda. 2005. Exploring the binding site structure of the PPAR-γ ligand binding domain by computational solvent mapping. Biochemistry. 44:1193–1209.