help button home button Biophys. J.
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

Originally published as Biophys J. BioFAST on March 13, 2006.
doi:10.1529/biophysj.105.079434
This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow All Versions of this Article:
biophysj.105.079434v1
90/11/4010    most recent
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Dehouck, Y.
Right arrow Articles by Rooman, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Dehouck, Y.
Right arrow Articles by Rooman, M.
Biophysical Journal 90:4010-4017 (2006)
© 2006 The Biophysical Society

A New Generation of Statistical Potentials for Proteins

Y. Dehouck, D. Gilis and M. Rooman

Unité de Bioinformatique génomique et structurale, Université Libre de Bruxelles, 1050 Brussels, Belgium

Correspondence: Address reprint requests to Yves Dehouck, E-mail: ydehouck{at}ulb.ac.be.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We propose a novel and flexible derivation scheme of statistical, database-derived, potentials, which allows one to take simultaneously into account specific correlations between several sequence and structure descriptors. This scheme leads to the decomposition of the total folding free energy of a protein into a sum of lower order terms, thereby giving the possibility to analyze independently each contribution and clarify its significance and importance, to avoid overcounting certain contributions, and to deal more efficiently with the limited size of the database. In addition, this derivation scheme appears as quite general, for many previously developed potentials can be expressed as particular cases of our formalism. We use this formalism as a framework to generate different residue-based energy functions, whose performances are assessed on the basis of their ability to discriminate genuine proteins from decoy models. The optimal potential is generated as a combination of several coupling terms, measuring correlations between residue types, backbone torsion angles, solvent accessibilities, relative positions along the sequence, and interresidue distances. This potential outperforms all tested residue-based potentials, and even several atom-based potentials. Its incorporation in algorithms aiming at predicting protein structure and stability should therefore substantially improve their performances.


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Somewhere between the time-consuming semiempirical force fields (1Go–3Go) and the oversimplified Go-like potentials (4Go–7Go), statistical energy functions, extracted from databases of known protein structures, are prime tools for the in silico study of proteins (8Go–12Go). They present the advantage of being easily adaptable to any level of simplification of protein representation, and have been successfully used in many applications, ranging from structure prediction to sequence design. Though there has been a considerable increase in the number of resolved protein structures since the first approaches of this type were described, no major improvement in predictive power could be drawn from the larger size of the databases (13Go–15Go). Indeed, increasing the database size beyond a few hundred proteins appears to yield no significant advantage in the case of the simple potentials that are still very commonly used nowadays, which are based on a limited number of sequence and structure descriptors.

In the last few years, a number of more complex potentials have been designed with the aim of exploiting more efficiently the large amount of available structural data and dealing with couplings between different structural features. Among those, let us cite distance or contact potentials that depend on the solvent accessibility of the residues (16Go,17Go), on the conformation of their main chain (18Go), or on the relative orientation of their side chains (19Go–21Go). On the other hand, potentials describing the propensities of the different amino acid types to adopt certain backbone conformations, which simultaneously take into account the nature and/or conformation of several neighboring residues, have also been developed (16Go,22Go,23Go). A major difficulty that frequently arises in such studies is related to the fact that the number of proteins in the database becomes rapidly too small when increasing the complexity of a potential. One faces a delicate choice: the use of a more complex potential can be quite advantageous for common values of the sequence and structure descriptors (e.g., Ala-Ala pair associated with {alpha}-helical conformations), and pretty disastrous in other cases (e.g., Trp-Trp pair associated with some rare turn conformations). The usual answer to this dilemma consists in drastic limitations of the description of the conformational space, for example by restricting the backbone to three possible conformations, the solvent accessibility to two different bins, or by deriving contact potentials rather than distance-dependent ones.

We present here a general derivation scheme that allows one to bypass this issue, and to build statistical energy functions based simultaneously on several sequence and structure descriptors without altering the efficiency of the elementary contributions when the values taken by these descriptors are not frequent enough in the database of known protein structures. We apply our procedure to generate statistical potentials based on the correlations among amino acid types, backbone conformations, and solvent accessibilities of residues close to each other in the sequence and/or in space. The resulting energy function displays a strongly improved ability to discriminate genuine proteins from decoy models. All potentials presented in this article are freely available at http://babylone.ulb.ac.be/StatPots.


    METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Sequence and structure descriptors
The backbone conformation of the residue at position i, ti, is defined by the values of the torsion angles ({phi},{varphi},{omega}). These values are grouped in seven domains corresponding to distinct regions on the Ramachandran map (22Go,24Go). The solvent accessibility of the residue at position i, ai, is defined as the ratio of its solvent-accessible surface in the considered structure (as computed by DSSP (25Go)) and in an extended tripeptide Gly-X-Gly (26Go). These values are grouped in five discrete domains: ai ≤ 5%, 5% < ai ≤ 15%, 15% < ai ≤ 30%, 30% < ai ≤ 50%, and 50% < ai . The interresidue distance dij is computed between the average side-chain centroids, noted Cµ, of the residues at positions i and j. The Cµ corresponds to the geometric center of heavy side-chain atoms of a given amino acid type, averaged over all side-chain conformations in a data set of known structures (16Go). The distances dij between 3 Å and 8 Å are grouped into 25 bins of 0.2 Å width; two additional bins describe distances smaller than 3 Å and larger than 8 Å, respectively. Finally, the sequence descriptor si corresponds to the nature (1 of 20 amino acids) of the residue at position i.

Protein structure data set
An initial set of 1522 high-resolution (≤2 Å) x-ray structures of protein chains with <20% pairwise sequence identity was extracted in October 2003 from the website "Culling the PDB by Resolution and Sequence Identity" (27Go) (http://dunbrack.fccc.edu/Guoli/pisces_download.php). All structures containing more than 5% heteroatoms or nonnatural residues were excluded. This led to a final set of 1403 protein chains. Furthermore, to ensure that the data set used to derive the potentials includes the proper, active, quaternary conformations of the selected proteins, the coordinates were taken from the "Protein Quaternary Structure" server (28Go) (http://pqs.ebi.ac.uk).

Correction for sparse data
All database-derived potentials and coupling terms presented here can be generically written as {Delta}W = –kT ln (nobs/nexp), where nobs is the number of observations of a given association of sequence and structure descriptors in the data set of known protein structures, and nexp is the corresponding number expected in a reference state. To deal with the limited size of the data set, a correction for sparse data (29Go) is applied: (nobs/nexp) -> (({sigma} + nobs)/({sigma} + nexp)), where {sigma} is an adjustable parameter, taken equal to 20 for local potentials, and 10 for distance potentials (see Results for the definition of local and distance potentials). This correction ensures that the potentials tend to 0 when the number of observations in the data set is too small.

Decoy sets
To assess the performances of the potentials, we evaluate their ability of singling out correct sequence-structure matches out of sets of decoy models. Three groups of decoys sets are considered. The first, noted Formula, includes 25 proteins (30Go,31Go), each associated with hundreds of alternative structures generated by different modeling methods (4state_reduced (32Go): 1ctf, 1r69, 1sn3, 2cro, 4pti and 4rxn ; fisa (33Go): 1fc2-c, 1hdd-c, 2cro ; fisa_casp3 (33Go): 1bg8-a, 1bl0, 1jwe ; lattice-ssfit (31Go): 1ctf, 1dkt-a, 1fca, 1nlk, 1pgb, 1trl-a ; lmds (34Go): 1ctf, 1dtk, 1fc2-c, 1igd, 1shf-a, 2cro, 2ovo). The second group, noted Formula includes 25 proteins (35Go), each associated with ~2000 alternative structures generated by the Rosetta structure prediction method (1a32, 1ail, 1am3, 1cc5, 1cei, 1hyp, 1flb, 1mzm, 1r69, 1utg, 1ctf, 1dol, 1orc, 1pgx, 1ptq, 1tif, 1vcc, 2fxb, 5icb, 1bq9, 1csp, 1msi, 1tuc, 1vif, 5pti). The third group, noted Dseq, includes 50 proteins (1ptq, 1d0d, 2igd, 1g2b, 1orc, 1hz6, 1i27, 1hoe, 1luz, 1ugi, 1aba, 1cy5, 1lpl, 1mk0, 1h7m, 1bm8, 1l8r, 1lyq, 1o13, 1gmx, 1cew, 1hxi, 1nyc, 1by2, 1lsl, 1o7i, 1gnu, 1fc3, 1mai, 1dzo, 1lwb, 1huf, 1nwz, 3nul, 1cuo, 1jf8, 1p0z, 1mdc, 1vsr, 1gmi, 1eca, 1j9b, 1kmt, 1mzg, 1oz9, 1h6h, 1l2h, 1srv, 2hbg, 1amx), each associated with 1000 decoys obtained by maintaining the structure and randomizing the amino acid sequence with fixed amino acid composition. To render the test more challenging, only a fraction of the sequence was modified. This fraction was chosen randomly between 25% and 100%, independently for each decoy.

To avoid any bias toward the native structure or wild-type sequence that might result from the presence of similar proteins in the data set, an extended jackknife procedure is applied: we remove the target protein, as well as all proteins sharing more than 20% sequence identity with the target, from the database before deriving the potentials.

Performance measures
We use five different measures to evaluate the ability of the potentials to discriminate the native structure from the decoys:

  1. The success rate S1 is the percentage of proteins, in each group of decoys, for which the free energy of the correct sequence-structure association is smaller than the free energies computed for all decoys.
  2. <Z> is the average Z-score, over all proteins in a group of decoys. The Z-score is defined as Z = ({Delta}Wc<{Delta}W>)/{sigma}{Delta}W, where {Delta}Wc is the free energy of the correct sequence-structure association, <{Delta}W> is the average free energy of all sequence-structure associations, and {sigma}{Delta}W is the associated standard deviation. Energy functions discriminating well the genuine protein from the decoys are characterized by a very negative Z-score.
  3. S–1 is the percentage of proteins with a Z-score lower than –1 (19Go). This measure may be more useful than S1 when the test is challenging, for instance when the decoys and the native structures or sequences are very similar.
  4. <Zx> evaluates the ability of the potentials to select the decoys that are closest from the native among the complete decoy set. Zx is defined as (<{Delta}W>5%<{Delta}W>)/{sigma}{Delta}W, where <{Delta}W>5% is the average free energy computed on a subset including 5% of the decoys (19Go). This subset contains the decoys with the lowest root mean-square deviation from the native structure, or the decoys with the largest sequence identity with the wild-type in the case of decoys generated by sequence randomization.
  5. Formula is equal to the percentage of proteins for which Zx is lower than –1 (19Go).


    RESULTS
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
General derivation scheme
A form commonly used for statistical potentials derived from a set of protein structures is

Formula 1(1)
where c1 is an amino acid type and c2 a structure descriptor (e.g., a torsion angle or solvent accessibility domain) of the same or a neighboring residue, and P are their relative frequencies of occurrence in the structure data set. Similarly, considering two sequence descriptors c1 and c2 and one structure descriptor c3, we have

Formula 2(2)
where, for example, c1 and c2 are amino acid types at positions i and j along the sequence and c3 is the spatial distance between them.

This form can easily be generalized. First, c1, c2, and c3 can be any sequence or structure descriptor. For example, all three can correspond to torsion angle domains, or c1 can correspond to an amino acid type, c2 to a solvent accessibility domain, and c3 to a torsion angle domain. A second way to generalize this form is to consider higher order potentials involving n sequence and structure descriptors. We then get

Formula 3(3)
Increasing n reduces the number of observations of each combination of the ci's in the data set and the statistical significance of the frequencies P(c1,c2,...,cn). When the number of observations is too small, the correction for sparse data (see Methods) becomes important and the potential tends to zero, leading to a complete loss of information. A straightforward solution to this problem involves decomposing the potential into different coupling terms {Delta}Formula 3, and applying the correction for sparse data to each of them separately. In particular, for n = 3:

Formula 4(4)
where the n = 2 coupling terms coincide with the ordinary potentials {Delta}Formula 4(c1,c2) = {Delta}W(c1,c2), and the n = 3 coupling term is defined as

Formula 5(5)
This n = 3 coupling term measures the correlation between the three sequence and structure descriptors c1, c2, and c3, independently of the correlations between c1 and c2, c2 and c3, and c3 and c1. More generally, we can define n-potentials {Delta}W in terms of all k ≤ n coupling terms {Delta}Formula 5:

Formula 6(6)
where the n-coupling terms describing correlations between n descriptors are defined as

Formula 7(7)

To ensure that each contribution is counted only once, the total free energy of a protein of sequence S and structure C, {Delta}W(C,S), is defined as the sum of the total contributions of all coupling terms of order k ≤ n:

Formula 8(8)
where the third sum goes over all combinations of the (ci1,ci2,...,cik) descriptors present in the protein. The value chosen for n depends on the structural descriptors and the level of detail that one wishes to take into account, and also on the limitations arising from the finite size of the database.

Note that it is not always necessary or advantageous to fully decompose the potential functions like in Eqs. 4 and 6. In particular, the coupling terms of the type {Delta}Formula 8(s1,s2), with s1 and s2 being single residues, may reasonably be overlooked. For example, a relevant and commonly used distance potential {Delta}W' (s1,s2,d12) may be defined as

Formula 9(9)
More generally, we denote by {Delta}W' potentials comprising only some of the couplings included in {Delta}W.

Local potentials and couplings
A first application of our general derivation scheme consists in defining local potentials reflecting the correlations among characteristics of residues that are close to each other along the sequence. We focus here on three different residue characteristics: its type s, its backbone conformation t, and its solvent accessibility a (see Methods).

Among the local n = 2 coupling terms of the type {Delta}Formula 9(c1,c2) defined in Eqs. 1 and 7, let us consider first {Delta}Formula 9ts(ti,sj), where c1 is taken to be the backbone conformation of the residue at position i (ti) and c2 the type of the residue at position j (sj). We assume that this effective energy depends only on the relative positions of the residues along the sequence (ij), and not on the precise positions i and j. The total free energy of a given sequence S in a structure C, according to this potential, is computed by summing {Delta}Formula 9ts(ti,sj) over all pairs of positions i and j in S that satisfy the condition |ij| ≤ FLOC, where FLOC is an adjustable parameter taken here equal to 2. This energy function is similar to previously described backbone torsion potentials (16Go,22Go,23Go,36Go). We also compute all other n = 2 coupling terms (except {Delta}Formula 9ss(si,sj), which depends only on the sequence), i.e., {Delta}Formula 9as(ai,sj), {Delta}Formula 9at(ai,tj), {Delta}Formula 9aa(ai,aj) and {Delta}Wtt(ti,tj). Note that when c1 and c2 correspond to the same structure or sequence descriptor, the condition |ij| ≤ FLOC becomes 1 ≤ ij ≤ FLOC.

We would like to stress that summing the energy contributions of all pairs (c1,c2) yields only an approximation of the total free energy of a protein. Indeed, the contributions {Delta}Formula 9ts(ti,sj) and {Delta}Formula 9ts(ti,sk) are in general not independent. Moreover, using simultaneously {Delta}Formula 9ts(ti,sj) and {Delta}Formula 9as(ai,sj) can be advantageous but introduces some redundancy since the solvent accessibility of a residue is related to its backbone conformation. To overcome these dependencies, we must add the n = 3 coupling terms {Delta}Formula 9tts(ti,tj,sk), {Delta}Formula 9tss(ti,sj,sk), {Delta}Formula 9ttt(ti,tj,tk), {Delta}Formula 9aas(ai,aj,sk), {Delta}Formula 9ass(ai,sj,sk), {Delta}Formula 9aaa(ai,aj,ak), {Delta}Formula 9aat(ai,aj,tk), {Delta}Formula 9att(ai,tj,tk) and {Delta}Formula 9ats(ai,tj,sk). They are defined on the basis of Eq. 5 so as to be additive to, and exclusive of, the lower order coupling terms (Eq. 4). The interdependence of the different n = 3 coupling terms can, in turn, be corrected by the use of n = 4 coupling terms.

We assessed the predictive power of the different n = (2Go,3Go,4Go) coupling terms, independently and in combination, on the three groups of decoy sets described in Methods. The performance measures obtained are given in Table 1 for the basic potentials {Delta}Formula 9ts and {Delta}Formula 9as and for the most efficient linear combination of the local coupling terms, named {Delta}W'LOC:

Formula 10(10)
Overall, the predictive power of {Delta}W'LOC is quite impressive: each performance measure indicates a markedly better discrimination of the correct sequence-structure association than with the basic potentials. The only exception is Formula 10 which slightly decreases in the Dseq set.


View this table:
[in this window]
[in a new window]
 
TABLE 1  Performances of local and distance potentials and couplings

 
Strikingly, {Delta}W'LOC includes almost all n = 2 and n = 3 coupling terms. The only exception is {Delta}Formula 10aa, which systematically drags down the predictive power when included in a combination of coupling terms. This follows from the fact that {Delta}Formula 10aa strongly favors situations in which residues close to each other in the sequence have similar solvent accessibilities, and therefore awards very negative energies to (partially) unfolded proteins. The best combination incorporates also several n = 4 coupling terms: {Delta}Formula 10ttts, {Delta}Formula 10aaas, {Delta}Formula 10attt, {Delta}Formula 10aatt, and {Delta}Formula 10aaat. The other n = 4 coupling terms have a negative impact on the predictive power. This is most probably due to the limited size of the data set, which does not allow one to compute precisely enough the probabilities of observing simultaneously four sequence and/or structure descriptors. Also note that there are 20 types of sequence elements (s), whereas only 7 torsion (t) and 5 accessibility (a) domains. Coupling terms involving several sequence elements, such as {Delta}Formula 10tsss or {Delta}Formula 10asss, do not appear in {Delta}W'LOC as they require larger data sets to extract reliable statistics.

In principle, our derivation scheme does not give any reason to under- or overweight some coupling terms with respect to others. However, some contributions may be less/not relevant and should therefore not be included, for example because of the limited size of the data set (e.g., {Delta}Formula 10tsss, {Delta}Formula 10asss,...), the overstabilization of the unfolded state (e.g., {Delta}Formula 10aa), or the uselessness of purely sequence terms (e.g., {Delta}Formula 10ss). Furthermore, sequence-independent terms can be expected to yield interesting results when discriminating among nonprotein-like structures, and to be quite useless in applications such as threading experiments. Testing the potentials on decoy sets can reasonably well be considered as an intermediate case, which probably explains why we observed that underweighting these contributions by a 1/2 factor, in Eq. 10, is advantageous in terms of predictive power.

Distance potentials and couplings
A very popular category of statistical potentials is derived from the spatial distance distribution between residue types (e.g., 16,17,29,37). They are complementary to the local potentials presented above. It has been previously noted that such potentials do not represent the "true" energy of interaction between two residues (or two atoms) as if they where in a vacuum, but rather an effective energy including the influence of a mean protein and solvent environment (38Go,39Go). As a consequence, these potentials may depend on some characteristics of the proteins from which they are derived, such as their size (40Go–42Go) or their content in secondary structures (14Go,42Go–44Go). The idea of being more precise on the definition of the environment that is actually "felt" by the two interacting residues is not new (16Go–18Go), and can have a positive impact on the performances of the potentials. We show that the formalism presented in this article can be applied to define residue pair distance potentials that take appropriately into account the influence of the specific environment in which the two residues are located. This environment is here represented by backbone conformations and solvent accessibilities.

The n = 2 coupling term {Delta}Formula 10sd(si,dij) is a "one-body" distance potential that reflects the preferences of each type of residue to be located more or less close to other residues, whatever their type, and is therefore dominated by the hydrophobic effect. For residues close to each other along the sequence, i.e., |i–j| ≤ FDIS (taken here equal to 8), the frequencies and potentials are computed separately, whereas they are merged in a single class when |i–j| > FDIS. The total contribution to the free energy of a given sequence S in a structure C is computed by summing {Delta}Formula 10sd(si,dij) over all pairs of positions i and j in S that satisfy the condition |i–j| > 1.

On its own, {Delta}Formula 10sds(si,dij,sj) is a two-body distance potential that excludes the one-body contributions reflecting the individual preferences of the two amino acids si and sj. Such a potential has been presented previously and shown to describe more accurately the electrostatic interactions (42Go). In this case, by reason of symmetry, the condition |i–j| > 1 becomes i–j > 1 when computing the total free energy of a protein. Coupling {Delta}Formula 10sd(si,dij) with {Delta}Formula 10sds(si,dij,sj) yields the common distance potential given in Eq. 9.

In a similar way, it is possible to define sequence-independent distance potentials involving the backbone torsion angles, {Delta}Formula 10td and {Delta}Formula 10tdt, or the solvent accessibilities, {Delta}Formula 10ad and {Delta}Formula 10ada. The concomitant use of these three types of potentials is hazardous since the backbone conformation and solvent accessibility of a residue are clearly dependent on its amino acid type, and some contributions are therefore overcounted. To deal with this problem, we have to define higher order coupling terms. The highest order coupling term is in this case the n = 7 term {Delta}Formula 10atsdats(ai,ti,si,dij,aj,tj,sj). Considering all the lower level coupling terms would lead to a very large number of energetic functions and hamper any intuitive understanding of their significance. Among these, we choose to disregard all distance-independent terms, as they are redundant with the local potentials defined in the previous section for |i–j| ≤ FLOC, and the contributions for other i and j may reasonably be assumed to be negligible. Moreover, to avoid overloading the notations, two-body asymmetrical terms, such as {Delta}Formula 10ads(ai,dij,sj) or {Delta}Formula 10asds(ai,si,dij,sj), are not considered independently but grouped with the closest symmetrical coupling term, here {Delta}Formula 10asdas(ai,si,dij,aj,sj). We thus define {Delta}Wasdas(ai,si,dij,aj,sj) as the sum of {Delta}Formula 10asdas(ai,si,dij,aj,sj) and all the lower order asymmetrical two-body terms. Note finally that, given the limited size of the database, {Delta}Formula 10atsd(ai,ti,si,dij) and {Delta}Watsdats(ai,ti,si,dij,aj,tj,sj) are computed as contact potentials, where dij takes only two possible values: lower or larger than 8 Å.

Overall, according to our performance test on the three groups of decoy sets, the best combination of distance potentials and coupling terms is {Delta}W'DIST, defined as

Formula 11(11)
where the terms {Delta}Formula 11ad and {Delta}Formula 11asd are only included for short-range interactions (SR), that is, when the considered residues are separated by no more than FDIST positions along the sequence. As shown in Table 1, the improvement of the predictive power with respect to the basic distance potential {Delta}Formula 11sd + {Delta}Formula 11sds is substantial in the two decoy sets based on structural modifications (Formula 11 and Formula 11). However, it appears that {Delta}Formula 11sd + {Delta}Formula 11sds performs slightly better than {Delta}W'DIST in the third decoy set. Since these decoys are obtained by modifications of the sequence, the sequence-independent terms ({Delta}Formula 11td, {Delta}Formula 11tdt, {Delta}Formula 11ad,...) are not taken into account in the evaluation of the energies, which may limit the necessity of using coupling terms such as {Delta}Formula 11tsd, {Delta}Formula 11asd, or {Delta}Formula 11tsdts.

Interestingly, as with {Delta}W'LOC, almost all coupling terms are included in the best performing combination, {Delta}W'DIST. This provides a strong support to the legitimacy of our derivation procedure. The only exceptions are {Delta}Formula 11ada and {Delta}Watsdats. The former strongly favors situations where residues close in space have similar solvent accessibilities, which is a characteristic of both folded and unfolded states. The relevance of the latter is obviously compromised by the limited size of the data set. On the other hand, the terms {Delta}Formula 11ad and {Delta}Formula 11asd are only included for short-range interactions. Indeed, for long-range interactions, the separation in sequence is not explicitly taken into account, and {Delta}Formula 11ad merely reflects a trivial correlation: residues with a higher solvent accessibility have fewer contacts with other residues. For those residue pairs that do not benefit from the {Delta}Formula 11ad term, it also appears that {Delta}Formula 11asd is unnecessary, as its aim is to uncouple {Delta}Formula 11ad and {Delta}Formula 11sd.

Combination of local and distance potentials
The combination of the best performing local and distance potentials, {Delta}W'LOC and {Delta}W'DIST, improves their individual scores, as seen in Table 1. We did not address explicitly the issue of possible redundancies between these two types of potentials. However, in itself, the use of distance coupling terms significantly limits this problem. For example, a relatively strong correlation is observed between {Delta}Formula 11as and {Delta}Formula 11sd, but {Delta}Formula 11as and ({Delta}Formula 11sd + {Delta}Formula 11asd) are only weakly correlated. Overall, the performances of the combination {Delta}W'LOC + {Delta}W'DIST are very impressive, as exemplified by average Z-scores of –5.25, –2.65, and –2.74, on the three groups of decoy sets.

Comparison with other statistical potentials
A large number of knowledge-based potentials reflecting the preferences of the different amino acids (or of short stretches of amino acids) to adopt particular local conformations (16Go,22Go,23Go,36Go), to be more or less accessible to the solvent (16Go,17Go,45Go,46Go), or to be separated by a given spatial distance (16Go,17Go,29Go,30Go,37Go) have been described in the literature. However, to our knowledge, our approach is the first to integrate all these different types of contributions in a single energetic function while taking special care of their couplings. Moreover, on the local level, the nonadditivity of contributions related to pairs of residues, such as {Delta}Formula 11ts(ti,sj) and {Delta}Formula 11ts(ti,sk), is taken care of by the use of higher order coupling terms ({Delta}Formula 11tss(ti,sj,sk), {Delta}Formula 11tts(ti,tj,sk),...).

Among the local potentials based on backbone torsion angles that have been described earlier, let us cite the residue-to-torsion (22Go) and the torsion-to-residue (16Go) potentials, developed by one of us. As seen in Table 2, (a) and (b), both potentials can be expressed as simple combinations of the coupling terms {Delta}Formula 11ts, {Delta}Formula 11tss, and {Delta}Formula 11tts. Miyazawa and Jernigan designed a more complex torsion potential (23Go), based on a reference state that is quite different from ours and on different values of the structural descriptors. A rigorous comparison of the two approaches is therefore difficult. However, a common feature is the expression of the energetic function as a sum of basic potentials and of higher order coupling terms defined so as to exclude the more basic contributions. In this sense, their potential can be compared to the combination of coupling terms {Delta}Formula 11 given in Table 2 (c).


View this table:
[in this window]
[in a new window]
 
TABLE 2  Correspondence with other statistical potentials

 
Most commonly used distance and contact potentials (16Go,29Go,37Go) can be written as a simple sum of {Delta}Formula 11 coupling terms as described in Eq. 9, sometimes with a different reference state. In addition, more sophisticated distance potentials that take into account the solvent accessibilities or the conformations of the residues also appear as particular cases of our formalism. A first example is the "Cµ-Cµ core/surface" potential of Kocher et al. (16Go), which is derived separately for residue pairs that are buried or on the surface of the protein (Table 2 (d)). In the same line of thought, the energy function presented by Simons et al. (17Go) is composed of an environment term, comparable to {Delta}Formula 11as (with FLOC = 0), and a pair term based on the spatial distance separating two residues in specific environments and designed to avoid redundancy with the environment term. This energy function is equivalent to the combination given in Table 2 (e), where {Delta}Formula 11ass and {Delta}Formula 11asas are distance-independent contributions included in the distance potential, which do not correspond to local potentials since the sequence separation ij is not taken into account. Furthermore, Zhang and Kim estimated contact energies between residue pairs, depending on the conformations of their main chain (ERCE: Environment-Independent Residue Contact Energies) (18Go). To do this, they combined the 20 amino acid types with 3 structural states ({alpha}-helix, ß-sheet, and turn) to define an extended 60-residue alphabet. This approach can easily be translated into a combination of {Delta}Formula 11 coupling terms, as described in Table 2 (f). Finally, several authors derived distance potentials from data sets containing only {alpha}- or only ß-proteins (14Go,42Go–44Go). The basic potential defined in Eq. 9, when derived separately on a subset of the database ({alpha}- or ß-proteins), becomes –kT ln(P(si,sj,dij|ti,tj)/P(si,sj|ti,tj)P(dij|ti,tj)), where (ti,tj) refers to the global secondary structure content of the protein. With such a definition, this distance potential is equivalent to the combination given in Table 2 (g).

Regarding the increase in performances provided by our new derivation scheme, the results summarized in Table 1 are unambiguous: {Delta}W'LOC, {Delta}W'DIST, and especially {Delta}W'LOC + {Delta}W'DIST are superior to common distance and local potentials such as {Delta}Formula 11sd + {Delta}Formula 11sds, {Delta}Formula 11as, and {Delta}Formula 11ts. This comparison can be considered as fair, given that all these potentials are derived from the same data set, using the same type of reference state, structural descriptors, and adjustable parameters. Another way to assess the performances of the potentials is to look at previously published tests on the same groups of decoy sets. This comparison has nevertheless the drawback that the effects of derivation scheme, reference state, and other parameters are mixed.

Several potentials have been tested on the group of decoy sets Formula 11 (30Go,47Go); the results are summarized in Table 3. According to this test, our distance potential {Delta}W'DIST is clearly superior to every other residue-based distance or contact potential given in Table 3, as indicated by all available measures except S–1 in the case of TE-13 and DFIRE-B. This difference is even more manifest when we consider the combination {Delta}W'LOC + {Delta}W'DIST. Table 3 also suggests that atom-based potentials perform on the average better than potentials considering only one interaction center per residue. Even so, the residue-based combination {Delta}W'DIST appears markedly more efficient than the RAPDF and KBP potentials. The good performances of the potentials DFIRE-A and DFIRE-B seem to result from the use of a particular reference state, defined in such a way that the effective energy associated to a pair of atoms (or residues) tends to zero when the distance separating them approaches 15 Å (47Go). Let us also note that another statistical potential, based on a detailed (atomic) representation of protein structures and designed to describe H-bonds as precisely as possible, has been recently tested on the Formula 11 group of decoy sets (19Go). The results were slightly better than with our potentials (<Z> = –3.34 and S–1 = 92%, whereas <Z> = –2.65 and S–1 = 92% are obtained with {Delta}W'LOC + {Delta}W'DIST). It is not surprising that better predictive capabilities can be obtained with potentials based on a more detailed structural representation, but it should be stressed that a higher level of detail inevitably induces drastic limitations of the application possibilities.


View this table:
[in this window]
[in a new window]
 
TABLE 3  Comparison with the performances of other statistical potentials

 

    DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The most exciting result of this study is the definition of a general derivation scheme that allows one to define statistical potentials taking into account the interdependence of correlations among several different sequence or structure descriptors. To demonstrate its interest, we applied this formalism and generated combinations of local and distance potentials that perform strikingly well in discriminating genuine proteins from decoy models.

Our derivation scheme is mainly based on the decomposition of a complex potential into a sum of lower order terms, through the expression of products of probabilities. This decomposition gives the possibility to analyze independently each contribution and clarify its significance and importance. It also offers several valuable advantages in terms of predictive power. First of all, according to the choice of the sequence/structure descriptors, the decomposition may be absolutely necessary to avoid overcounting certain contributions. To clarify this point, let us focus on the correlations between one residue type, s, and two backbone conformations, t. The correct contribution to the total free energy of a protein is given by Eq. 8, in this particular case: {Delta}Formula 11tts(C,S) = {Sigma}i,j {Delta}Formula 11ts(ti,sj) + {Sigma}i,j {Delta}Formula 11tt(ti,tj) + {Sigma}i,j,k{Delta}Formula 11tts(ti,tj,sk). In contrast, if the potential function {Delta}Formula 11tts(ti,tj,sk) was not decomposed and was summed over all triplets of positions (i,j,k), each {Delta}Formula 11ts and {Delta}Formula 11tt contribution would be counted several times.

Secondly, the decomposition we propose allows one to deal much more efficiently with the limited size of the database since the correction for sparse data (see Methods) is applied to each coupling term rather than on the whole energy function. For example, the distance potential {Delta}Watsdats(ai,ti,si,dij,aj,tj,sj) can be expressed as a sum of many n-coupling terms, ranging from n = 2 to n = 7, or computed directly from Eq. 3. If the database is large enough, these two possibilities are equivalent. But if the number of observations of a given combination of values of (ai,ti,si,dij,aj,tj,sj) is too small, the correction for sparse data will make {Delta}Formula 11atsdats(ai,ti,si,dij,aj,tj,sj) tend to zero, but not {Delta}Watsdats(ai,ti,si,dij,aj,tj,sj) unless it is computed directly through Eq. 3. In the latter case, the fact that the database is too small to reliably extract the higher order couplings actually leads to a consequent loss of valuable information about the lower order contributions. Finally, the decomposition makes it possible to modulate the reference state, by excluding some contributions (such as {Delta}Formula 11aa, {Delta}Formula 11ada,...) that do not appear to be relevant and decrease the overall predictive power.

The comparison with other potentials described in the literature underlines the generality of our approach, for previous potentials based on several sequence or structure descriptors can be expressed as particular cases of our formalism. This comparison also shows that we significantly raised the expectations regarding the predictive power of residue-based potentials. Indeed, our energetic functions even outperform some potentials that are based on a more detailed representation of protein structures at the atomic level.

Several improvements may still be envisaged. Indeed, our derivation scheme can easily be adapted to develop energy functions dealing with a more detailed representation of protein structures, or based on another, possibly more relevant, reference state. It is also straightforward to include additional structural descriptors, reflecting, for example, the relative orientations of interacting side chains or the relative positions of triplets of residues.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank J. M. Kwasigroch for his help with computers and web servers, and acknowledge support from the Communauté Française de Belgique through the Action de Recherche Concertée No. 02/07-289, and from the European Community through the Concerted Action Quality of Life 2001-3-8.4.

M.R. is research director at the Belgian National Fund for Scientific Research.

Submitted on December 9, 2005; accepted for publication February 28, 2006.


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
1. Brooks, B. R., R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. 1983. CHARMM: a program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4:187–217.[CrossRef]

2. Halgren, T. A. 1995. Potential energy functions. Curr. Opin. Struct. Biol. 5:205–210.[CrossRef][Medline]

3. Mackerell, A. D., Jr. 2004. Empirical force fields for biological macromolecules: overview and issues. J. Comput. Chem. 25:1584–1604.[CrossRef][Medline]

4. Go, N. 1983. Theoretical studies of protein folding. Annu. Rev. Biophys. Bioeng. 12:183–210.[CrossRef][Medline]

5. Galzitskaya, O. V., and A. V. Finkelstein. 1999. A theoretical search for folding/unfolding nuclei in three-dimensional protein structures. Proc. Natl. Acad. Sci. USA. 96:11299–11304.[Abstract/Free Full Text]

6. Alm, E., and D. Baker. 1999. Prediction of protein-folding mechanisms from free-energy landscapes derived from native structures. Proc. Natl. Acad. Sci. USA. 96:11305–11310.[Abstract/Free Full Text]

7. Munoz, V., and W. A. Eaton. 1999. A simple model for calculating the kinetics of protein folding from three-dimensional structures. Proc. Natl. Acad. Sci. USA. 96:11311–11316.[Abstract/Free Full Text]

8. Wodak, S., and M. Rooman. 1993. Generating and testing protein folds. Curr. Opin. Struct. Biol. 3:249–259.

9. Sippl, M. J. 1995. Knowledge-based potentials for proteins. Curr. Opin. Struct. Biol. 5:229–235.[CrossRef][Medline]

10. Jernigan, R. L., and I. Bahar. 1996. Structure-derived potentials and protein simulations. Curr. Opin. Struct. Biol. 6:195–209.[CrossRef][Medline]

11. Moult, J. 1997. Comparison of database potentials and molecular mechanics force fields. Curr. Opin. Struct. Biol. 7:194–199.[CrossRef][Medline]

12. Russ, W. P., and R. Ranganathan. 2002. Knowledge-based potential functions in protein design. Curr. Opin. Struct. Biol. 12:447–452.[CrossRef][Medline]

13. Miyazawa, S., and R. L. Jernigan. 1996. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256:623–644.[CrossRef][Medline]

14. Furuichi, E., and P. Koehl. 1998. Influence of protein structure databases on the predictive power of statistical pair potentials. Proteins. 31:139–149.[CrossRef][Medline]

15. Melo, F., R. Sanchez, and D. Sali. 2002. Statistical potentials for fold assessment. Protein Sci. 11:430–448.[Abstract/Free Full Text]

16. Kocher, J.-P., M. J. Rooman, and S. J. Wodak. 1994. Factors influencing the ability of knowledge-based potentials to identify native sequence-structure matches. J. Mol. Biol. 235:1598–1613.[CrossRef][Medline]

17. Simons, K. T., C. Kooperberg, E. Huang, and D. Baker. 1997. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268:209–225.[CrossRef][Medline]

18. Zhang, C., and S.-H. Kim. 2000. Environment-dependent residue contact energies for proteins. Proc. Natl. Acad. Sci. USA. 97:2550–2555.[Abstract/Free Full Text]

19. Kortemme, T., A. V. Morozov, and D. Baker. 2003. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J. Mol. Biol. 326:1239–1259.[CrossRef][Medline]

20. Buchete, N. V., J. E. Straub, and D. Thirumalai. 2004. Orientational potentials extracted from protein structures improve native fold recognition. Protein Sci. 13:862–874.[Abstract/Free Full Text]

21. Miyazawa, S., and R. L. Jernigan. 2005. How effective for fold recognition is a potential of mean force that includes relative orientations between contacting residues in proteins. J. Chem. Phys. 122:24901–24918.[CrossRef]

22. Rooman, M. J., J.-P. A. Kocher, and S. J. Wodak. 1991. Prediction of backbone conformation based on seven structure assignments. Influence of local interactions. J. Mol. Biol. 221:961–979.[CrossRef][Medline]

23. Miyazawa, S., and R. L. Jernigan. 1999. Evaluation of short-range interactions as secondary structure energies for protein fold and sequence recognition. Proteins. 36:347–356.[CrossRef][Medline]

24. Ramachandran, G., and V. Sasilekharan. 1968. Conformation of peptides and proteins. Adv. Protein Chem. 23:283–438.[Medline]

25. Kabsch, W., and C. Sander. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 22:2577–2637.[CrossRef][Medline]

26. Rose, G. D., A. R. Geselowitz, G. J. Lesser, R. H. Lee, and M. H. Zehfus. 1985. Hydrophobicity of amino acid residues in globular proteins. Science. 229:834–838.[Abstract/Free Full Text]

27. Wang, G., and R. Dunbrack. 2003. PISCES: a protein sequence culling server. Bioinformatics. 19:1589–1591.[Abstract/Free Full Text]

28. Hendrick, K., and J. M. Thornton. 1998. PQS: a protein quaternary structure file server. Trends Biochem. Sci. 23:358–361.[CrossRef][Medline]

29. Sippl, M. J. 1990. Calculation of conformational ensemble from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 213:859–883.[Medline]

30. Tobi, D., and R. Elber. 2000. Distance-dependent, pair potential for protein folding: results from linear optimization. Proteins. 41:40–46.[CrossRef][Medline]

31. Samudrala, R., and M. Levitt. 2000. Decoys‘R’Us: a database of incorrect conformations to improve protein structure prediction. Protein Sci. 9:1399–1401.[Abstract]

32. Park, B., and M. Levitt. 1996. Energy functions that discriminate X-ray and near native folds from well-constructed decoys. J. Mol. Biol. 258:367–392.[CrossRef][Medline]

33. Simons, K. T., I. Ruczinski, C. Kooperberg, B. A. Fox, C. Bystroff, and D. Baker. 1999. Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins. 34:82–95.[CrossRef][Medline]

34. Keasar, C., and M. Levitt. 2003. A novel approach to decoy set generation: designing a physical energy function having local minima with native structure characteristics. J. Mol. Biol. 329:159–174.[CrossRef][Medline]

35. Tsai, J., R. Bonneau, A. V. Morozov, B. Kuhlman, C. A. Rohl, and D. Baker. 2003. An improved protein decoy set for testing energy functions for protein structure prediction. Proteins. 53:76–87.[CrossRef][Medline]

36. Kang, H. S., A. Kurochkina, and B. Lee. 1993. Estimation and use of protein backbone angle probabilities. J. Mol. Biol. 229:448–460.[CrossRef][Medline]

37. Bahar, I., and R. L. Jernigan. 1997. Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J. Mol. Biol. 266:195–214.[CrossRef][Medline]

38. Zhang, L., and J. Skolnick. 1996. How do potentials derived from structural databases relate to "true" potentials. Protein Sci. 7:1201–1207.

39. Shan, Y., and H.-X. Zhou. 2000. Correspondence of potentials of mean force in proteins and in liquids. J. Chem. Phys. 113:457–469.[CrossRef]

40. Thomas, P. D., and K. A. Dill. 1996. Statistical potentials extracted from protein structures: how accurate are they? J. Mol. Biol. 257:457–469.[CrossRef][Medline]

41. Dehouck, Y., D. Gilis, and M. Rooman. 2004. Database-derived potentials dependent on protein size for in silico folding and design. Biophys. J. 87:171–181.[Abstract/Free Full Text]

42. Rooman, M., and D. Gilis. 1998. Different derivations of knowledge-based potentials and analysis of their robustness and context-dependent predictive power. Eur. J. Biochem. 254:135–143.[Medline]

43. Godzik, A., A. Kolinski, and J. Skolnick. 1995. Are proteins ideal mixtures of amino acids? Analysis of energy parameter sets. Protein Sci. 4:2107–2117.[Abstract]

44. Zhang, C., S. Liu, H. Zhou, and Y. Zhou. 2004. The dependence of all-atom statistical potentials on structural training database. Biophys. J. 86:3349–3358.[Abstract/Free Full Text]

45. Bowie, J. U., R. Luthy, and D. Eisenberg. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science. 253:164–170.[Abstract/Free Full Text]

46. Summa, C. M., M. Levitt, and W. F. DeGrado. 2005. An atomic environment potential for use in protein structure prediction. J. Mol. Biol. 352:986–1001.