Unité de Bioinformatique génomique et structurale, Université Libre de Bruxelles, 1050 Brussels, Belgium
Correspondence: Address reprint requests to Yves Dehouck, E-mail: ydehouck{at}ulb.ac.be.
| ABSTRACT |
|---|
| INTRODUCTION |
|---|
[...] Gō-like potentials (4 [...]
In the last few years, a number of more complex potentials have been designed with the aim of exploiting more efficiently the large amount of available structural data and dealing with couplings between different structural features. Among those, let us cite distance or contact potentials that depend on the solvent accessibility of the residues (16,17), on the conformation of their main chain (18), or on the relative orientation of their side chains (19–21). On the other hand, potentials describing the propensities of the different amino acid types to adopt certain backbone conformations, which simultaneously take into account the nature and/or conformation of several neighboring residues, have also been developed (16,22,23). A major difficulty that frequently arises in such studies is that the number of proteins in the database rapidly becomes too small as the complexity of a potential increases. One faces a delicate choice: the use of a more complex potential can be quite advantageous for common values of the sequence and structure descriptors (e.g., an Ala-Ala pair associated with α-helical conformations), and quite detrimental in other cases (e.g., a Trp-Trp pair associated with some rare turn conformations). The usual answer to this dilemma is to drastically limit the description of the conformational space, for example by restricting the backbone to three possible conformations or the solvent accessibility to two different bins, or by deriving contact potentials rather than distance-dependent ones.
We present here a general derivation scheme that allows one to bypass this issue, and to build statistical energy functions based simultaneously on several sequence and structure descriptors without altering the efficiency of the elementary contributions when the values taken by these descriptors are not frequent enough in the database of known protein structures. We apply our procedure to generate statistical potentials based on the correlations among amino acid types, backbone conformations, and solvent accessibilities of residues close to each other in the sequence and/or in space. The resulting energy function displays a strongly improved ability to discriminate genuine proteins from decoy models. All potentials presented in this article are freely available at http://babylone.ulb.ac.be/StatPots.
| METHODS |
|---|
[...] These values are grouped in seven domains corresponding to distinct regions on the Ramachandran map (22 [...] ai ≤ 5%, 5% < ai ≤ 15%, 15% < ai ≤ 30%, 30% < ai ≤ 50%, and 50% < ai. The interresidue distance dij is computed between the average side-chain centroids, noted Cµ, of the residues at positions i and j. The Cµ corresponds to the geometric center of the heavy side-chain atoms of a given amino acid type, averaged over all side-chain conformations in a data set of known structures (16).
Protein structure data set
An initial set of 1522 high-resolution (≤2 Å) x-ray structures of protein chains with <20% pairwise sequence identity was extracted in October 2003 from the website "Culling the PDB by Resolution and Sequence Identity" (27) (http://dunbrack.fccc.edu/Guoli/pisces_download.php). All structures containing more than 5% heteroatoms or nonnatural residues were excluded. This led to a final set of 1403 protein chains. Furthermore, to ensure that the data set used to derive the potentials includes the proper, active, quaternary conformations of the selected proteins, the coordinates were taken from the "Protein Quaternary Structure" server (28) (http://pqs.ebi.ac.uk).
Correction for sparse data
All database-derived potentials and coupling terms presented here can be generically written as ΔW = −kT ln(nobs/nexp), where nobs is the number of observations of a given association of sequence and structure descriptors in the data set of known protein structures, and nexp is the corresponding number expected in a reference state. To deal with the limited size of the data set, a correction for sparse data (29) is applied: (nobs/nexp) → ((σ + nobs)/(σ + nexp)), where σ is an adjustable parameter, taken equal to 20 for local potentials, and 10 for distance potentials (see Results for the definition of local and distance potentials). This correction ensures that the potentials tend to 0 when the number of observations in the data set is too small.
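The effect of this correction can be illustrated with a minimal Python sketch. The function name `corrected_potential`, the argument name `sigma` for the adjustable parameter, and the choice kT = 1 are assumptions made for illustration only; the counts are invented toy values, not data from this work.

```python
import math


def corrected_potential(n_obs: float, n_exp: float, sigma: float, kT: float = 1.0) -> float:
    """Return -kT * ln((sigma + n_obs) / (sigma + n_exp)).

    `sigma` stands for the adjustable sparse-data parameter
    (20 for local potentials and 10 for distance potentials in the text).
    """
    if n_obs < 0 or n_exp < 0 or sigma <= 0:
        raise ValueError("counts must be non-negative and sigma positive")
    return -kT * math.log((sigma + n_obs) / (sigma + n_exp))


# Same observed/expected ratio (2:1), very different amounts of data.
sparse = corrected_potential(n_obs=2, n_exp=1, sigma=20)          # close to 0
abundant = corrected_potential(n_obs=2000, n_exp=1000, sigma=20)  # close to -kT ln 2
```

With few observations the corrected value stays near zero, whereas with abundant data it approaches the uncorrected −kT ln(nobs/nexp).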
Decoy sets
To assess the performance of the potentials, we evaluate their ability to single out correct sequence-structure matches from sets of decoy models. Three groups of decoy sets are considered. The first group includes 25 proteins (30,31), each associated with hundreds of alternative structures generated by different modeling methods (4state_reduced (32): 1ctf, 1r69, 1sn3, 2cro, 4pti, and 4rxn; fisa (33): 1fc2-c, 1hdd-c, 2cro; fisa_casp3 (33): 1bg8-a, 1bl0, 1jwe; lattice-ssfit (31): 1ctf, 1dkt-a, 1fca, 1nlk, 1pgb, 1trl-a; lmds (34): 1ctf, 1dtk, 1fc2-c, 1igd, 1shf-a, 2cro, 2ovo). The second group includes 25 proteins (35), each associated with ~2000 alternative structures generated by the Rosetta structure prediction method (1a32, 1ail, 1am3, 1cc5, 1cei, 1hyp, 1flb, 1mzm, 1r69, 1utg, 1ctf, 1dol, 1orc, 1pgx, 1ptq, 1tif, 1vcc, 2fxb, 5icb, 1bq9, 1csp, 1msi, 1tuc, 1vif, 5pti). The third group, noted Dseq, includes 50 proteins (1ptq, 1d0d, 2igd, 1g2b, 1orc, 1hz6, 1i27, 1hoe, 1luz, 1ugi, 1aba, 1cy5, 1lpl, 1mk0, 1h7m, 1bm8, 1l8r, 1lyq, 1o13, 1gmx, 1cew, 1hxi, 1nyc, 1by2, 1lsl, 1o7i, 1gnu, 1fc3, 1mai, 1dzo, 1lwb, 1huf, 1nwz, 3nul, 1cuo, 1jf8, 1p0z, 1mdc, 1vsr, 1gmi, 1eca, 1j9b, 1kmt, 1mzg, 1oz9, 1h6h, 1l2h, 1srv, 2hbg, 1amx), each associated with 1000 decoys obtained by maintaining the structure and randomizing the amino acid sequence with fixed amino acid composition. To render the test more challenging, only a fraction of the sequence was modified. This fraction was chosen randomly between 25% and 100%, independently for each decoy.
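One way to generate such sequence decoys is sketched below in Python. How the modified positions were actually selected is not stated in the text; this sketch assumes a uniformly random subset of positions whose residues are shuffled among themselves, which keeps the amino acid composition fixed. The function name `make_sequence_decoy` is hypothetical.

```python
import random


def make_sequence_decoy(sequence: str, rng: random.Random) -> str:
    """Shuffle the residues at a random subset of positions.

    The fraction of positions involved is drawn uniformly between 25% and 100%.
    Because residues are only permuted, the composition is unchanged.
    Note that a shuffled position may, by chance, keep its original residue.
    """
    fraction = rng.uniform(0.25, 1.0)
    n_mod = min(len(sequence), max(2, round(fraction * len(sequence))))
    positions = rng.sample(range(len(sequence)), n_mod)
    residues = [sequence[p] for p in positions]
    rng.shuffle(residues)
    decoy = list(sequence)
    for position, residue in zip(positions, residues):
        decoy[position] = residue
    return "".join(decoy)


native = "ACDEFGHIKLMNPQRSTVWY" * 2  # toy sequence, not one of the test proteins
decoys = [make_sequence_decoy(native, random.Random(seed)) for seed in range(1000)]
```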
To avoid any bias toward the native structure or wild-type sequence that might result from the presence of similar proteins in the data set, an extended jackknife procedure is applied: we remove the target protein, as well as all proteins sharing more than 20% sequence identity with the target, from the database before deriving the potentials.
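The extended jackknife can be sketched in a few lines of Python. The `identity` callable, which returns the pairwise sequence identity as a fraction, is a hypothetical helper supplied by the caller; the text does not say how identities were computed.

```python
from typing import Callable, Iterable, List


def jackknife_database(
    database: Iterable[str],
    target: str,
    identity: Callable[[str, str], float],
    max_identity: float = 0.20,
) -> List[str]:
    """Keep only chains that are neither the target nor >20% identical to it."""
    return [
        chain
        for chain in database
        if chain != target and identity(chain, target) <= max_identity
    ]


# Toy identities (invented values) between database chains and the target "T".
toy_identity = {"A": 0.10, "B": 0.35, "C": 0.20, "T": 1.00}
kept = jackknife_database(["A", "B", "C", "T"], "T", lambda a, b: toy_identity[a])
```

The potentials would then be derived from `kept` only, once per target protein.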
Performance measures
We use five different measures to evaluate the ability of the potentials to discriminate the native structure from the decoys:
⟨Z⟩ is the average Z-score over all proteins in a group of decoys. The Z-score is defined as Z = (ΔWc − ⟨ΔW⟩)/σΔW, where ΔWc is the free energy of the correct sequence-structure association, ⟨ΔW⟩ is the average free energy of all sequence-structure associations, and σΔW is the associated standard deviation. Energy functions that discriminate the genuine protein well from the decoys are characterized by a very negative Z-score.
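The Z-score computation can be sketched as follows in Python. Two points are assumptions of this sketch rather than statements of the text: the native association is included among "all sequence-structure associations" when computing the mean and standard deviation, and the population standard deviation is used.

```python
import statistics
from typing import Sequence


def z_score(native_energy: float, decoy_energies: Sequence[float]) -> float:
    """Z = (W_native - mean(W)) / std(W), over native plus decoys."""
    energies = [native_energy, *decoy_energies]
    mean = statistics.fmean(energies)
    spread = statistics.pstdev(energies)
    if spread == 0:
        raise ValueError("all energies are identical; the Z-score is undefined")
    return (native_energy - mean) / spread


def average_z(z_scores: Sequence[float]) -> float:
    """Average Z-score over all proteins of a group of decoy sets."""
    return statistics.fmean(z_scores)


z = z_score(-10.0, [0.0, 1.0, 2.0, 3.0, 4.0])  # toy energies; the native is lowest
```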
⟨Zx⟩ evaluates the ability of the potentials to select the decoys that are closest to the native structure among the complete decoy set. Zx is defined as (⟨ΔW⟩5% − ⟨ΔW⟩)/σΔW, where ⟨ΔW⟩5% is the average free energy computed on a subset including 5% of the decoys (19 [...]
[...] is equal to the percentage of proteins for which Zx is lower than 1 (19 [...]
| RESULTS |
|---|
[Equation 1]
[Equation 2]
This form can easily be generalized. First, c1, c2, and c3 can be any sequence or structure descriptor. For example, all three can correspond to torsion angle domains, or c1 can correspond to an amino acid type, c2 to a solvent accessibility domain, and c3 to a torsion angle domain. A second way to generalize this form is to consider higher order potentials involving n sequence and structure descriptors. We then get
[Equation 3]
[...] ΔW̃, and applying the correction for sparse data to each of them separately. In particular, for n = 3:
[Equation 4]
[...] ΔW̃(c1,c2) = ΔW(c1,c2), and the n = 3 coupling term is defined as
[Equation 5]
[...] ΔW in terms of all k ≤ n coupling terms ΔW̃:
[Equation 6]
[Equation 7]
To ensure that each contribution is counted only once, the total free energy of a protein of sequence S and structure C, ΔW(C,S), is defined as the sum of the total contributions of all coupling terms of order k ≤ n:
[Equation 8]
Note that it is not always necessary or advantageous to fully decompose the potential functions as in Eqs. 4 and 6. In particular, the coupling terms of the type ΔW̃(s1,s2), with s1 and s2 being single residues, may reasonably be overlooked. For example, a relevant and commonly used distance potential ΔW'(s1,s2,d12) may be defined as
[Equation 9]
[...] ΔW' potentials comprising only some of the couplings included in ΔW.
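The relation between a potential and its coupling terms can be illustrated with a small Python sketch. It rests on assumptions that are not spelled out in the surviving text: the potential on n descriptors is taken as −kT ln of the joint probability divided by the product of the single-descriptor probabilities, and each coupling term is obtained by inclusion-exclusion so that the potential equals the sum of all coupling terms of order 2 to n. No sparse-data correction is applied here, and the samples are invented toy data.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Sample = Tuple[str, ...]


def potential(samples: Sequence[Sample], idx: Sequence[int],
              values: Sequence[str], kT: float = 1.0) -> float:
    """-kT ln[P(joint) / product of marginal P], from raw frequencies."""
    n = len(samples)
    joint = sum(all(s[i] == v for i, v in zip(idx, values)) for s in samples) / n
    reference = math.prod(sum(s[i] == v for s in samples) / n
                          for i, v in zip(idx, values))
    return -kT * math.log(joint / reference)  # fails if the joint count is 0


def coupling(samples: Sequence[Sample], idx: Sequence[int],
             values: Sequence[str], kT: float = 1.0) -> float:
    """Coupling term of order len(idx): the potential minus all lower-order
    couplings, written as an alternating sum over sub-combinations."""
    n = len(idx)
    total = 0.0
    for k in range(2, n + 1):
        for sub in combinations(range(n), k):
            sign = (-1) ** (n - k)
            total += sign * potential(samples, [idx[j] for j in sub],
                                      [values[j] for j in sub], kT)
    return total


# Toy observations of (torsion domain, residue type, accessibility bin).
samples = ([("h", "A", "b")] * 6 + [("h", "A", "e")] * 2 + [("h", "G", "e")]
           + [("c", "G", "e")] * 5 + [("c", "A", "b")] + [("c", "G", "b")])
idx, values = [0, 1, 2], ["h", "A", "b"]

full = potential(samples, idx, values)
decomposed = sum(
    coupling(samples, [idx[j] for j in sub], [values[j] for j in sub])
    for k in (2, 3) for sub in combinations(range(3), k)
)
```

In this sketch `full` and `decomposed` agree to rounding error, and a coupling term of order 2 coincides with the corresponding pair potential.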
Local potentials and couplings
A first application of our general derivation scheme consists in defining local potentials reflecting the correlations among characteristics of residues that are close to each other along the sequence. We focus here on three different residue characteristics: its type s, its backbone conformation t, and its solvent accessibility a (see Methods).
Among the local n = 2 coupling terms of the type ΔW̃(c1,c2) defined in Eqs. 1 and 7, let us consider first ΔW̃ts(ti,sj), where c1 is taken to be the backbone conformation of the residue at position i (ti) and c2 the type of the residue at position j (sj). We assume that this effective energy depends only on the relative positions of the residues along the sequence (i − j), and not on the precise positions i and j. The total free energy of a given sequence S in a structure C, according to this potential, is computed by summing ΔW̃ts(ti,sj) over all pairs of positions i and j in S that satisfy the condition |i − j| ≤ FLOC, where FLOC is an adjustable parameter taken here equal to 2. This energy function is similar to previously described backbone torsion potentials (16,22,23,36). We also compute all other n = 2 coupling terms (except ΔW̃ss(si,sj), which depends only on the sequence), i.e., ΔW̃as(ai,sj), ΔW̃at(ai,tj), ΔW̃aa(ai,aj), and ΔW̃tt(ti,tj). Note that when c1 and c2 correspond to the same structure or sequence descriptor, the condition |i − j| ≤ FLOC becomes 1 ≤ i − j ≤ FLOC.
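The summation over pairs of positions can be sketched in Python as below. The lookup table keyed by (relative position j − i, torsion domain, residue type), the default of 0 for missing entries, and the function name `local_ts_energy` are assumptions of this sketch; the table values are invented.

```python
from typing import Dict, Sequence, Tuple

Key = Tuple[int, str, str]  # (offset j - i, torsion domain at i, residue type at j)


def local_ts_energy(torsions: Sequence[str], residues: Sequence[str],
                    table: Dict[Key, float], f_loc: int = 2) -> float:
    """Sum the torsion/sequence coupling over all pairs with |i - j| <= f_loc."""
    if len(torsions) != len(residues):
        raise ValueError("torsions and residues must have the same length")
    n = len(residues)
    total = 0.0
    for i in range(n):
        for j in range(max(0, i - f_loc), min(n, i + f_loc + 1)):
            # The term depends only on the relative position, not on i and j.
            total += table.get((j - i, torsions[i], residues[j]), 0.0)
    return total


toy_table = {(0, "A", "G"): -1.0, (1, "A", "G"): -0.5}
energy = local_ts_energy(["A", "A"], ["G", "G"], toy_table)
```

Here the pair i = j is included because the two descriptors differ (a torsion domain and a residue type); for two identical descriptors the loop would run over 1 ≤ i − j ≤ FLOC instead.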
We would like to stress that summing the energy contributions of all pairs (c1,c2) yields only an approximation of the total free energy of a protein. Indeed, the contributions ΔW̃ts(ti,sj) and ΔW̃ts(ti,sk) are in general not independent. Moreover, using simultaneously ΔW̃ts(ti,sj) and ΔW̃as(ai,sj) can be advantageous but introduces some redundancy since the solvent accessibility of a residue is related to its backbone conformation. To overcome these dependencies, we must add the n = 3 coupling terms ΔW̃tts(ti,tj,sk), ΔW̃tss(ti,sj,sk), ΔW̃ttt(ti,tj,tk), ΔW̃aas(ai,aj,sk), ΔW̃ass(ai,sj,sk), ΔW̃aaa(ai,aj,ak), ΔW̃aat(ai,aj,tk), ΔW̃att(ai,tj,tk), and ΔW̃ats(ai,tj,sk). They are defined on the basis of Eq. 5 so as to be additive to, and exclusive of, the lower order coupling terms (Eq. 4). The interdependence of the different n = 3 coupling terms can, in turn, be corrected by the use of n = 4 coupling terms.
We assessed the predictive power of the different n = (2, 3, 4) coupling terms, independently and in combination, on the three groups of decoy sets described in Methods. The performance measures obtained are given in Table 1 for the basic potentials ΔW̃ts and ΔW̃as and for the most efficient linear combination of the local coupling terms, named ΔW'LOC:
[Equation 10]
[...] ΔW'LOC is quite impressive: each performance measure indicates a markedly better discrimination of the correct sequence-structure association than with the basic potentials. The only exception is [...], which slightly decreases in the Dseq set.
ΔW'LOC includes almost all n = 2 and n = 3 coupling terms. The only exception is ΔW̃aa, which systematically drags down the predictive power when included in a combination of coupling terms. This follows from the fact that ΔW̃aa strongly favors situations in which residues close to each other in the sequence have similar solvent accessibilities, and therefore awards very negative energies to (partially) unfolded proteins. The best combination also incorporates several n = 4 coupling terms: ΔW̃ttts, ΔW̃aaas, ΔW̃attt, ΔW̃aatt, and ΔW̃aaat. The other n = 4 coupling terms have a negative impact on the predictive power. This is most probably due to the limited size of the data set, which does not allow one to compute precisely enough the probabilities of observing simultaneously four sequence and/or structure descriptors. Also note that there are 20 types of sequence elements (s), but only 7 torsion (t) and 5 accessibility (a) domains. Coupling terms involving several sequence elements, such as ΔW̃tsss or ΔW̃asss, do not appear in ΔW'LOC as they require larger data sets to extract reliable statistics.
In principle, our derivation scheme does not give any reason to under- or overweight some coupling terms with respect to others. However, some contributions may be less relevant, or not relevant at all, and should therefore not be included, for example because of the limited size of the data set (e.g., ΔW̃tsss, ΔW̃asss,...), the overstabilization of the unfolded state (e.g., ΔW̃aa), or the uselessness of purely sequence terms (e.g., ΔW̃ss). Furthermore, sequence-independent terms can be expected to yield interesting results when discriminating among nonprotein-like structures, and to be quite useless in applications such as threading experiments. Testing the potentials on decoy sets can reasonably be considered an intermediate case, which probably explains why we observed that underweighting these contributions by a [...] factor, in Eq. 10, is advantageous in terms of predictive power.
Distance potentials and couplings
A very popular category of statistical potentials is derived from the spatial distance distribution between residue types (e.g., 16,17,29,37). They are complementary to the local potentials presented above. It has been previously noted that such potentials do not represent the "true" energy of interaction between two residues (or two atoms) as if they were in a vacuum, but rather an effective energy including the influence of a mean protein and solvent environment (38,39). As a consequence, these potentials may depend on some characteristics of the proteins from which they are derived, such as their size (40–42) or their content in secondary structures (14,42–44). The idea of being more precise on the definition of the environment that is actually "felt" by the two interacting residues is not new (16–18), and can have a positive impact on the performance of the potentials. We show that the formalism presented in this article can be applied to define residue pair distance potentials that take appropriately into account the influence of the specific environment in which the two residues are located. This environment is here represented by backbone conformations and solvent accessibilities.
The n = 2 coupling term ΔW̃sd(si,dij) is a "one-body" distance potential that reflects the preferences of each type of residue to be located more or less close to other residues, whatever their type, and is therefore dominated by the hydrophobic effect. For residues close to each other along the sequence, i.e., |i − j| ≤ FDIS (taken here equal to 8), the frequencies and potentials are computed separately, whereas they are merged in a single class when |i − j| > FDIS. The total contribution to the free energy of a given sequence S in a structure C is computed by summing ΔW̃sd(si,dij) over all pairs of positions i and j in S that satisfy the condition |i − j| > 1.
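A possible organization of this sum is sketched below in Python. Several elements are assumptions of the sketch: "computed separately" is read as one class per sequence separation up to FDIS, the distance binning function `bin_distance` is a hypothetical helper (the distance bins are not given in the surviving text), the sum runs over ordered pairs so that each residue receives its own one-body term, and missing table entries count as 0.

```python
from typing import Callable, Dict, Sequence, Tuple

Key = Tuple[int, str, int]  # (separation class, residue type at i, distance bin)


def separation_class(i: int, j: int, f_dis: int = 8) -> int:
    """One class per separation up to f_dis, one merged class beyond it."""
    separation = abs(i - j)
    return separation if separation <= f_dis else f_dis + 1


def one_body_distance_energy(residues: Sequence[str],
                             distances: Sequence[Sequence[float]],
                             table: Dict[Key, float],
                             bin_distance: Callable[[float], int],
                             f_dis: int = 8) -> float:
    """Sum the one-body distance term over ordered pairs with |i - j| > 1."""
    n = len(residues)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if abs(i - j) > 1:
                key = (separation_class(i, j, f_dis), residues[i],
                       bin_distance(distances[i][j]))
                total += table.get(key, 0.0)
    return total


# Toy example: four alanines, all pairwise centroid distances set to 5 Å.
toy_distances = [[5.0] * 4 for _ in range(4)]
toy_table = {(2, "A", 0): -1.0, (3, "A", 0): -0.5}
energy = one_body_distance_energy("AAAA", toy_distances, toy_table,
                                  bin_distance=lambda d: 0 if d < 8.0 else 1)
```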
On its own, ΔW̃sds(si,dij,sj) is a two-body distance potential that excludes the one-body contributions reflecting the individual preferences of the two amino acids si and sj. Such a potential has been presented previously and shown to describe more accurately the electrostatic interactions (42). In this case, by reason of symmetry, the condition |i − j| > 1 becomes i − j > 1 when computing the total free energy of a protein. Coupling ΔW̃sd(si,dij) with ΔW̃sds(si,dij,sj) yields the common distance potential given in Eq. 9.
In a similar way, it is possible to define sequence-independent distance potentials involving the backbone torsion angles, ΔW̃td and ΔW̃tdt, or the solvent accessibilities, ΔW̃ad and ΔW̃ada. The concomitant use of these three types of potentials is hazardous since the backbone conformation and solvent accessibility of a residue are clearly dependent on its amino acid type, and some contributions are therefore overcounted. To deal with this problem, we have to define higher order coupling terms. The highest order coupling term is in this case the n = 7 term ΔW̃atsdats(ai,ti,si,dij,aj,tj,sj). Considering all the lower level coupling terms would lead to a very large number of energetic functions and hamper any intuitive understanding of their significance. Among these, we choose to disregard all distance-independent terms, as they are redundant with the local potentials defined in the previous section for |i − j| ≤ FLOC, and the contributions for other i and j may reasonably be assumed to be negligible. Moreover, to avoid overloading the notations, two-body asymmetrical terms, such as ΔW̃ads(ai,dij,sj) or ΔW̃asds(ai,si,dij,sj), are not considered independently but grouped with the closest symmetrical coupling term, here ΔW̃asdas(ai,si,dij,aj,sj). We thus redefine ΔW̃asdas(ai,si,dij,aj,sj) as the sum of the original ΔW̃asdas(ai,si,dij,aj,sj) term and all the lower order asymmetrical two-body terms. Note finally that, given the limited size of the database, ΔW̃atsd(ai,ti,si,dij) and ΔW̃atsdats(ai,ti,si,dij,aj,tj,sj) are computed as contact potentials, where dij takes only two possible values: lower or larger than 8 Å.
Overall, according to our performance test on the three groups of decoy sets, the best combination of distance potentials and coupling terms is ΔW'DIST, defined as
[Equation 11]
[...] ΔW̃ad and ΔW̃asd are only included for short-range interactions (SR), that is, when the considered residues are separated by no more than FDIS positions along the sequence. As shown in Table 1, the improvement of the predictive power with respect to the basic distance potential ΔW̃sd + ΔW̃sds is substantial in the two groups of decoy sets based on structural modifications (the first and second groups). However, it appears that ΔW̃sd + ΔW̃sds performs slightly better than ΔW'DIST in the third group, Dseq. Since these decoys are obtained by modifications of the sequence, the sequence-independent terms (ΔW̃td, ΔW̃tdt, ΔW̃ad,...) are not taken into account in the evaluation of the energies, which may limit the necessity of using coupling terms such as ΔW̃tsd, ΔW̃asd, or ΔW̃tsdts.
Interestingly, as with ΔW'LOC, almost all coupling terms are included in the best performing combination, ΔW'DIST. This provides strong support for the legitimacy of our derivation procedure. The only exceptions are ΔW̃ada and ΔW̃atsdats. The former strongly favors situations where residues close in space have similar solvent accessibilities, which is a characteristic of both folded and unfolded states. The relevance of the latter is obviously compromised by the limited size of the data set. On the other hand, the terms ΔW̃ad and ΔW̃asd are only included for short-range interactions. Indeed, for long-range interactions, the separation in sequence is not explicitly taken into account, and ΔW̃ad merely reflects a trivial correlation: residues with a higher solvent accessibility have fewer contacts with other residues. For those residue pairs that do not benefit from the ΔW̃ad term, it also appears that ΔW̃asd is unnecessary, as its aim is to uncouple ΔW̃ad and ΔW̃sd.
Combination of local and distance potentials
The combination of the best performing local and distance potentials, ΔW'LOC and ΔW'DIST, improves on their individual scores, as seen in Table 1. We did not address explicitly the issue of possible redundancies between these two types of potentials. However, in itself, the use of distance coupling terms significantly limits this problem. For example, a relatively strong correlation is observed between ΔW̃as and ΔW̃sd, but ΔW̃as and (ΔW̃sd + ΔW̃asd) are only weakly correlated. Overall, the performance of the combination ΔW'LOC + ΔW'DIST is very impressive, as exemplified by average Z-scores of −5.25, −2.65, and −2.74 on the three groups of decoy sets.
Comparison with other statistical potentials
A large number of knowledge-based potentials reflecting the preferences of the different amino acids (or of short stretches of amino acids) to adopt particular local conformations (16,22,23,36), to be more or less accessible to the solvent (16,17,45,46), or to be separated by a given spatial distance (16,17,29,30,37) have been described in the literature. However, to our knowledge, our approach is the first to integrate all these different types of contributions in a single energetic function while taking special care of their couplings. Moreover, on the local level, the nonadditivity of contributions related to pairs of residues, such as ΔW̃ts(ti,sj) and ΔW̃ts(ti,sk), is taken care of by the use of higher order coupling terms (ΔW̃tss(ti,sj,sk), ΔW̃tts(ti,tj,sk),...).
Among the local potentials based on backbone torsion angles that have been described earlier, let us cite the residue-to-torsion (22) and the torsion-to-residue (16) potentials, developed by one of us. As seen in Table 2, (a) and (b), both potentials can be expressed as simple combinations of the coupling terms ΔW̃ts, ΔW̃tss, and ΔW̃tts. Miyazawa and Jernigan designed a more complex torsion potential (23), based on a reference state that is quite different from ours and on different values of the structural descriptors. A rigorous comparison of the two approaches is therefore difficult. However, a common feature is the expression of the energetic function as a sum of basic potentials and of higher order coupling terms defined so as to exclude the more basic contributions. In this sense, their potential can be compared to the combination of coupling terms ΔW̃ given in Table 2 (c).
[...] ΔW̃ coupling terms as described in Eq. 9, sometimes with a different reference state. In addition, more sophisticated distance potentials that take into account the solvent accessibilities or the conformations of the residues also appear as particular cases of our formalism. A first example is the "Cµ-Cµ core/surface" potential of Kocher et al. (16 [...] ΔW̃as (with FLOC = 0), and a pair term based on the spatial distance separating two residues in specific environments and designed to avoid redundancy with the environment term. This energy function is equivalent to the combination given in Table 2 (e), where ΔW̃ass and ΔW̃asas are distance-independent contributions included in the distance potential, which do not correspond to local potentials since the sequence separation i − j is not taken into account. Furthermore, Zhang and Kim estimated contact energies between residue pairs, depending on the conformations of their main chain (ERCE: Environment-Dependent Residue Contact Energies) (18 [...] α-helix, β-sheet, and turn) to define an extended 60-residue alphabet. This approach can easily be translated into a combination of ΔW̃ coupling terms, as described in Table 2 (f). Finally, several authors derived distance potentials from data sets containing only α- or only β-proteins (14 [...] α- or β-proteins), becomes −kT ln(P(si,sj,dij|ti,tj)/(P(si,sj|ti,tj)P(dij|ti,tj))), where (ti,tj) refers to the global secondary structure content of the protein. With such a definition, this distance potential is equivalent to the combination given in Table 2 (g).
Regarding the increase in performance provided by our new derivation scheme, the results summarized in Table 1 are unambiguous: ΔW'LOC, ΔW'DIST, and especially ΔW'LOC + ΔW'DIST are superior to common distance and local potentials such as ΔW̃sd + ΔW̃sds, ΔW̃as, and ΔW̃ts. This comparison can be considered fair, given that all these potentials are derived from the same data set, using the same type of reference state, structural descriptors, and adjustable parameters. Another way to assess the performance of the potentials is to look at previously published tests on the same groups of decoy sets. This comparison nevertheless has the drawback that the effects of derivation scheme, reference state, and other parameters are mixed.
Several potentials have been tested on the first group of decoy sets (30,47); the results are summarized in Table 3. According to this test, our distance potential ΔW'DIST is clearly superior to every other residue-based distance or contact potential given in Table 3, as indicated by all available measures except S1 in the case of TE-13 and DFIRE-B. This difference is even more manifest when we consider the combination ΔW'LOC + ΔW'DIST. Table 3 also suggests that atom-based potentials perform on average better than potentials considering only one interaction center per residue. Even so, the residue-based combination ΔW'DIST appears markedly more efficient than the RAPDF and KBP potentials. The good performance of the potentials DFIRE-A and DFIRE-B seems to result from the use of a particular reference state, defined in such a way that the effective energy associated with a pair of atoms (or residues) tends to zero when the distance separating them approaches 15 Å (47). Let us also note that another statistical potential, based on a detailed (atomic) representation of protein structures and designed to describe H-bonds as precisely as possible, has been recently tested on the second group of decoy sets (19). The results were slightly better than with our potentials (⟨Z⟩ = −3.34 and S1 = 92%, whereas ⟨Z⟩ = −2.65 and S1 = 92% are obtained with ΔW'LOC + ΔW'DIST). It is not surprising that better predictive capabilities can be obtained with potentials based on a more detailed structural representation, but it should be stressed that a higher level of detail inevitably induces drastic limitations of the application possibilities.
| DISCUSSION |
|---|
Our derivation scheme is mainly based on the decomposition of a complex potential into a sum of lower order terms, through the expression of products of probabilities. This decomposition gives the possibility to analyze independently each contribution and clarify its significance and importance. It also offers several valuable advantages in terms of predictive power. First of all, according to the choice of the sequence/structure descriptors, the decomposition may be absolutely necessary to avoid overcounting certain contributions. To clarify this point, let us focus on the correlations between one residue type, s, and two backbone conformations, t. The correct contribution to the total free energy of a protein is given by Eq. 8, in this particular case: ΔWtts(C,S) = Σi,j ΔW̃ts(ti,sj) + Σi,j ΔW̃tt(ti,tj) + Σi,j,k ΔW̃tts(ti,tj,sk). In contrast, if the potential function ΔWtts(ti,tj,sk) were not decomposed and were summed over all triplets of positions (i,j,k), each ΔW̃ts and ΔW̃tt contribution would be counted several times.
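The overcounting can be made concrete with a short Python sketch that counts how often each pair contribution would appear if an undecomposed three-descriptor potential were summed over triplets of positions. The way triplets are enumerated here (two torsion positions i < j and one residue position k, all within a window of FLOC positions) is an assumption chosen for illustration.

```python
from collections import Counter
from itertools import product


def pair_multiplicity(n_positions: int, f_loc: int = 2) -> Counter:
    """Count, for each pair term, the number of triplets (i, j, k) containing it."""
    counts: Counter = Counter()
    for i, j, k in product(range(n_positions), repeat=3):
        if i < j and max(i, j, k) - min(i, j, k) <= f_loc:
            counts[("ts", i, k)] += 1  # torsion at i coupled with residue at k
            counts[("ts", j, k)] += 1  # torsion at j coupled with residue at k
            counts[("tt", i, j)] += 1  # the two torsions coupled with each other
    return counts


counts = pair_multiplicity(n_positions=5, f_loc=2)
```

In this toy setting the coupling between the torsion and the residue type at position 2 is counted four times, and so is the torsion-torsion coupling between positions 1 and 2, whereas the decomposed form counts each of them once.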
Secondly, the decomposition we propose allows one to deal much more efficiently with the limited size of the database since the correction for sparse data (see Methods) is applied to each coupling term rather than to the whole energy function. For example, the distance potential ΔWatsdats(ai,ti,si,dij,aj,tj,sj) can be expressed as a sum of many n-coupling terms, ranging from n = 2 to n = 7, or computed directly from Eq. 3. If the database is large enough, these two possibilities are equivalent. But if the number of observations of a given combination of values of (ai,ti,si,dij,aj,tj,sj) is too small, the correction for sparse data will make ΔW̃atsdats(ai,ti,si,dij,aj,tj,sj) tend to zero, but not ΔWatsdats(ai,ti,si,dij,aj,tj,sj) unless it is computed directly through Eq. 3. In the latter case, the fact that the database is too small to reliably extract the higher order couplings actually leads to a substantial loss of valuable information about the lower order contributions. Finally, the decomposition makes it possible to modulate the reference state, by excluding some contributions (such as ΔW̃aa, ΔW̃ada,...) that do not appear to be relevant and decrease the overall predictive power.
The comparison with other potentials described in the literature underlines the generality of our approach, for previous potentials based on several sequence or structure descriptors can be expressed as particular cases of our formalism. This comparison also shows that we significantly raised the expectations regarding the predictive power of residue-based potentials. Indeed, our energetic functions even outperform some potentials that are based on a more detailed representation of protein structures at the atomic level.
Several improvements may still be envisaged. Indeed, our derivation scheme can easily be adapted to develop energy functions dealing with a more detailed representation of protein structures, or based on another, possibly more relevant, reference state. It is also straightforward to include additional structural descriptors, reflecting, for example, the relative orientations of interacting side chains or the relative positions of triplets of residues.
| ACKNOWLEDGEMENTS |
|---|
M.R. is research director at the Belgian National Fund for Scientific Research.
Submitted on December 9, 2005; accepted for publication February 28, 2006.
| REFERENCES |
|---|
2. Halgren, T. A. 1995. Potential energy functions. Curr. Opin. Struct. Biol. 5:205210.[CrossRef][Medline]
3. Mackerell, A. D., Jr. 2004. Empirical force fields for biological macromolecules: overview and issues. J. Comput. Chem. 25:15841604.[CrossRef][Medline]
4. G
, N. 1983. Theoretical studies of protein folding. Annu. Rev. Biophys. Bioeng. 12:183210.[CrossRef][Medline]
5. Galzitskaya, O. V., and A. V. Finkelstein. 1999. A theoretical search for folding/unfolding nuclei in three-dimensional protein structures. Proc. Natl. Acad. Sci. USA. 96:1129911304.
6. Alm, E., and D. Baker. 1999. Prediction of protein-folding mechanisms from free-energy landscapes derived from native structures. Proc. Natl. Acad. Sci. USA. 96:1130511310.
7. Munoz, V., and W. A. Eaton. 1999. A simple model for calculating the kinetics of protein folding from three-dimensional structures. Proc. Natl. Acad. Sci. USA. 96:1131111316.
8. Wodak, S., and M. Rooman. 1993. Generating and testing protein folds. Curr. Opin. Struct. Biol. 3:249259.
9. Sippl, M. J. 1995. Knowledge-based potentials for proteins. Curr. Opin. Struct. Biol. 5:229235.[CrossRef][Medline]
10. Jernigan, R. L., and I. Bahar. 1996. Structure-derived potentials and protein simulations. Curr. Opin. Struct. Biol. 6:195209.[CrossRef][Medline]
11. Moult, J. 1997. Comparison of database potentials and molecular mechanics force fields. Curr. Opin. Struct. Biol. 7:194199.[CrossRef][Medline]
12. Russ, W. P., and R. Ranganathan. 2002. Knowledge-based potential functions in protein design. Curr. Opin. Struct. Biol. 12:447452.[CrossRef][Medline]
13. Miyazawa, S., and R. L. Jernigan. 1996. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256:623644.[CrossRef][Medline]
14. Furuichi, E., and P. Koehl. 1998. Influence of protein structure databases on the predictive power of statistical pair potentials. Proteins. 31:139149.[CrossRef][Medline]
15. Melo, F., R. Sanchez, and D. Sali. 2002. Statistical potentials for fold assessment. Protein Sci. 11:430448.
16. Kocher, J.-P., M. J. Rooman, and S. J. Wodak. 1994. Factors influencing the ability of knowledge-based potentials to identify native sequence-structure matches. J. Mol. Biol. 235:1598–1613.
17. Simons, K. T., C. Kooperberg, E. Huang, and D. Baker. 1997. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268:209–225.
18. Zhang, C., and S.-H. Kim. 2000. Environment-dependent residue contact energies for proteins. Proc. Natl. Acad. Sci. USA. 97:2550–2555.
19. Kortemme, T., A. V. Morozov, and D. Baker. 2003. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J. Mol. Biol. 326:1239–1259.
20. Buchete, N. V., J. E. Straub, and D. Thirumalai. 2004. Orientational potentials extracted from protein structures improve native fold recognition. Protein Sci. 13:862–874.
21. Miyazawa, S., and R. L. Jernigan. 2005. How effective for fold recognition is a potential of mean force that includes relative orientations between contacting residues in proteins? J. Chem. Phys. 122:24901–24918.
22. Rooman, M. J., J.-P. A. Kocher, and S. J. Wodak. 1991. Prediction of backbone conformation based on seven structure assignments. Influence of local interactions. J. Mol. Biol. 221:961–979.
23. Miyazawa, S., and R. L. Jernigan. 1999. Evaluation of short-range interactions as secondary structure energies for protein fold and sequence recognition. Proteins. 36:347–356.
24. Ramachandran, G., and V. Sasisekharan. 1968. Conformation of peptides and proteins. Adv. Protein Chem. 23:283–438.
25. Kabsch, W., and C. Sander. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 22:2577–2637.
26. Rose, G. D., A. R. Geselowitz, G. J. Lesser, R. H. Lee, and M. H. Zehfus. 1985. Hydrophobicity of amino acid residues in globular proteins. Science. 229:834–838.
27. Wang, G., and R. Dunbrack. 2003. PISCES: a protein sequence culling server. Bioinformatics. 19:1589–1591.
28. Henrick, K., and J. M. Thornton. 1998. PQS: a protein quaternary structure file server. Trends Biochem. Sci. 23:358–361.
29. Sippl, M. J. 1990. Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 213:859–883.
30. Tobi, D., and R. Elber. 2000. Distance-dependent, pair potential for protein folding: results from linear optimization. Proteins. 41:40–46.
31. Samudrala, R., and M. Levitt. 2000. DecoysRUs: a database of incorrect conformations to improve protein structure prediction. Protein Sci. 9:1399–1401.
32. Park, B., and M. Levitt. 1996. Energy functions that discriminate X-ray and near native folds from well-constructed decoys. J. Mol. Biol. 258:367–392.
33. Simons, K. T., I. Ruczinski, C. Kooperberg, B. A. Fox, C. Bystroff, and D. Baker. 1999. Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins. 34:82–95.
34. Keasar, C., and M. Levitt. 2003. A novel approach to decoy set generation: designing a physical energy function having local minima with native structure characteristics. J. Mol. Biol. 329:159–174.
35. Tsai, J., R. Bonneau, A. V. Morozov, B. Kuhlman, C. A. Rohl, and D. Baker. 2003. An improved protein decoy set for testing energy functions for protein structure prediction. Proteins. 53:76–87.
36. Kang, H. S., A. Kurochkina, and B. Lee. 1993. Estimation and use of protein backbone angle probabilities. J. Mol. Biol. 229:448–460.
37. Bahar, I., and R. L. Jernigan. 1997. Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J. Mol. Biol. 266:195–214.
38. Zhang, L., and J. Skolnick. 1998. How do potentials derived from structural databases relate to "true" potentials? Protein Sci. 7:1201–1207.
39. Shan, Y., and H.-X. Zhou. 2000. Correspondence of potentials of mean force in proteins and in liquids. J. Chem. Phys. 113:4794–4798.
40. Thomas, P. D., and K. A. Dill. 1996. Statistical potentials extracted from protein structures: how accurate are they? J. Mol. Biol. 257:457–469.
41. Dehouck, Y., D. Gilis, and M. Rooman. 2004. Database-derived potentials dependent on protein size for in silico folding and design. Biophys. J. 87:171–181.
42. Rooman, M., and D. Gilis. 1998. Different derivations of knowledge-based potentials and analysis of their robustness and context-dependent predictive power. Eur. J. Biochem. 254:135–143.
43. Godzik, A., A. Kolinski, and J. Skolnick. 1995. Are proteins ideal mixtures of amino acids? Analysis of energy parameter sets. Protein Sci. 4:2107–2117.
44. Zhang, C., S. Liu, H. Zhou, and Y. Zhou. 2004. The dependence of all-atom statistical potentials on structural training database. Biophys. J. 86:3349–3358.
45. Bowie, J. U., R. Luthy, and D. Eisenberg. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science. 253:164–170.
46. Summa, C. M., M. Levitt, and W. F. DeGrado. 2005. An atomic environment potential for use in protein structure prediction. J. Mol. Biol. 352:986–1001.