help button home button Biophys. J.
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Ortiz, A. R.
Right arrow Articles by Skolnick, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Ortiz, A. R.
Right arrow Articles by Skolnick, J.

Biophys J, October 2000, p. 1787-1799, Vol. 79, No. 4

Sequence Evolution and the Mechanism of Protein Folding

Angel R. Ortiz and Jeffrey Skolnick

Department of Molecular Biology, TPC-5, The Scripps Research Institute, La Jolla, California 92037 USA




    ABSTRACT
TOP
ABSTRACT
INTRODUCTION
METHODS
RESULTS
CONCLUDING REMARKS
REFERENCES

The impact on protein evolution of the physical laws that govern folding remains obscure. Here, by analyzing in silico-evolved sequences subjected to evolutionary pressure for fast folding, it is shown that: First, a subset of residues in the thermodynamic folding nucleus is mainly responsible for modulating the protein folding rate. Second and most important, the protein topology itself is of paramount importance in determining the location of these residues in the structure. Further stabilization of the interactions in this nucleus leads to fast folding sequences. Third, these nucleation points restrict the sequence space available to the protein during evolution. Correlated mutations between positions around these hot spots arise in a statistically significant manner, and most involve contacting residues. When a similar analysis is carried out on real proteins, qualitatively similar results are obtained.



    INTRODUCTION
TOP
ABSTRACT
INTRODUCTION
METHODS
RESULTS
CONCLUDING REMARKS
REFERENCES

Prediction of protein structure from sequence is widely recognized as one of the most important unsolved problems in molecular biology, and it has become more pressing with the current blossoming of genome sequencing projects (Koonin, 1997, Koonin, et al., 1998). Many of the current approaches to protein structure prediction rely on relating sequence patterns to structural ones, particularly with the availability of multiple sequence alignments (MSAs) for many sequences of interest. But finding sequence-structure relationships requires discrimination from functional or even adventitious patterns in current databases that may not have unique correspondence with structure. Signatures related to protein topology are the needles to be found in the haystack of alignments. Also, because topology is the outcome of the folding mechanism, relationships among sequence space, the folding process, and final topology need to be established. In summary, a better understanding of the constraints imposed during protein evolution by the physical laws that govern folding is required.

The acquisition of such understanding is a formidable task. To make the problem tractable, simplified lattice models of protein-like chains have been developed by a number of authors (Shakhnovich, 1996). In particular, Shakhnovich and coworkers (Mirny, et al., 1998) recently generated a database of fast and slow folding models of homologous proteins (48-mers on a cubic lattice, see Methods and Fig. 2) by using evolution-like pressure for fast folding. Here, after analyzing these fast and slow folding model proteins and after doing similar calculations on real proteins, we report that structurally significant patterns can appear in MSAs of evolutionary related sequences. We observe that correlated mutations tend to occur between contacting residues, and that they seem to arise as a consequence of the evolutionary constraint to fold sufficiently fast to the global energy minimum. They tend to be located closer than expected by chance to the residues forming the protein folding nucleus. These correlations could also be topology dependent, because we show that the topology itself determines which residues form part of the folding nucleus.



    METHODS
TOP
ABSTRACT
INTRODUCTION
METHODS
RESULTS
CONCLUDING REMARKS
REFERENCES

Database of model proteins

A database formed by in silico evolutionary related sequences of slow and fast folding model proteins was generously provided to us by Dr. E. Shakhnovich (personal communication). This database consists of ~1000 sequences of model proteins created by folding randomly mutated sequences of 48-mers on a cubic lattice subjected to evolutionary pressure for fast folding (Mirny, et al., 1998). The conformation to which all these sequences fold is shown in Fig. 2.

Multivariate analysis of the sequences of model proteins

Factor Analysis (Reyment and Joreskog, 1996) (FA) has been used to derive statistical models that explain the differences between fast and slow folding sequences. FA tries to describe the covariance relationships in a data matrix in terms of underlying, but unobservable, random quantities known as factors. The factors are new variables, or directions in space, built from linear combinations of the original variables in such a way that they account approximately for the same amount of information of the original variables in the data matrix. This is achieved by grouping variables by their correlations. Mathematically speaking, FA seeks finding an underlying orthogonal factor model of an original X matrix of the form
<B><UP>X</UP></B>=<B><UP>LF</UP></B>+<B><UP>E</UP></B>. (1)
In Eq. 1, X is a matrix derived from the MSA and having n rows, corresponding to n positions in the alignment; and m columns, corresponding to m sequences being aligned. On the right-hand side, L is the loadings matrix and F the scores matrix, whereas E is the residual (noise) matrix. Each element lip of the matrix L can be considered as the square root of the weight (or importance) of the variable i on factor p. That is, it is the contribution of variable (i.e., the position) i on building the new direction p. In contrast each component score fpj from the matrix F can be considered the projection of the object (i.e., the sequence) j on factor p, or, in other words, the coordinate of object j in the new direction given by factor p.

The existence of this model implies a special covariance structure for X. It can be shown that, under the assumption that F and E are orthogonal and independent, this covariance structure XX' can be expressed as
<B><UP>XX′</UP></B>=(<B><UP>LF</UP></B>+<B><UP>E</UP></B>)(<B><UP>LF</UP></B>+<B><UP>E</UP></B>)′, (2)

<B><UP>XX′</UP></B>=<B><UP>S</UP></B>=<B><UP>LL′</UP></B>+<B><UP>U</UP></B>. (3)
Here X' is the transpose of X, and U = EE' is the covariance of the noise matrix, which, under the factor model conditions, contains the specific (not interrelated and therefore not explained by the model) variances of the objects, assumed to be small: U congruent  0. The solution to the factor model can then be obtained by diagonalization of the covariance structure XX'. This is known as the principal component solution to the factor model (Johnson and Wichern, 1992). By using the Eckart-Young theorem (Reyment and Joreskog, 1996), S can be expressed as
<B><UP>S</UP></B>=<B><UP>P&Lgr;P′</UP></B>=<B><UP>P&Lgr;</UP></B><SUP>1/2</SUP><B><UP>&Lgr;</UP></B><SUP>1/2</SUP><B><UP>′P′</UP></B>=(<B><UP>P&Lgr;</UP></B><SUP>1/2</SUP>)(<B><UP>P&Lgr;</UP></B><SUP>1/2</SUP>)′. (4)
Thus, by comparing Eqs. 3 and 4, we can obtain the loadings as
<B><UP>L</UP></B>=<B><UP>P&Lgr;</UP></B><SUP><B><UP>1/2</UP></B></SUP>. (5)
Equation 5 indicates that the loadings are no more than the scaled eigenvectors of the symmetric matrix S. If principal components analysis is used as a solution of the factor model using Eq. 5 to obtain the loadings, then it is customary to generate the scores by an ordinary (unweighted) least squares procedure (Johnson and Wichern, 1992). Starting from Eq. 1, and assuming E congruent  0, then
<B><UP>L′X</UP></B>=<B><UP>L′LF</UP></B>, (6)

<B><UP>F</UP></B>=(<B><UP>L′L</UP></B>)<SUP><UP>−</UP>1</SUP><B><UP>L′X</UP></B>. (7)
Using Eqs. 3, 4, and 5, Eq. 7 can be expressed as
<B><UP>F</UP></B>=(<B><UP>&Lgr;</UP></B><SUP><B><UP>1/2</UP></B></SUP><B><UP>′P′P&Lgr;</UP></B><SUP>1/2</SUP>)<SUP><UP>−</UP>1</SUP>(<B><UP>P&Lgr;</UP></B><SUP>1/2</SUP>)′<B><UP>X</UP></B>, (8)

<B><UP>F</UP></B>=<B><UP>&Lgr;</UP></B><SUP><UP>−</UP>1/2</SUP><B><UP>P′X</UP></B>. (9)
Thus, the factor model can be solved using Eqs. 5 and 9 after diagonalizing the covariance structure XX'. After a lower dimensionality space of p factors explaining most of the variance of the original X-matrix is found, in FA, one proceeds to rotate the factors until a simpler structure is achieved (Reyment and Joreskog, 1996). This is done to obtain a better interpretation of the factors, and it is possible because Eq. 3 is insensitive to any orthogonal transformation of the loadings,
<B><UP>XX′</UP></B>=<B><UP>S</UP></B>=<B><UP>LL′</UP></B>+<B><UP>U</UP></B>=<B><UP>LTT′L′</UP></B>+<B><UP>U</UP></B>

<UP>where <B>TT′</B></UP>=<B><UP>I</UP></B>. (10)
The ideal rotation matrix T to apply to L would generate a new loadings matrix so that each variable has the simplest interpretation in the new space. This would be achieved by having in the rotated loadings matrix L* as many zero elements as possible. In this way, a variable would not depend on all common factors but only on a small part of them. This is equivalent to maximizing the total variance of the loading elements in the p selected factors. A popular way to obtain such a multidimensional rotation is by means of the varimax rotation, where an iterative pairwise bidimensional rotation method due to Kaiser (1958) is used to find an orthogonal optimal rotation matrix G that maximizes a measure of the variances of the loadings known as the Harman function, phi :
&phgr;=<LIM><OP>∑</OP><LL><UP>j=1</UP></LL><UL><UP>n</UP></UL></LIM> <LIM><OP>∑</OP><LL><UP>i=1</UP></LL><UL><UP>p</UP></UL></LIM>(d<SUP>2</SUP><SUB><UP>ij</UP></SUB>−<A><AC>d</AC><AC>&cjs1171;</AC></A><SUB><UP>j</UP></SUB>)<SUP>2</SUP>=<LIM><OP>∑</OP><LL><UP>j=1</UP></LL><UL><UP>n</UP></UL></LIM> <LIM><OP>∑</OP><LL><UP>i=1</UP></LL><UL><UP>p</UP></UL></LIM> d<SUP>4</SUP><SUB><UP>ij</UP></SUB>−p<LIM><OP>∑</OP><LL><UP>j=1</UP></LL><UL><UP>n</UP></UL></LIM> <A><AC>d</AC><AC>&cjs1171;</AC></A><SUP>2</SUP><SUB><UP>j</UP></SUB>, (11)
where
d<SUB><UP>ij</UP></SUB>=<FR><NU>l<SUB><UP>ij</UP></SUB></NU><DE>h<SUB><UP>i</UP></SUB></DE></FR> (12)
and
<A><AC>d</AC><AC>&cjs1171;</AC></A><SUB><UP>j</UP></SUB>=p<SUP><UP>−</UP>1</SUP> <LIM><OP>∑</OP><LL><UP>i=1</UP></LL><UL><UP>p</UP></UL></LIM> d<SUP>2</SUP><SUB><UP>ij</UP></SUB>. (13)
Here, lij is the ij element of the loadings matrix L, hi is the square root of the communality of the variable i (that is, the variance explained by the factor model), n equals the total number of variables, and p is the dimensionality of the loadings matrix. The rotation matrix G that maximizes Eq. 11 is used to obtain the rotated loadings by using the transformation,
<B><UP>L*</UP></B>=<B><UP>LG</UP></B>. (14)
The FA of the MSA was carried out using the correlation matrix, instead of the covariance matrix, as follows: The correlation matrix R was derived by first defining an exchange matrix at each sequence position in the alignment where each pair of elements in the position is scored by a modified version of the McLachlan matrix [optimized to maximize the contact prediction ability, (Ortiz and Skolnick, in preparation)], and then computing a Pearson-type correlation coefficient between exchange matrices at any two positions. Thus, the element rij of R is obtained,
r<SUB><UP>ij</UP></SUB>=<FR><NU>1</NU><DE>N<SUP>2</SUP></DE></FR> <LIM><OP>∑</OP><LL><UP>kl</UP></LL></LIM><FR><NU><FENCE>s<SUB><UP>ikl</UP></SUB>−<FENCE>s<SUB><UP>i</UP></SUB></FENCE></FENCE><FENCE>s<SUB><UP>jkl</UP></SUB>−<FENCE>s<SUB><UP>j</UP></SUB></FENCE></FENCE></NU><DE>&sfgr;<SUB><UP>i</UP></SUB>&sfgr;<SUB><UP>j</UP></SUB></DE></FR>. (15)
Where i and j are two different positions in the MSA, and the indices k and l run from 1 to the N of sequences in the family. The parameter sikl (sjkl) is the comparison score of the amino acids of sequences k and l at position i (j) of the alignment. Average values over all aligned sequences at positions i and j are given by < si> and < sj> , whereas sigma i and sigma j correspond to their standard deviation. Afterwards, the PCA solution to the R-mode FA model was obtained by diagonalizating R. The first p = 3 dimensions were scaled and subjected to a varimax rotation to obtain the loadings matrix. We have seen by trial and error that three factors contain enough variance to explore the main differences in the alignments without being anchored in the rotation by the less significant factors so that they provide optimal separation of residues and sequences, and, therefore, the clearest interpretation of the FA model. From the rotated loadings, the scores were computed by ordinary least squares.

Prediction of contacts and correlated mutation analysis

The procedure is based on computing the R matrix from the MSA, as defined above, followed by the sequential application of FA (Reyment and Joreskog, 1996) to eliminate phylogenetic relationships and partial correlation (Johnson and Wichern, 1992) to eliminate indirect effects of intervening or indirect variables. Typically, and for proteins up to about 100 residues, between 5 and 10 contacts are selected. A more detailed account of this methodology for contact prediction can be found in Ortiz, et al. (1999), and an extensive evaluation for real proteins will be given in a subsequent publication. For the proteins studied in this paper, the MSAs were obtained from the HSSP database (Sander and Schneider, 1991) for those proteins present in the protein databank (Bernstein, et al., 1977). As for the CASP3 proteins, a detailed account of the creation of the corresponding MSAs is presented in a separate publication (Ortiz, et al., 1999).

Identification of kinetically hot residues by the GNM

The basic idea behind the Gaussian Network Model (GNM) (Bahar, et al., 1997) is to consider the folded protein as a three-dimensional (3D) elastic network where residues undergo Gaussian fluctuations around their mean positions resulting from harmonic potentials between residues in contact. The Kirchhoff matrix of contacts Gamma  describes the dynamic characteristics of this network. This matrix is the counterpart of the stiffness matrix used in the analysis of elastic bodies, and its ij element is defined as
&Ggr;<SUB><UP>ij</UP></SUB>=<FENCE><AR><R><C>H(r<SUB><UP>c</UP></SUB>−r<SUB><UP>ij</UP></SUB>) </C><C><UP>if</UP> i≠j</C></R><R><C><UP>−</UP><LIM><OP>∑</OP><LL><UP>i</UP>(<UP>≠j</UP>)</LL><UL><UP>N</UP></UL></LIM> &Ggr;<SUB><UP>ij</UP></SUB></C><C><UP>if</UP> i=j</C></R></AR>.</FENCE> (16)
Where N is the total number of residues; rc is the contacting distance of two residues measured in the Calpha positions (a value of rc = 7 was used in this work); rij is the distance between the Calpha positions of the residues i and j; and H(x) is the Heaviside step function, given as H(x) = -1 if x > 0 and H(x) = 0 if x <=  0. The inverse matrix Gamma -1 yields correlations between the thermal fluctuations per residue. Thus, the cross-correlation in the motion of pairs of residues is associated with the kth vibrational mode by the equation,
<FENCE>&Dgr;<B><UP>R</UP></B><SUB><UP>i</UP></SUB>&Dgr;<B><UP>R</UP></B><SUB><UP>j</UP></SUB></FENCE><SUB><UP>k</UP></SUB>=<FENCE>3k<SUB><UP>B</UP></SUB>T/&ggr;</FENCE>&lgr;<SUP><UP>−</UP>1</SUP><SUB><UP>k</UP></SUB>⌊<B><UP>u</UP></B><SUB><UP>k</UP></SUB><UP><B>u</B></UP>′<SUB><UP>k</UP></SUB>⌋<SUB><UP>ij</UP></SUB>. (17)
From the above equation, the mean square fluctuation of residue i vibrating in a given subset of modes can be obtained from
 <FENCE><FENCE>&Dgr;<B><UP>R</UP></B><SUB><UP>i</UP></SUB></FENCE><SUP>2</SUP></FENCE><SUB><UP>k1−k2</UP></SUB>=<FENCE>k<SUB><UP>B</UP></SUB>T/&ggr;</FENCE> <LIM><OP>∑</OP><LL><UP>k=k1</UP></LL><UL><UP>k=k2</UP></UL></LIM> &lgr;<SUP><UP>−</UP>1</SUP><SUB><UP>k</UP></SUB><FENCE><B><UP>u</UP></B><SUB><UP>k</UP></SUB></FENCE><SUP>2</SUP><SUB><UP>i</UP></SUB><FENCE><LIM><OP>∑</OP><LL><UP>k=k1</UP></LL><UL><UP>k=k2</UP></UL></LIM> &lgr;<SUP><UP>−1</UP></SUP><SUB><UP>k</UP></SUB>.</FENCE> (18)
Following Demirel et al. (1998), we study the mean square fluctuations induced by the modes N - 4 <=  k < N. As in their study, we consider "kinetically hot residues" to be those residues in these modes having an average mean square fluctuation above 6N-1.

Interaction energy analysis

Let us denote each one of the N positions of the lattice model by i. In the global minimum conformation, each position has a given contact coordination number nci, with each contact k contributing an energy ES(ncik) for a given sequence. The homology averaged energy of each position in the global minimum is given by
E<SUB><UP>i</UP></SUB>=<FR><NU>1</NU><DE>N <UP>seq</UP> · nc<SUB><UP>i</UP></SUB></DE></FR> <LIM><OP>∑</OP><LL><UP>S</UP></LL></LIM> <LIM><OP>∑</OP><LL><UP>k</UP></LL></LIM> E<SUP><UP>S</UP></SUP><SUB><UP>i</UP></SUB>(nc<SUB><UP>ik</UP></SUB>). (19)
This value can be averaged over a window 2w + 1 to consider the mean homology-averaged energy of the different fragments in the structure,
<A><AC>E</AC><AC>&cjs1171;</AC></A><SUB><UP>i</UP></SUB>=<FR><NU>1</NU><DE>2w+1</DE></FR> <LIM><OP>∑</OP><LL><UP>j=i−w</UP></LL><UL><UP>j=i+w</UP></UL></LIM> E<SUB><UP>j</UP></SUB>. (20)
The excess energy (normalized in its structural context) of each residue is then a measure related to the local frustration of that residue in its structural context,
F<SUB><UP>i</UP></SUB>=(E<SUB><UP>i</UP></SUB>−<A><AC>E</AC><AC>&cjs1171;</AC></A><SUB><UP>i</UP></SUB>)−<FENCE>E<SUB><UP>i</UP></SUB>−<A><AC>E</AC><AC>&cjs1171;</AC></A><SUB><UP>i</UP></SUB></FENCE> (21)
whereas the average stability of the fragment in which the residue is immersed is given by
S<SUB><UP>i</UP></SUB>=<A><AC>E</AC><AC>&cjs1171;</AC></A><SUB><UP>i</UP></SUB>−<FENCE><A><AC>E</AC><AC>&cjs1171;</AC></A><SUB><UP>i</UP></SUB></FENCE>. (22)

Nonparametric regression

The nonparametric regression function m(X) of a response variable Y in measuring a set of variables X is defined as the conditional mean of Y on X, i.e., m(X) = E(Y|X). Thus, for a given sample of design variables {Xi}i=1n, the associated response variables {Yi}i=1n are of the form:
Y<SUB><UP>i</UP></SUB>=m(<B><UP>X</UP></B><SUB><UP>i</UP></SUB>)+ϵ<SUB><UP>i</UP></SUB> <UP>with</UP> i=1,…, n, (23)
where {varepsilon i}i=1n are independent errors. Within a given sample, the nonparametric estimate &mcirc; of the function m has the general form
<A><AC>m</AC><AC>ˆ</AC></A>(<B><UP>X</UP></B>)=<LIM><OP>∑</OP><LL><UP>i</UP></LL></LIM> w<SUB><UP>i</UP></SUB>(<B><UP>X</UP></B>)y<SUB><UP>i</UP></SUB> <UP>with</UP> <LIM><OP>∑</OP><LL><UP>i</UP></LL></LIM> w<SUB><UP>i</UP></SUB>(<B><UP>X</UP></B>)1. (24)
Several nonparametric estimates have been proposed in the statistical literature. In this work, we considered the shifted Nadaraya-Watson (SNW) estimate (Mammen and Marron, 1997), which appears stable when data are sparse (Hardle and Marron, 1995). The SNW estimate is defined as
<A><AC>m</AC><AC>ˆ</AC></A><SUB><UP>SNW</UP></SUB>(x)=<FR><NU><LIM><OP>∑</OP><LL><UP>i=1</UP></LL><UL><UP>n</UP></UL></LIM> K<SUB><UP>h</UP></SUB>(x<SUB><UP>i</UP></SUB>−&xgr;<SUP><UP>−</UP>1</SUP>(x))y<SUB><UP>i</UP></SUB></NU><DE><LIM><OP>∑</OP><LL><UP>i=1</UP></LL><UL><UP>n</UP></UL></LIM> K<SUB><UP>h</UP></SUB>(x<SUB><UP>i</UP></SUB>−&xgr;<SUP><UP>−</UP>1</SUP>(x))</DE></FR>, (25)
with
&xgr;(x)=<FR><NU><LIM><OP>∑</OP><LL><UP>i=1</UP></LL><UL><UP>n</UP></UL></LIM> K<SUB><UP>h</UP></SUB>(x<SUB><UP>i</UP></SUB>−x))x<SUB><UP>i</UP></SUB></NU><DE><LIM><OP>∑</OP><LL><UP>i=1</UP></LL><UL><UP>n</UP></UL></LIM> K<SUB><UP>h</UP></SUB>(x<SUB><UP>i</UP></SUB>−x)</DE></FR>. (26)
In the previous formulae, Kh(·) is a renormalized kernel function, taken as a Gaussian density throughout this work, being h-1 K (·/h). The parameter h is the bandwidth or smoothing parameter. The role of h is essential in kernel regression. On the one hand a too low value leads to highly rugged surfaces and variant or sample dependence. On the other hand, high values introduce bias in the curve estimation and eliminate the fine structure of data. A simple quantification of this trade-off is given by the integrated mean square error (Hardle and Marron, 1995). It can be shown (Hardle and Marron, 1995) that the asymptotically optimal global bandwidth (i.e., the one that minimizes the integrated mean square error) for the SNW estimate is given by
h<SUB><UP>opt</UP></SUB>=<FENCE><FR><NU>∫ V(<B><UP>X</UP></B>)</NU><DE>4n ∫ B<SUP>2</SUP>(<B><UP>X</UP></B>)</DE></FR></FENCE><SUP>1/5</SUP>, (27)
where V(X) is the variance and B2(X) is the square bias. These terms can be estimated from the sample itself and plugged into Eqs. 24 and 25. Fast and simple estimates of these functions can be obtained based on polynomials and histograms, respectively. In the present work, the block method of Hardle and Marron (1995) for the estimates of the bias and variance appearing in Eq. 27 proved to be robust. This method is based on dividing the sample in histograms or blocks and fitting a low degree polynomial in each block.



    RESULTS
TOP
ABSTRACT
INTRODUCTION
METHODS
RESULTS
CONCLUDING REMARKS
REFERENCES

Analysis of the lattice protein models

Sequence factor analysis

We start by trying to identify from the sequences, by using multivariate analysis, clusters of positions that can account for the kinetic differences between slow and fast folding lattice proteins. Fig. 1 and Table 1 summarize the results of an FA (Reyment and Joreskog, 1996) (see Methods for description of this multivariate technique) applied to the fast and slow folding protein sequences. The first 200 slow-folding and the last 217 fast-folding sequences were subjected to the analysis. In Fig. 1, A and B, the original sequence space spanned by the 200 + 217 = 417 aligned sequences is projected onto the first two dimensions or factors (Fig. 1 B), or onto the first and third dimension (Fig. 2). In Fig. 1, loadings are represented as diamonds, whereas scores are represented as lines (see Methods for definitions of loadings and scores). These lines connect the 417 sequences following the in silico evolutionary time of the evolutionary experiment. Sequences in the alignment are classified into three groups: slow-folding sequences (represented by a blue line), fast-folding sequences (red line), and very-fast-folding sequences (magenta line). Note that the coordinates of the line points in the second factor are very different for slow- and fast-folding sequences, i.e., the points of these two groups of sequences are well separated along this axis. Also, there is some separation along the first factor of the very-fast-folding sequences from the rest. Finally, note that there is no significant discrimination along the third factor (Fig. 1 B). Thus, this analysis suggests that, at the sequence level, the main differences in the kinetic behavior of the lattice proteins can be explained by two underlying factors.




View larger version (16K):
[in this window]
[in a new window]
 
FIGURE 1   Factor analysis of fast- and slow-folding sequences. The analysis included the first 200 slow-folding sequences and the last 217 fast-folding sequences. (A) Two-dimensional (2D) plot of the two first factors. (B) 2D plot of the second and third factor. In both figures, loadings are represented as diamonds, whereas scores are represented as lines connecting sequences following the evolutionary time. Slow-folding sequences are represented by a blue line, fast-folding sequences by a red line, and very-fast-folding sequences by a magenta line.



                              
View this table:
[in this window]
[in a new window]
 
TABLE 1   Factor Analysis of fast- and slow-folding sequences




View larger version (99K):
[in this window]
[in a new window]
 
FIGURE 2   Mapping of the residues heavily loading in the first two factors onto the lattice structure, corresponding to the global minimum energy conformation of the 1000 48-mer sequences analyzed in this paper. The first factor is shown in blue and the second factor in yellow.

The first factor, explaining a higher proportion of the variance, is what we call the loop factor, because residues heavily loading in the first component are loop residues (Fig. 2). Within the loop residues, two types of positions can be distinguished, evolving in an anticorrelated fashion: position #7 becomes more hydrophilic, whereas the rest of residues become more hydrophobic. The later are in contact and apparently are required to fix or stabilize the loop conformation, whereas the former possibly avoids competing interactions with the core, making the energy landscape less rugged. Karplus and coworkers (Dinner et al., 1998) have recently obtained similar conclusions for a different system using a different potential energy function. However, they observed that only the weakening of interactions between noncore residues increases the folding rate.

The second factor accounting for a high proportion of the variance is the nucleus factor. A subset of the residues belonging to the previously detected thermodynamic folding nucleus (Mirny et al., 1998) is found here (Fig. 2). However, not all the residues previously described as being part of the folding nucleus are discriminating (compare the yellow residues in Fig. 2 with the red residues in Fig. 5). This is, in part, the result of complete conservation of some of these residues between fast- and slow-folding sequences (e.g., for residues #5 and 20). There are also some exceptions, for example, with residue #16. In contrast, residues #19 and 30 contribute to the differences in rate, but are not part of the thermodynamic folding nucleus.

It is important to note that the ranking of factors obtained from the FA model only reflects the fact that there are more representatives of the medium- and fast-folding sequences than of the slow-folding ones, resulting from the fast optimization of the folding rate during the first mutations. Thus, from the viewpoint of explaining the covariance structure, the loop factor is more important (i.e., accounts for a higher proportion of the variance) than the nucleus factor. However, the nucleus factor is far more important in its ability to discriminate fast- from slow-folding sequences, whereas the loop factor only makes a modest contribution. This is evidenced by the separation of the fast- and slow-folding proteins along both axes.

To confirm the results of this analysis, it is also of interest to carry out an FA using the slow-folding sequences alone. Figure 3 and Table 2 show the results. The analysis indicates that the third factor can explain the fast increment in the folding rate of the sequences (Fig. 3), whereas the first two factors have much less explanatory power. Residues loading in Factor 3 (Table 2) are mainly a subset of those residues (#15, 41, and 35) also identified as being part of the thermodynamic folding nucleus by Shakhnovich and coworkers (Mirny, et al., 1998) (Fig. 5).




View larger version (11K):
[in this window]
[in a new window]
 
FIGURE 3   Factor Analysis of the slow-folding sequences. The analysis included the first 100 slow-folding sequences. Symbols as in Fig. 1. The 2D plot of the first and third factor is shown.



                              
View this table:
[in this window]
[in a new window]
 
TABLE 2   Factor Analysis of the slow-folding sequences only

Thus, differences in kinetic behavior of the model proteins originate from sequence differences detectable by multivariate analysis of the alignments. Folding rate is optimized rapidly, with the bulk of the optimization depending only on mutations in a small number of residues, a subset of the thermodynamic folding nucleus. Once this nucleus is stabilized, a slight increment in folding rate can be achieved by loop mutations.

Structural vibrational analysis

Following the ideas of Demirel et al. (1998), we identify those residues from the protein structure which can be described as being kinetically hot. By this, we mean that these residues present a high vibrational frequency in the folded state, implying that their motion is of small amplitude. This is a signature of a steep potential energy surface around their mean position, which arises from the contact coordination number of the residue and the topological restrictions imposed by the structure. The vibrations of the whole structure can be decomposed into modes, with all residues vibrating in a particular mode presenting concerted motions. Thus, residues forming part of the same high-frequency modes correspond to those residues having a dense network of interactions, being rather localized in space and moving coherently in the context of the tertiary fold. As demonstrated by Demirel et al., they are candidates to be part of the folding nucleus.

The main features of the protein fluctuation in the folded state can be obtained by application of the GNM (Bahar, et al., 1997). Thus, a vibrational analysis of the lattice conformation in the global energy minimum has been performed using the GNM (see Methods). The average residue fluctuations in the last N - 4 to N - 1 modes have been compared with the loadings in the second factor obtained from the FA of the fast- and slow-folding sequences (see above). The values obtained from the raw analysis were smoothed by a shifted Nadaraya-Watson nonparametric regression (Mammen and Marron, 1997) using the Hardle-Marron automatic global bandwidth selector (Hardle and Marron, 1995). Missing values from the loadings (i.e., null values) were removed from the design points and predicted by the regression curve (see Methods). The results are shown in Fig. 4. Residues #5 and 20, forming part of the thermodynamic folding nucleus detected by Shakhnovich and coworkers but kept constant during the evolutionary experiment, are predicted to contribute significantly to the increased folding rate, particularly residue #20. The rest of the residues not mutated during the computational experiment are predicted to be almost irrelevant for folding-rate optimization.




View larger version (16K):
[in this window]
[in a new window]
 
FIGURE 4   The second factor obtained from the FA of fast- and slow-folding sequences is plotted as a function of residue number (blue line), together with the average mean square fluctuation per residue in the last N - 4 to N - 1 modes, calculated from a GNM analysis of the global minimum of the sequences. Both lines have been previously smoothed by a nonparametric regression, as described in Methods.

The similarity in the behavior of both curves in Fig. 4 is striking. We note that one of the curves is derived from the analysis of the sequences alone, whereas the other one derives from an analysis devoid of sequence information, of the topology in the global minimum. This result implies that, at least for this simple lattice model, the topology itself determines which residues will participate both in the folding nucleus and in the optimization of the folding rate. An interpretation of this conclusion is that both analyses are reflecting the same physical process. The multivariate analysis of the sequences uncovers the subset of the most important positions in changing the folding rate, and therefore the free energy of the transition state. In contrast, it is thought that the selection of these residues in the folding nucleus arises in the physical system as the result of a topology-dependent optimal enthalpy-entropy balance in making particular intramolecular contacts (Alm and Baker, 1999; Galzitskaya and Finkelstein, 1999; Munoz and Eaton, 1999). The vibrationally coupled units detected by the GNM in the native structure appear to be sensitive to a similar balance, as empirically shown by Demirel et al. (1998), perhaps as a result of the topological resemblance between the transition and native states.

Interaction energy analysis

Interaction energy analysis has been used to obtain a deeper insight into the energetics underlying the results of the GNM and FA analysis. It would be expected that the residues identified by the GNM and FA interact more favorably than average with the rest of the protein. If so, that would imply that they might constitute a critical folding nucleus. For a heteropolymer to fold to a specific native structure, native interactions must be stronger than those present in alternative nonnative states; that is, the energy gap between the native state and first excited state must exceed some threshold value (Sali et al., 1994). At the same time, to fold fast, it is necessary to lower the free energy barrier of the transition state, corresponding to the formation of the critical nucleus (Shakhnovich, 1997). Thus, stabilization of the interactions of this nucleus with respect to the rest of the interactions increases the folding rate. Is this description consistent with our findings of the sequence-topology connection in the model proteins?

This is supported by the findings shown in Fig. 7. Here, we plot a parameter related to the average local frustration of each residue in the lattice protein versus a parameter reflecting the average fragment stability in which the residue is immersed (see Methods). A strong segregation of residues can be observed for the fast-folding sequences when compared with the slow-folding ones: in the fast-folding sequences, most of residues are close to inert, with slightly repulsive environments and with a small value of frustration. Among the group of residues with low values of local frustration are residues #16, 41, and 20, all of them important in increasing the folding rate (Fig. 4), with residues #41 and 16 forming a contact and loading together in the FA. Thus, during the optimization of the folding rate, strong favorable interactions are placed between these residues of the folding nucleus, and mild repulsive or inert interactions (with respect to the average) are placed elsewhere. These results support the picture of the sequence-topology connection and suggest that a way to engineer fast folding in real proteins is to select the kinetically hot residues determined by the vibrational analysis of the topology and then to optimize only the interactions of these residues.

Presence of correlated mutations

After the folding rate is optimized, accepted mutations will tend to minimally perturb the stability of the critical nucleus. This foldability requirement should tend to create restrictions in sequence space. It is becoming well established that residues forming part of the folding nucleus tend to be conserved (Ptitsyn, 1998; Shakhnovich, et al., 1996), but it is also possible to imagine some other, more subtle restrictions imposed by the folding nucleus. Thus, the mutational behavior of other residues outside the nucleus could also be restricted in varying degrees, creating patterns of variability, for example, in the form of correlations. Indeed, Table 3 and Fig. 6 show that correlated mutations emerge from the set of evolutionary related sequences under fast folding pressure. Most important, the residues involved in correlation are either forming contacts or shifted in sequence by, at most, two residues in contact map space (Table 4).



                              
View this table:
[in this window]
[in a new window]
 
TABLE 3   Factor Analysis of fast-folding sequences



                              
View this table:
[in this window]
[in a new window]
 
TABLE 4   Average sequence distance between predicted contacts and kinetically hot residues, together with its statistical significance, evaluated for a representative set of real proteins

An FA of the fast-folding sequences alone shows that the residues having correlated mutations are mostly in a common subspace (first factor of the loadings matrix, accounting for the 9.6% of the total variance, see Table 4). This means that all pairs of correlated mutations change roughly together and in a coherent fashion. The existence of this concerted behavior suggests the presence of a common physical mechanism explaining the changes in all pairs of correlated positions at the same time. The physical basis of this underlying factor appears to be the closeness to the critical residues (Fig. 5). This has been established by comparing the probability of obtaining the observed average sequence distance by chance between the correlated positions and the kinetically hot residues (Fig. 6 A). For the closest element of the correlated pair, the observed average closeness in sequence of 0.85 is significant at the 95% confidence interval (Zscore = -2.06). However, the results are not significant for the other element of the pair (Zscore = -1.02). In 3D space, the results are similar, showing that residues having correlated mutations form a shell around the kinetically hot residues, which cannot be explained by chance (Fig. 6 B). Thus, on the basis of this analysis, several structural phases can be distinguished in the sequence space of a protein under the steady-state folding rate regime: the (almost) conserved folding nucleus, the shell of residues around it involved in correlated mutations, and the rest of residues mutating (almost) independently.




View larger version (104K):
[in this window]
[in a new window]
 
FIGURE 5   Display of the predicted contacts for the model sequences using an alignment based on the last 517 fast-folding sequences. Predicted contacts are shown as dotted lines connecting the corresponding residues. Positions in red correspond to the residues participating in the thermodynamic folding nucleus as described by Shakhnovich and coworkers (Mirny et al., 1998).




View larger version (20K):
[in this window]
[in a new window]
 
FIGURE 6   (A) Probability distribution of the average sequence distance separation between the folding nucleus and the predicted contacts in the fast-folding model sequences. (B) Percentage of distances within a given distance for the lattice structure, together with the average values of the observed distances between correlated positions and kinetically hot residues.




View larger version (16K):
[in this window]
[in a new window]
 
FIGURE 7   Plot of the average fragment stability (Eq. 22) versus the average residue frustration (Eq. 21) for the 1000 48-mer model sequences using the Miyazawa-Jernigan (Miyazawa and Jernigan, 1985) interaction energy matrix. The numbers indicate the corresponding residue number of the 48-mer, and the lines link pairs of residues predicted to be in contact. (A) Plot obtained from the fast-folding sequences. (B) The same plot from the slow-folding sequences.

Analysis of real proteins

We were interested to see whether there is a qualitative correspondence between the results obtained with the lattice proteins and what can be observed in real proteins. Although the situation with real proteins is more complex, at least a qualitative agreement should be obtained. For such comparison to be meaningful, the same computational procedures should be applied in both cases, with the same coarse graining. Thus, core residues were automatically selected from the 3D structure in the same way it was done for the lattice proteins, based on the GNM calculations, while correlations in the alignments were also computed following the same procedure used with the lattice proteins. We created a test set including proteins for which experimental data about their folding nucleus were available, as well as proteins for which contacts were predicted blindly, in advance of the knowledge of the structure, during the recent CASP3 contest (http://PredictionCenter.llnv.gov/casp3) (see Ortiz et al., 1999).

Contacts are predicted from the MSAs as described (Ortiz et al., 1999) (see Methods). Overall, the prediction accuracy in this small sample is similar to that obtained when a larger number of proteins was used when developing the prediction method. Thus, most of the predicted contacts have correspondence with a real contact within a local sequence window of delta  = 3 (data not shown). In contrast, from the topology of the protein, a vibrational analysis with the GNM was conducted as described (Bahar, et al., 1997). In agreement with Demirel et al. (1998), we observe a statistically significant overlap between the experimentally described folding nucleus and the kinetically hot residues (data not shown). Similar to the results obtained with the lattice protein, we also observe a statistically significant short sequence distance between the closest element of the predicted contact from the correlated mutation analysis and the closest kinetically hot residue (Table 4), although this is not the case for the second element of the pair (Table 4). Similar results were obtained when analyzing the relationship in 3D space. Thus, it is found that residues from the closest member of a correlated mutation pair tend to appear in the first coordination shell of a kinetically hot residue more often than expected by chance (see Table 5 and Fig. 8). A somewhat weaker tendency is observed for the second element, which tends to be located in the second solvation shell of the kinetically hot residues (Fig. 8) and in contact with the first element of the pair. Both elements have significantly different radial distribution functions with respect to the kinetically hot residues (Table 5). A qualitative picture of the relationships can be seen in Fig. 9.



                              
View this table:
[in this window]
[in a new window]
 
TABLE 5   Three-dimensional distance between predicted contacts and kinetically hot residues, together with the results of the computation of its statistical significance




View larger version (9K):
[in this window]
[in a new window]
 
FIGURE 8   Profile of the preference of observing a residue having correlated mutations with respect to the average values observed in protein structures at a given distance. The points represent the raw values obtained from the direct analysis. The line is the nonparametric regression curve (see Methods). The values have been shifted to zero at the origin.




View larger version (29K):
[in this window]
[in a new window]
 
FIGURE 9   A schematic picture of the 3D relationship between kinetically hot/folding nucleus residues and residues having correlated mutations in sequence space.



    CONCLUDING REMARKS
TOP
ABSTRACT
INTRODUCTION
METHODS
RESULTS
CONCLUDING REMARKS
REFERENCES

Topology is a main factor determining the identity of the residues forming the folding nucleus, and folding rate is strongly dependent on the stability of a subset of these residues. The requirement of a minimum stability in the folding nucleus appears to create some restrictions in the sequence space of the residues forming the coordination shell around the critical nucleus. One realization of these restrictions is in the form of correlated mutations, which, as a result of these topological constraints, tend to occur with higher frequency between contacting residues. These results are consistent with a nucleation-condensation model for protein folding and have implications in the development of methods for structure prediction.


    ACKNOWLEDGMENTS

We thank L. A. Mirny and E. I. Shakhnovich for making their database of model proteins available to us. A.R.O. acknowledges partial support from the Spanish Ministry of Education. A.R.O. also acknowledges Pere Constans for discussions and source code concerning nonparametric regression. Support from National Institutes of Health grant GM37408 is gratefully appreciated.


    FOOTNOTES

Received for publication 18 June 1999 and in final form 24 February 2000.

Address reprint requests to Dr. Angel Ramirez Ortiz, Assistant Professor, Department of Physiology and Biophysics, Mount Sinai School of Medicine, One Gustave Levy Pl., Box 1218, New York, NY 10029.



    REFERENCES
TOP
ABSTRACT
INTRODUCTION
METHODS
RESULTS
CONCLUDING REMARKS
REFERENCES

Biophys J, October 2000, p. 1787-1799, Vol. 79, No. 4
© 2000 by the Biophysical Society   0006-3495/00/10/1787/13  $2.00



This article has been cited by other articles:


Home page
Protein Sci.Home page
C. Blouin, D. Butt, and A. J. Roger
Rapid evolution in conformational space: A study of loop regions in a ubiquitous GTP binding domain
Protein Sci., March 1, 2004; 13(3): 608 - 616.
[Abstract] [Full Text] [PDF]


Home page
Biophys. JHome page
J. P. Zbilut, A. Colosimo, F. Conti, M. Colafranceschi, C. Manetti, M. Valerio, C. L. Webber Jr., and A. Giuliani
Protein Aggregation/Folding: The Role of Deterministic Singularities of Sequence Hydrophobicity as Determined by Nonlinear Signal Analysis of Acylphosphatase and A{beta}(1-40)
Biophys. J., December 1, 2003; 85(6): 3544 - 3557.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Ortiz, A. R.
Right arrow Articles by Skolnick, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Ortiz, A. R.
Right arrow Articles by Skolnick, J.