Biophys J, November 2000, p. 2252-2258, Vol. 79, No. 5
Complex Systems Division, Department of Theoretical Physics, Lund University, Sölvegatan 14A, S-223 62 Lund, Sweden
ABSTRACT
We study the statistical properties of hydrophobic/polar model sequences with unique native states on the square lattice. It is shown that this ensemble of sequences differs from random sequences in significant ways in terms of both the distribution of hydrophobicity along the chains and total hydrophobicity. Whenever statistically feasible, the analogous calculations are performed for a set of real enzymes, too.
INTRODUCTION
Functional protein sequences exhibit the ability to fold spontaneously into a unique native state (Creighton, 1993). A natural step in order to understand this crucial property is to compare good and bad folding sequences in simple models where conformational space can be properly explored. Most such studies have been directed toward identifying physical characteristics of good folders, and in this important area some progress has been made (Šali et al., 1994; Bryngelson et al., 1995; Klimov and Thirumalai, 1998; Nymeyer et al., 1998). In this paper we address the question of how good folders differ from random sequences in purely statistical terms. A related but different topic is how sequences that share the same (unique) native state are distributed in sequence space. This question and its evolutionary implications have recently attracted considerable attention (Li et al., 1996; Bornberg-Bauer, 1997; Govindarajan and Goldstein, 1997a,b; Bastolla et al., 1999; Broglia et al., 1999; Bornberg-Bauer and Chan, 1999; Tiana et al., 2000).
In a recent study of a hydrophobic/polar off-lattice model, it was found that good folders tend to show negative hydrophobicity correlations along the chains (Irbäck et al., 1997). Moreover, the analogous calculations gave qualitatively similar results for a major class of real proteins, corresponding to typical total hydrophobicities (Irbäck et al., 1996). On the other hand, the opposite behavior, positive hydrophobicity correlations, has been reported for a class of designed model sequences that display certain protein-like features (Khokhlov and Khalatur, 1998, 1999). These designed sequences are, for instance, not meant to have unique native states, so the different results do not represent a contradiction. They do show, however, that the question of sequence correlations in proteins is delicate and requires careful analysis.
The main goal of this paper is to test the robustness of the conclusion that good folding model sequences as well as functional proteins show negative hydrophobicity correlations. To this end we perform new calculations for both model and real sequences. The model we study is the minimal HP model on the square lattice (Lau and Dill, 1989; Dill et al., 1995). This choice makes it possible for us to improve significantly on the statistics in the previous study (Irbäck et al., 1997), which was based on an off-lattice model. The real sequences studied are single-domain enzymes taken from the CATH protein structure classification database (Orengo et al., 1997), which we hope display statistical properties representative of functional (globular) folding units. With this restriction on protein type, it turns out that the previous, somewhat artificial, restriction on total hydrophobicity (Irbäck et al., 1996) can be lifted.
METHODS
Sequences
Let us first define the sequences studied. The real sequences studied are the 173 nonhomologous single-domain enzymes found in the October 1998 release of the CATH database (Orengo et al., 1997). These sequences are transformed into binary hydrophobicity strings by taking the six amino acids Leu, Ile, Val, Phe, Met, and Trp as hydrophobic (σ_i = 1) and the others as hydrophilic (σ_i = −1). This choice is somewhat arbitrary. Therefore, we also tried a 20-valued hydrophobicity scale, which did not affect any of the conclusions below. In CATH, the most general level of classification is denoted "class" and describes the relative content of α helices and β sheets. Below, the class dependence of our results is checked by separate calculations for each of the three major classes: mainly α, mainly β, and α-β. A fourth class, low secondary structure content, exists but is not considered separately, as only 3 of the 173 sequences belong to it. In our calculations we also divide the sequences into extracellular and intracellular ones. Following Martin et al. (1998), we take the presence of a disulphide bridge as an indicator of extracellular location. The number of enzymes in the different subsets studied can be found in Table 3 below.
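The binary mapping described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example sequence are hypothetical, and the one-letter codes L, I, V, F, M, W are assumed to stand for the six hydrophobic residues named in the text.

```python
# One-letter codes for the six residues treated as hydrophobic in the text:
# Leu, Ile, Val, Phe, Met, Trp.
HYDROPHOBIC = frozenset("LIVFMW")


def to_binary_hydrophobicity(sequence: str) -> list[int]:
    """Map a one-letter amino acid sequence to sigma_i = +1 (hydrophobic) or -1."""
    return [1 if residue in HYDROPHOBIC else -1 for residue in sequence.upper()]


def total_hydrophobicity(sigma: list[int]) -> int:
    """Total hydrophobicity M, the sum of all sigma_i."""
    return sum(sigma)


# Hypothetical six-residue fragment, used only to show the call pattern.
sigma = to_binary_hydrophobicity("MKVLAW")
print(sigma, total_hydrophobicity(sigma))
```

A 20-valued scale, as mentioned in the text, would replace the set lookup with a table of per-residue values.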
The model we use is the minimal two-dimensional HP model (Lau and Dill, 1989), whose behavior is known in quite some detail (Dill et al., 1995). It contains only two types of amino acids, H (hydrophobic, σ_i = 1) and P (polar, σ_i = −1), and the chain conformation is represented as a self-avoiding walk on a lattice. The formation of a hydrophobic core is favored by defining the energy as minus the number of HH pairs that are nearest neighbors on the lattice but not along the chain. On the square lattice, it turns out that this simple choice of energy function is sufficient to get a significant number of sequences with unique ground states (Chan and Dill, 1994; Irbäck and Sandelin, 1998); complete enumeration of all possible sequences and structures shows that the fraction of such sequences is roughly 2% for N ≤ 18. Throughout this paper we consider all HP sequences that have unique ground states as good folding sequences. Also central is that the sequences are able to fold fast into their native states, a requirement that we ignore. This is a reasonable simplification because the sequences are short and because almost all have the same energy gap between ground state and next lowest level.
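As a minimal sketch of the energy function and the uniqueness criterion described above, the following Python code enumerates self-avoiding walks on the square lattice for a short chain and counts how many conformations reach the lowest energy. The function names are hypothetical, and the symmetry reduction (first step fixed along x, first turn forced upward) is one common convention assumed here, not necessarily the enumeration scheme used in the cited studies.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def hp_energy(sequence, coords):
    """Energy = -(number of HH pairs adjacent on the lattice but not along the chain)."""
    index_at = {xy: i for i, xy in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that every lattice pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                contacts += 1
    return -contacts


def conformations(n):
    """Yield self-avoiding walks with n >= 2 sites, one per rotation/reflection class."""
    def extend(path, occupied, turned):
        if len(path) == n:
            yield list(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt in occupied:
                continue
            if not turned and dy == -1:
                continue  # the first off-axis step must go up (removes mirror images)
            path.append(nxt)
            occupied.add(nxt)
            yield from extend(path, occupied, turned or dy != 0)
            path.pop()
            occupied.remove(nxt)

    start = [(0, 0), (1, 0)]  # fixing the first step removes rotations
    yield from extend(start, set(start), False)


def ground_state(sequence):
    """Return (lowest energy, number of symmetry-distinct conformations attaining it)."""
    best_energy, degeneracy = None, 0
    for coords in conformations(len(sequence)):
        energy = hp_energy(sequence, coords)
        if best_energy is None or energy < best_energy:
            best_energy, degeneracy = energy, 1
        elif energy == best_energy:
            degeneracy += 1
    return best_energy, degeneracy


# A sequence counts as a "good folder" in the sense of the text if degeneracy == 1.
print(ground_state("HHHH"))
```

Pure-Python enumeration of this kind is only practical for short chains; it is meant to make the definition concrete, not to reproduce the N = 18 enumeration.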
Sequence correlations
Our statistical analysis of hydrophobicity strings can be divided into two parts. The first part deals with the distribution of hydrophobicity along the chains: how does a "good" sequence with length N and total hydrophobicity
M = σ_1 + σ_2 + ... + σ_N     (1)
differ from a random sequence with the same N and M?
In addition to the distribution of hydrophobicity along the chains, we also study the distribution of the total hydrophobicity M. This analysis relies entirely on comparisons between observed sequences, which makes it statistically more difficult, especially for the real sequences with varying N.
The blocking method
In this method, for a given block size s, the sequence is divided into blocks, each consisting of s consecutive σ_i along the chain. The block variable σ_k^(s) is then defined as the sum of the s σ_i values in block k (k = 1, ..., N/s). A useful quantity is the mean-square fluctuation ψ^(s)
[Equation 2]
[Equation 3]
The average of ψ^(s) over all possible sequences with given N and M takes the simple form (Irbäck et al., 1996)
[Equation 4]
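The blocking procedure can be illustrated with a short Python sketch. It rests on an assumption: the mean-square fluctuation is taken to be the average over blocks of (σ_k^(s) − sM/N)², which may differ from the paper's Eqs. 2 and 3 by a normalization. Under that assumption, the average over all sequences with fixed N and M follows from the variance of a sum drawn without replacement, s(N − s)/(N − 1) · (1 − M²/N²). The function names are hypothetical, and the brute-force check is there only to confirm this closed form for small N.

```python
from itertools import combinations


def block_fluctuation(sigma, s):
    """Mean-square deviation of block sums from their expected value s*M/N (assumed form)."""
    n, m = len(sigma), sum(sigma)
    n_blocks = n // s  # any incomplete trailing block is dropped (an assumption)
    expected = s * m / n
    block_sums = [sum(sigma[k * s:(k + 1) * s]) for k in range(n_blocks)]
    return sum((b - expected) ** 2 for b in block_sums) / n_blocks


def random_average(n, m, s):
    """Closed-form average of block_fluctuation over all sequences with given N and M."""
    return s * (n - s) / (n - 1) * (1 - (m / n) ** 2)


def brute_force_average(n, m, s):
    """Average of block_fluctuation over every arrangement of the +1 sites (small N only)."""
    n_plus = (n + m) // 2
    total, count = 0.0, 0
    for plus_sites in combinations(range(n), n_plus):
        sigma = [-1] * n
        for i in plus_sites:
            sigma[i] = 1
        total += block_fluctuation(sigma, s)
        count += 1
    return total / count


# A strictly alternating chain has vanishing fluctuations for s = 2.
print(block_fluctuation([1, -1] * 4, 2), random_average(8, 0, 2))
```

A sequence whose value falls below `random_average` has its hydrophobic residues spread more evenly along the chain than a random arrangement with the same composition.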
The distribution of total hydrophobicity
We study the M distribution for different fixed N, focusing on the mean ⟨M⟩_N (the subscript indicates fixed N) and the normalized variance χ
[Equation 5]
[Equation 6]
Here h_i = (1 + ⟨σ_i⟩_N)/2 denotes the fraction of sequences that have σ_i = 1, and c_ij = ⟨σ_i σ_j⟩_N − ⟨σ_i⟩_N ⟨σ_j⟩_N is the σ_i, σ_j correlation. So, if the σ_i values are uncorrelated, then
[Equation 7]
[Equation 8]
The mean ⟨M⟩_N can be approximately described by a simple linear relation, ⟨M⟩_N = (2h̄ − 1)N. As an effective measure of the fluctuations in M, we therefore consider
[Equation 9]
If the σ_i values for each N were uncorrelated with identical h_i = h̄, then we would have
[Equation 10]
The quantities ψ^(s) and χ are fundamentally different measurements. In the blocking method individual sequences are compared to random sequences with the same N and M. Hence, ψ^(s) provides direct information on the distribution of σ_i = ±1 along the chains. This is not true for χ and the correlation c_ij. This correlation is not necessarily physical. The behavior of the analogue of c_ij in the ordered phase of an Ising magnet provides an illustration of this. In this case, c_ij does not vanish at large distance, although the physical correlation length is finite.
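The relation between the normalized variance, the profile h_i, and the correlations c_ij can be written out as follows. This is a sketch under stated assumptions rather than a transcription of Eqs. 5 to 8: the variance is assumed to be normalized by N, and the labels χ_0 and χ_1 are assigned to the uniform and non-uniform uncorrelated predictions to match the way they are used in Results. The only inputs are M = Σ_i σ_i, σ_i² = 1, and the definitions of h_i and c_ij given in the text.

```latex
\begin{align}
\chi &\equiv \frac{\langle M^2\rangle_N - \langle M\rangle_N^2}{N}
      && \text{(normalization by $N$ is an assumption)} \\
     &= \frac{1}{N}\Big[\sum_{i}\big(1 - \langle\sigma_i\rangle_N^2\big)
        + \sum_{i\neq j} c_{ij}\Big]
      = \frac{1}{N}\Big[\sum_{i} 4h_i(1-h_i) + \sum_{i\neq j} c_{ij}\Big], \\
\chi_1 &= \frac{4}{N}\sum_{i} h_i(1-h_i)
      && \text{(uncorrelated $\sigma_i$, $c_{ij}=0$ for $i\neq j$)}, \\
\chi_0 &= 4h(1-h)
      && \text{(uncorrelated and uniform, $h_i = h$ for all $i$)}.
\end{align}
```

The second line uses ⟨σ_i⟩_N = 2h_i − 1, so that 1 − ⟨σ_i⟩_N² = 4h_i(1 − h_i). Because h(1 − h) is concave, χ_1 ≤ χ_0 whenever h is the mean of the h_i, so any non-uniformity of the profile can only lower the uncorrelated prediction.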
Individual structures
As mentioned in the Introduction, several recent model studies have addressed the question of how sequences that fold to the same native state are related. In particular, using an HP-like model with compact structures only, Li et al. (1996) found that structure-preserving mutations tend to be largely independent for highly designable structures. To see whether this behavior is consistent with our analysis, we also perform two measurements for different fixed structures.
Consider a given structure r, and let {h_i(r)} be the corresponding hydrophobicity profile (h_i(r) is the probability that σ_i = 1). The first quantity we calculate is
[Equation 11]
where χ(r) is defined as in Eq. 5 but for fixed structure. The quantity in Eq. 11 measures the average σ_i, σ_j correlation for fixed structure (see Eq. 6). The second quantity is the entropy
[Equation 12]
of independent σ_i with hydrophobicity profile {h_i(r)}. If the σ_i values are approximately independent, then e^S provides an order-of-magnitude estimate of the actual number of sequences, N_r. If this is not the case, then e^S overestimates N_r.
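The entropy estimate can be made concrete with a small Python sketch. It assumes the standard entropy of independent binary variables, S = −Σ_i [h_i ln h_i + (1 − h_i) ln(1 − h_i)], which is the natural reading of the description above but is not copied from Eq. 12. The function names and the two toy sequence sets are hypothetical.

```python
import math


def hydrophobicity_profile(sequences):
    """h_i = fraction of the given +/-1 sequences that have sigma_i = +1."""
    n = len(sequences[0])
    return [sum(1 for seq in sequences if seq[i] == 1) / len(sequences) for i in range(n)]


def profile_entropy(profile):
    """Entropy of independent sites with P(sigma_i = 1) = h_i, using 0 ln 0 = 0."""
    def site_entropy(p):
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log(p) + (1 - p) * math.log(1 - p))

    return sum(site_entropy(p) for p in profile)


# Independent case: one free site, two sequences, so exp(S) matches N_r = 2.
independent = [[1, 1, -1], [1, -1, -1]]
# Correlated case: two sites that always agree, so exp(S) = 4 overestimates N_r = 2.
correlated = [[1, 1], [-1, -1]]

for sequences in (independent, correlated):
    entropy = profile_entropy(hydrophobicity_profile(sequences))
    print(len(sequences), math.exp(entropy))
```

The second toy set shows the overestimate mentioned in the text: correlations between sites reduce the true number of sequences below e^S.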
RESULTS
In this section we present the results of our analyses of the mean-square block fluctuations ψ^(s) and the distribution of total hydrophobicity, M, for model and real sequences. We end the section with some comments on our model results and related studies of similar models.
The blocking method
Model sequences
In our block variable analysis of HP sequences, we consider the 6349 N = 18 sequences that have unique native states, which can be obtained by exhaustive enumeration (Chan and Dill, 1994). We calculate ψ^(s) in two ways for each sequence: first, for the full sequence; and second, after elimination of two amino acids at each end. Fig. 1 shows the results of both these calculations. We see that the average ψ^(s) is smaller than for random sequences, irrespective of whether the endpoints are included or not. The conclusion that ψ^(s), on average, is suppressed for good sequences is in perfect agreement with earlier results for a different model (Irbäck et al., 1996).
Enzymes
We now repeat essentially the same analysis for the enzymes. The only difference is that, because N is not fixed, the hydrophobicity profile h is taken to be a function of the relative position along the chains. To calculate this profile, we divide the interval of relative positions from 0 (N end) to 1 (C end) into 100 bins. The results obtained are shown in Fig. 2 a. We see that the profile is approximately constant throughout the interval from 0 to 1.
We also examined the position dependence of the block fluctuations for s = 4 (see Eq. 2) as a function of relative position, using 25 bins. The results are shown in Fig. 2 b. Although the uncertainties are somewhat large, there is no sign of the ends behaving differently.
Given these two findings, we calculate the block fluctuations using the full sequences, without any elimination of amino acids at the ends.
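The binned hydrophobicity profile for chains of varying length can be sketched as follows. The convention for relative position, (i + 1/2)/N for residue i counted from zero, is an assumption made here for illustration, as is the function name; the text only specifies that the interval from 0 to 1 is divided into bins.

```python
def binned_profile(sequences, n_bins=100):
    """Fraction of hydrophobic residues (sigma_i = +1) per bin of relative chain position."""
    hydrophobic = [0] * n_bins
    counts = [0] * n_bins
    for sigma in sequences:
        n = len(sigma)
        for i, value in enumerate(sigma):
            relative_position = (i + 0.5) / n  # assumed convention: residue centre
            bin_index = min(int(relative_position * n_bins), n_bins - 1)
            counts[bin_index] += 1
            hydrophobic[bin_index] += int(value == 1)
    # Empty bins (possible when chains are shorter than n_bins) are reported as None.
    return [h / c if c else None for h, c in zip(hydrophobic, counts)]


# Two toy chains of different length, binned into halves.
print(binned_profile([[1, -1, 1, -1], [1, 1]], n_bins=2))
```

A flat profile, as reported for the enzymes, means that no correction for chain ends is needed before computing block fluctuations.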
In Fig. 3 we show the average ψ^(s) against block size s for the 173 enzymes. Also shown are the results obtained for five different subsets of these sequences (see Methods). We see that the results are similar in the different cases, and that ψ^(s) is smaller than for random sequences. Qualitatively, the behavior is similar to that found for the model sequences.
Similar deviations from randomness are expected in other quantities such as the number of hydrophobic/hydrophilic clumps along the chain. The number of clumps tends to be large when ψ^(s) is small (Irbäck et al., 1997).
The distribution of total hydrophobicity
Model sequences
We now turn to the distribution of the total hydrophobicity M. Table 2 shows h = (1 + ⟨M⟩_N/N)/2 and the normalized variance χ (see Eq. 5) for good HP sequences for N = 12, ..., 18. Also shown in this table are the two predictions χ_0 and χ_1 defined in Methods, and a prediction χ_2 that will be explained below. Note that h depends quite weakly on N. This implies that the fraction of hydrophobic amino acids, unlike the core to surface ratio of compact chains, does not increase with N. Of course, it would be interesting to see whether this trend persists for much larger N.
The observed χ is smaller than χ_0, which implies that the σ_i values are not both uncorrelated and uniformly distributed. Comparing to χ_1 shows that the major part of this difference is due to correlations rather than non-uniformity. The fact that χ < χ_1 means that the average c_ij (i ≠ j) is negative.
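The comparison between the observed variance and the two uncorrelated predictions can be sketched for an ensemble of equal-length sequences. As before, the normalization of the variance by N and the formulas for χ_0 and χ_1 are assumptions consistent with the text, and the function name and toy ensembles are hypothetical.

```python
def variance_measures(sequences):
    """Return chi, chi_0, chi_1 and the mean off-diagonal correlation for +/-1 sequences."""
    count, n = len(sequences), len(sequences[0])
    totals = [sum(seq) for seq in sequences]
    mean_m = sum(totals) / count
    chi = (sum(m * m for m in totals) / count - mean_m ** 2) / n
    profile = [sum(1 for seq in sequences if seq[i] == 1) / count for i in range(n)]
    h = (1 + mean_m / n) / 2
    chi_0 = 4 * h * (1 - h)                              # uncorrelated, uniform profile
    chi_1 = 4 * sum(p * (1 - p) for p in profile) / n    # uncorrelated, observed profile
    # sum over i != j of c_ij equals N * (chi - chi_1); there are N * (N - 1) ordered pairs.
    mean_c = (chi - chi_1) / (n - 1)
    return {"chi": chi, "chi_0": chi_0, "chi_1": chi_1, "mean_c": mean_c}


# All four length-2 sequences: no correlations, so chi = chi_1 = chi_0.
print(variance_measures([[1, 1], [1, -1], [-1, 1], [-1, -1]]))
# Perfectly anticorrelated pair: chi = 0 and the mean correlation is -1.
print(variance_measures([[1, -1], [-1, 1]]))
```

A value of `chi` below `chi_1` signals negative average correlation, which is the situation the text reports for good HP sequences.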
The two measurements h and χ are, of course, not enough to fully characterize the distribution of good sequences. To get an idea of how much information they provide, we may compare to the one-dimensional Ising distribution
[Equation 13]
The values of h and χ for good N = 18 sequences can be reproduced by choosing K_1 and K_2 with magnitudes of about 0.16 and 0.13, respectively. For these parameters it turns out that e^S ≈ 1.9 × 10^5, S being the entropy, which means that the effective number of sequences contained in P(σ) is considerably larger than the number of good N = 18 sequences, 6349.
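For readers who want to experiment with such a comparison, the entropy of a one-dimensional Ising distribution can be computed by direct enumeration for short chains. The functional form used here, P(σ) ∝ exp(K_1 Σ_i σ_i + K_2 Σ_i σ_i σ_{i+1}) on an open chain, is an assumption: the sign conventions and boundary conditions of Eq. 13 are not recoverable from this copy, so the sketch should not be expected to reproduce the quoted numbers without checking those choices. The function name is hypothetical.

```python
import math
from itertools import product


def ising_entropy(n, k1, k2):
    """Entropy S = -sum P ln P of an open 1D Ising chain (assumed form, see lead-in)."""
    log_weights = []
    for sigma in product((-1, 1), repeat=n):
        field_term = sum(sigma)
        bond_term = sum(sigma[i] * sigma[i + 1] for i in range(n - 1))
        log_weights.append(k1 * field_term + k2 * bond_term)
    log_z = math.log(sum(math.exp(w) for w in log_weights))
    # P = exp(w - log_z), hence -P ln P = -exp(w - log_z) * (w - log_z).
    return -sum(math.exp(w - log_z) * (w - log_z) for w in log_weights)


# With vanishing couplings every sequence is equally likely, so exp(S) = 2**n.
print(math.exp(ising_entropy(6, 0.0, 0.0)))
```

Enumeration costs 2^N terms, which is still manageable for N = 18; a transfer-matrix calculation would be the usual choice for longer chains.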
Enzymes
To study the N dependence of the total hydrophobicity M for the enzymes, we divide the data set into groups corresponding to different intervals in N. Fig. 4 shows the average M for these groups against N. We see that the N dependence is approximately linear. Although the uncertainties are difficult to estimate, it is interesting to note that the behavior is in perfect agreement with the model results.
We next evaluate the fluctuation measure in Eq. 9, using ⟨M⟩_N = N(2h̄ − 1) and h̄ = 0.29, as obtained from a fit to the data in Fig. 4. Table 3 shows this measure for all sequences and for the different subgroups described in Methods. We see that the value for all sequences is larger than predicted by Eq. 10, which contrasts sharply with the model results above. We also note that there seems to be a strong dependence on group. In particular, there appears to be a big difference between intra- and extracellular enzymes. However, it must be stressed that the uncertainties are large. Improved statistics are definitely needed in order to draw any firm conclusion about the different groups and possible deviations from the model results.
Comments
Our study of HP sequences has been focused on structure-independent properties. The question of how sequences that share the same (unique) native structure are related has recently been examined using similar models (Li et al., 1996; Bornberg-Bauer, 1997; Bornberg-Bauer and Chan, 1999). From these studies, a simple picture seems to emerge for structures that are highly designable. For high-N_r structures (N_r is the number of sequences that fold to the structure r), it has been found that the sequences tend to form a single cluster connected by one-point mutations, called a "neutral net" (Bornberg-Bauer, 1997), and that structure-preserving mutations tend to be largely independent (Li et al., 1996). The latter property was observed in a model with compact structures only. We checked that it holds in the present model too, which is illustrated in Fig. 5. From this figure it can be seen that the quantity e^S/N_r and the magnitude of the correlation measure in Eq. 11, as defined in Methods, indeed tend to be small for high N_r. Also indicated in this figure is whether or not the sequences form a neutral net, results first obtained by Bornberg-Bauer (1997).
The fact that structure-preserving mutations are largely independent for high N_r does not contradict our previous results. To verify this, we calculated χ from the known hydrophobicity profiles {h_i(r)} under the assumption that the σ_i values are independent for each structure. The value obtained this way, χ_2, can be found in Table 2 above, and is indeed a relatively good approximation to the observed χ.
Admittedly, the model used in this study is crude. In particular, Buchler and Goldstein (1999, 2000) have recently argued, based on a study of compact lattice chains, that the use of a two-letter alphabet leads to designability artifacts, which disappear with increasing alphabet size. Let us stress, therefore, that the analyses discussed in this paper can be tested on real proteins in a direct manner. Let us also comment on the stability of our results. First, we note that the dependence on chain length N is weak. This was explicitly shown for χ, and is true for ψ^(s) too, although our discussion focused on one system size in this case. Second, we note that our results are in nice agreement with those obtained earlier using a simple hydrophobic/polar off-lattice model (Irbäck et al., 1997). To further explore the model dependence of our results, we also did calculations for a "solvation-like" two-letter model discussed by Ejtehadi et al. (1998a,b) and by Buchler and Goldstein (1999, 2000). This model differs from the HP model in that the interaction strength is additive (the HH contact energy is twice the HP contact energy, and the PP contact energy is zero), which means that the total energy can be expressed as a simple sum of monomer contributions. Buchler and Goldstein argued that HP-like models, unlike pair-contact models with larger alphabets, tend to have solvation-like designability properties. It is therefore interesting to note that when analyzing sequences with unique ground states in the solvation-like model defined above, we obtained results qualitatively different from those for the HP model. More precisely, it turns out that the block fluctuations are significantly larger, close to random, for the solvation-like model.
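The statement that an additive interaction reduces the energy to a sum of monomer contributions can be checked in two lines. The notation below is introduced only for this sketch: ε denotes the HP contact energy (its sign is not recoverable from this copy), a_i ∈ {H, P} is the type of monomer i, and n_i is the number of nearest-neighbor lattice contacts of monomer i that are not chain bonds.

```latex
\begin{align}
\epsilon(a,b) &= u_a + u_b, \qquad u_{\mathrm{H}} = \epsilon, \quad u_{\mathrm{P}} = 0
  \;\;\Longrightarrow\;\;
  \epsilon(\mathrm{H},\mathrm{H}) = 2\epsilon,\;
  \epsilon(\mathrm{H},\mathrm{P}) = \epsilon,\;
  \epsilon(\mathrm{P},\mathrm{P}) = 0, \\
E &= \sum_{\langle ij \rangle} \epsilon(a_i, a_j)
   = \sum_{i} n_i\, u_{a_i}
   = \epsilon \sum_{i:\, a_i = \mathrm{H}} n_i .
\end{align}
```

Each contact contributes u for both of its partners, so summing over contacts is the same as summing n_i u over monomers. The energy then depends only on how many contacts each H monomer makes, not on what it is in contact with, which is why such a model is described as solvation-like.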
SUMMARY AND DISCUSSION
Hydrophobicity plays a key role in the formation of protein
structures, which makes it of utmost interest to understand the statistical distribution of hydrophobicity along the chains. In this
paper we have analyzed hydrophobic/polar sequences in the two-dimensional HP lattice model. Whenever statistically feasible, the
analogous calculations were performed for a set of real enzymes, too.
Our main findings are as follows.
1. Both model sequences and enzymes show mean-square block fluctuations ψ^(s) that are smaller than for random sequences. In particular, this implies that the enzymes display the same behavior that had been found previously for general proteins with typical total hydrophobicities (Irbäck et al., 1996).
2. The average total hydrophobicity M varies approximately linearly with chain length N over the range of N studied, both for model sequences and enzymes. This implies, contrary to what one naively might expect, that the fraction of hydrophobic amino acids does not grow with increasing N. The fluctuations in M are difficult to study for the enzymes, due to statistical uncertainties. For the model sequences it turns out that the normalized variance is significantly smaller than for random sequences.
We also divided the enzymes into different groups according to their structural content, and to whether they reside in an intra- or extracellular environment. The fluctuations in total hydrophobicity appeared to depend on group. However, whether this dependence is significant or not is difficult to say, due to statistical uncertainties. The mean-square block fluctuations are statistically much easier to measure, and show only a weak dependence on group. The conclusion that ψ^(s) is suppressed is, in particular, the same for all the different groups.
A full explanation of the suppression of ψ^(s) is probably hard to give. Let us note, however, that long hydrophobic or hydrophilic stretches in the amino acid sequence are likely to lead to degenerate structures, and the suppression of sequences containing such stretches should indeed tend to make ψ^(s) smaller.
The nonrandomness of the block fluctuations provides an indirect confirmation of the important role played by hydrophobicity in the formation of protein structures. Furthermore, it is tempting to take the similarity with the model results as an indication that the ability to form a stable structure represents a significant selective advantage in the evolution of proteins. It would be interesting to check that the behavior remains the same in more realistic models.
ACKNOWLEDGMENTS
This work was supported by the Swedish Foundation for Strategic Research.
FOOTNOTES
Received for publication 18 January 2000 and in final form 23 May 2000.
Address reprint requests to Dr. Anders Irbäck, Lund University, Department of Theoretical Physics, Complex Systems Division, Sölvegatan 14A, S-22362 Lund, Sweden. Tel.: 46-46-222-3493; Fax: 46-46-222-9686; E-mail: irback{at}thep.lu.se.
REFERENCES
Šali, A., E. Shakhnovich, and M. Karplus. 1994. Kinetics of protein folding: a lattice model study of the requirements for folding to the native state. J. Mol. Biol. 235:1614-1636.
© 2000 by the Biophysical Society 0006-3495/00/11/2252/07 $2.00