National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan
Correspondence: Address reprint requests to Shinya Honda, National Institute of Advanced Industrial Science and Technology (AIST), Central 6, Tsukuba, 305-8566 Japan. E-mail: s.honda{at}aist.go.jp.
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
|
|
|---|
10^90) different folds, assuming that the degree of conformational freedom of the backbone per residue is 8 (1).
The first taxonomic investigations into the diversity of protein structures were carried out by Murzin and colleagues, who constructed the SCOP database (2), followed by Orengo and colleagues, who developed the CATH database (3). Several second-generation classifications were generated by computational programs such as DALI (4), VAST (5), and CE (6). Recently, Kim and co-workers showed a three-dimensional map of the known folds projected onto the structure space of the protein universe (7). In all cases, the proteins were divided into domains and the domain structures were classified. Using these taxonomic classifications we can evaluate structural relationships between proteins deposited in the Protein Data Bank (PDB). Nevertheless, these classifications do not provide sufficient information to determine the genuine distribution of protein structures, because the number of available domains is too small in comparison with the structure space of the protein universe. In fact, the number of folds is not fixed but depends on the classification method (8). Consequently, the current domain-based classification does not indicate how many conformations within the vast structure space of polypeptides correspond to feasible protein structures (9).
The objectives of this study are to classify the structures of protein segments, which are smaller than domains, and to analyze their distribution. The smaller size of segments enables us to increase the sampling density, which will increase the accuracy of the statistical analysis and provide a robust distribution. Although a domain has been classically regarded as a basic unit of proteins (10), several studies have provided an alternative viewpoint describing proteins as hierarchical objects consisting of substructures smaller than domains (11–19). Accordingly, the classification of segment structures would provide insights into the molecular architecture of these hierarchical objects. Several principles for defining substructures inside a domain have been proposed so far, but none of them is universally accepted. Thus, no a priori definitions of substructures were adopted in this study. A series of L-residue-long segments was generated by sliding a clipping window sequentially along target protein sequences (L-tuple analysis). Each set of local structures consisting of L consecutive amino acids was classified by a single-pass clustering method (20), which is an unsupervised nonhierarchical clustering algorithm. Structural dissimilarity between segments was defined on the basis of backbone dihedral angles. Two data sets containing target proteins were used. Each was a subset of the PDB from which redundant structures and low-resolution data were removed. Both data sets were analyzed independently and compared to obtain reliable statistical results.
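The L-tuple windowing and single-pass clustering described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dissimilarity used here (the largest per-angle deviation between two segments), the circular-mean update of the cluster center, and all function names are assumptions made only for illustration, since the original definition of the dissimilarity is not reproduced in this text.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def dissimilarity(seg_a, seg_b):
    """ASSUMED metric: largest per-angle deviation between two segments."""
    return max(angle_diff(x, y) for x, y in zip(seg_a, seg_b))

def circular_mean(angles):
    """Mean direction of a collection of angles given in degrees."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c))

def l_tuples(dihedrals, L):
    """Slide a window of L residues along a chain; flatten each window to a tuple of angles."""
    return [tuple(a for res in dihedrals[i:i + L] for a in res)
            for i in range(len(dihedrals) - L + 1)]

def single_pass_cluster(segments, d_th):
    """Visit each segment once: join the nearest center within d_th, else open a new cluster."""
    centers, members = [], []
    for seg in segments:
        best, best_d = None, None
        for i, center in enumerate(centers):
            d = dissimilarity(seg, center)
            if d <= d_th and (best_d is None or d < best_d):
                best, best_d = i, d
        if best is None:
            centers.append(list(seg))
            members.append([seg])
        else:
            members[best].append(seg)
            # ASSUMED update rule: per-angle circular mean of all members
            centers[best] = [circular_mean(col) for col in zip(*members[best])]
    return centers, members

# Toy chain: six helix-like residues followed by six strand-like residues (phi, psi)
toy_chain = [(-64.0, -41.0)] * 6 + [(-116.0, 140.0)] * 6
windows = l_tuples(toy_chain, 3)
centers, members = single_pass_cluster(windows, 30.0)
cluster_sizes = sorted(len(m) for m in members)
```

In this toy case the pure helix and pure strand windows fall into two large clusters, while the two windows spanning the junction each form a singleton, which mirrors the dense-sparse picture discussed later.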
The results of these exhaustive analyses, reported here, are that the structures of protein segments are located only in tiny regions of the protein universe and are distributed in a dense-sparse manner. Also, their diversity follows a power-law distribution. This indicates that proteins are organized according to a certain mathematical regularity using a limited number of local structures. Moreover, our analysis of the clusters of classified segments revealed that the limitation of the number of local structures is not only due to the conformational preference of single residues. These results are intriguing because they closely resemble those found in the structure of natural languages.
| METHODS |
|---|
|
|
|---|
![]() | (1) |
A normalized frequency of occurrence fcls can be obtained empirically by the equation fcls(r) = Mr/M, where M is the total number of segments, r is the rank of each cluster, and Mr is the number of segments assigned to the cluster of rank r. The rank r corresponds to the decreasing order of fcls. To maintain statistical confidence, data from the clusters whose fcls were < 2.0 × 10^-5 were discarded, and the "effective" number of total clusters Ncls was recounted. (Note that Ncls is smaller than the raw number N.) As described in Results, fcls shows a fine power-law distribution except for the region around r = 1. To characterize the distribution, we also calculated an "estimated" frequency fest by fitting fcls to the Mandelbrot formula (generalized Zipf's law) with some modifications. The definition of fest is as follows.
![]() | (2) |
In the fitting calculation, three parameters, a, b, and β in Eq. 2, were determined numerically by minimizing Eq. 3.
![]() | (3) |
The equation can be interpreted on the assumption that the expected error in f(r) should be proportional to the square root of f(r). The use of the equation is effective for obtaining a good fitting result in a double-logarithmic scale plot. Details as to the objective function in fitting calculations are described in Supplementary Materials. To compensate for the underestimation of empirical frequency of the rarest substructures, which arises from the "zero frequency problem," we introduced another parameter, the estimated number of total clusters Nest and computed it by using the following equation,
![]() | (4) |
![]() | (5) |
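The fitting step described for Eqs. 2 and 3 can be sketched as follows. Because the equations themselves are not reproduced here, this sketch assumes the standard Zipf-Mandelbrot form f(r) = a (r + b)^(-β) without the authors' modifications, and a chi-square-like objective in which the expected error is proportional to the square root of the observed frequency. The grid search, the closed-form solution for a, and all names are illustrative assumptions, not the published procedure.

```python
def mandelbrot(r, a, b, beta):
    """Standard Zipf-Mandelbrot form (ASSUMED; the paper uses a modified version)."""
    return a * (r + b) ** (-beta)

def fit_mandelbrot(ranks, f_cls, b_grid, beta_grid):
    """Grid search over b and beta; a is solved in closed form for each pair.

    ASSUMED objective: sum_r (f_cls(r) - f_est(r))^2 / f_cls(r),
    i.e. squared residuals weighted by an error proportional to sqrt(f).
    """
    best = None
    for b in b_grid:
        for beta in beta_grid:
            g = [(r + b) ** (-beta) for r in ranks]
            # Setting d(cost)/da = 0 gives a = sum(g) / sum(g^2 / f)
            a = sum(g) / sum(gi * gi / fi for gi, fi in zip(g, f_cls))
            cost = sum((fi - a * gi) ** 2 / fi for gi, fi in zip(g, f_cls))
            if best is None or cost < best[0]:
                best = (cost, a, b, beta)
    cost, a, b, beta = best
    return a, b, beta, cost

# Synthetic check: frequencies generated from known parameters
ranks = list(range(1, 201))
f_true = [mandelbrot(r, 0.1, 2.0, 1.0) for r in ranks]
b_grid = [k / 2 for k in range(0, 11)]            # 0.0, 0.5, ..., 5.0
beta_grid = [k / 100 for k in range(50, 151, 5)]  # 0.50, 0.55, ..., 1.50
a_fit, b_fit, beta_fit, cost_fit = fit_mandelbrot(ranks, f_true, b_grid, beta_grid)
```

A grid search is used only to keep the sketch dependency-free; a numerical optimizer would be the more usual choice for real data.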
Hydrogen bonds between backbones were assigned using DSSP (24). Only the bonds whose stabilization energy exceeded 1 kcal/mol were taken into account. When a segment did not contain both H-donor and H-acceptor atoms, the number of hydrogen bonds was treated as 0.5. The backbone root-mean-square (RMS) deviation (Å) represents the deviation in Cartesian coordinates of three atoms (N, Cα, and C) in assigned segments from their centroidal positions (cluster center). The radius of gyration RG (Å) is calculated with the Cα coordinates of the cluster center. To detect the amino acid preference in each cluster, the Kullback-Leibler relative entropy I(p(r)|p0) is defined by the following formula (25):
![]() | (6) |
I allows various combinations of amino acid sequences. A pseudocluster consisting of Mr segments that have no structural similarity among them was also examined as a control group to assess the statistical significance of the physicochemical properties of "real" clusters. Pseudoclusters were generated by randomly choosing a segment Mr times from the same set of segments that was used in the clustering calculation.
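As a rough illustration of the sequence-specificity measure and its control, the sketch below computes a position-summed Kullback-Leibler relative entropy for a cluster and for a pseudocluster drawn at random from the same pool. The exact form of Eq. 6 is not reproduced in this text, so the choices made here (summing over positions, base-2 logarithm, a single background composition p0 taken from all segments) are assumptions, as are the function names and the toy sequences.

```python
import math
import random
from collections import Counter

def background(sequences):
    """Overall amino acid composition p0 of all segments (ASSUMED background)."""
    counts = Counter(aa for seq in sequences for aa in seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

def relative_entropy(cluster, p0):
    """Sum over positions of sum_a p(a) * log2(p(a) / p0(a)); zero-count terms contribute 0."""
    total = 0.0
    for pos in range(len(cluster[0])):
        counts = Counter(seq[pos] for seq in cluster)
        for aa, n in counts.items():
            p = n / len(cluster)
            total += p * math.log2(p / p0[aa])
    return total

def pseudocluster(all_segments, m_r, rng):
    """Choose a segment at random m_r times (with replacement) from the full pool."""
    return [rng.choice(all_segments) for _ in range(m_r)]

# Toy pool: two sequence families; the "real" cluster holds only one of them
all_segments = ["AGA"] * 50 + ["VIV"] * 50
real_cluster = ["AGA"] * 50
p0 = background(all_segments)
rng = random.Random(0)  # fixed seed so the control is reproducible
control = pseudocluster(all_segments, len(real_cluster), rng)
i_real = relative_entropy(real_cluster, p0)
i_control = relative_entropy(control, p0)
```

In this toy pool the real cluster is fully sequence-specific, so its relative entropy is an upper bound for any pseudocluster of the same size.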
| RESULTS |
|---|
|
|
|---|
|
The only notable deviation between the two curves is seen in the lower right region of Fig. 1 a. This region corresponds to the rarest substructures, which appear only once or twice among thousands of segments, so the empirical frequencies of these substructures are sensitive to the number of segments analyzed (the so-called "zero-frequency problem" (26)). In fact, despite considerable agreement between the two curves, the total number of clusters N (maximum value of r) differed almost threefold (Table S1 in Supplementary Materials). To eliminate this unessential complication, we excluded clusters with fcls values less than 2.0 × 10^-5 and introduced another parameter, the "effective" number of total clusters Ncls. This cut-off handling resulted in Ncls values that were comparable for the two data sets (Table S1 in Supplementary Materials).
To investigate the influence of the chain length on the distribution functions of local structures, the same type of clustering analysis was conducted after assigning various values (1–31) to L. The resultant curves are shown in Fig. 1, b and c, and the parameters are summarized in Table 1. Although some differences are seen between Fig. 1, b and c, for short segments (L < 7), the distribution curves for long segments (L > 7) are quite similar. Likewise, the relative differences in Ncls between the two data sets are within 15% for long segments (Table 1). Accordingly, the effect of data set size is less pronounced in the case of long segments.
|
|
Degeneracy of the structural diversity
If z denotes the degree of "intrinsic" conformational freedom per residue, the total number of possible backbone conformations for a fragment consisting of L residues should be z^L. If one assigns a value of 8 to z (1), the total number of conformations available to a nine-residue fragment would become 1.3 × 10^8. However, the estimated number of clusters for a nine-residue segment was only ~10^4 (Table 1), indicating that the actual diversity of the local structure is quite limited. Here we introduce the degree of "effective" conformational freedom per residue, defined as z divided by a "diminishing factor" (29) that expresses the degeneracy of the actual diversity in the protein universe. To determine the amount of degeneracy, (log Nest)/L was plotted against L in Fig. 3, because it is in principle equivalent to the logarithm of the effective freedom. The resultant values of the effective freedom were not influenced appreciably by the sizes of the data sets (Fig. 3, a and b). In contrast, these values were clearly affected by L, showing that the degeneracy depends on the segment length. The plot of the effective freedom versus L appears as an asymptotic curve. The effective freedom decreases gradually and monotonically with increasing L. It appears to reach a constant level at L = 31, though the absolute value depends on the threshold parameter Dth: 1.6–1.7 and 1.5–1.6 at Dth = 30° and 40°, respectively. This result enables us to predict the number of protein conformations. Heuristic regression analysis fitting the data to a hyperbolic function showed that the effective freedom extrapolates to 1.2–1.5 at L = 100 (the typical domain size of proteins). In addition, Fig. 3 suggests that the structural diversity of middle-length segments has shrunk to a considerable extent. According to thermodynamic analysis, the degree of backbone freedom for an unfolded protein is ~8 per residue (1), which is clearly larger than the effective freedom of middle-length segments. Consequently, the data in Fig. 3 can be understood as follows: the decreasing curve represents the changes in structural degeneracy in passing from "the polypeptide world" (effective freedom of about 8) to "the protein world" (effective freedom of about 1.3). From this viewpoint, it is conceivable that the structural diversity of 10- to 20-residue segments is already approaching that of the protein world.
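The arithmetic behind these figures can be checked with a short calculation. The sketch assumes that the effective conformational freedom per residue is simply the L-th root of the estimated number of clusters, and it uses the rounded value of 10^4 clusters for nine-residue segments; the variable names are illustrative.

```python
z = 8          # intrinsic conformational freedom per residue
L = 9          # segment length

# All conceivable backbone conformations of a nine-residue fragment
total_conformations = z ** L

# Rounded estimate of observed clusters at L = 9 (ASSUMED to be exactly 1e4 here)
n_est = 1e4

# Effective freedom per residue: the value x for which x**L equals n_est
effective_freedom = n_est ** (1.0 / L)

# Range implied for a 100-residue domain by the extrapolated values 1.2 and 1.5
domain_low = 1.2 ** 100
domain_high = 1.5 ** 100
unfolded = 8.0 ** 100

print(f"{total_conformations:.2e} conceivable vs {n_est:.0e} observed at L = {L}")
print(f"effective freedom per residue at L = {L}: {effective_freedom:.2f}")
print(f"L = 100: {domain_low:.1e} to {domain_high:.1e} of {unfolded:.1e}")
```

The effective freedom obtained this way is roughly 2.8 per residue at L = 9, far below the intrinsic value of 8.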
|
|
100°.) In contrast, this number was relatively stable between 30° and 40°. This indicates that the dependence of the clustering results on the threshold parameter is minimal around Dth = 30–40°. In Fig. 5 b we can see that the shapes of the distribution of 13-residue segments are quite similar to each other when analyzed with Dth = 30° and 40°.
|
Summary of classified structural motifs
An in-depth analysis of each cluster is beyond the scope of this article. Here we describe only a few clusters that were obtained under a typical condition (L = 9, Dth = 30°, Culled PDB) to illustrate that our clustering method succeeded in extracting distinct structural motifs, including known canonical ones (Fig. S1 in Supplementary Materials). The structure of the most frequently seen cluster at r = 1 is a regular α-helix having almost identical dihedral angles at every position (φ = −64 ± 2.9°, ψ = −41 ± 3.1°, ω = 180 ± 0.7°). The backbone RMS deviation between assigned segments is 0.36 Å, which is the smallest deviation among the top 1000 clusters. The statistical analysis of the sequences showed that the propensity of amino acids does not exceed 1.7 for any position. This remarkable feature, i.e., minimum structural distortion with low sequence specificity, might be responsible for the large deviation of the empirical frequency fcls from the estimated frequency fest at r = 1, as shown in Fig. 2. The fully extended β-strand (φ = −116 ± 6.0°, ψ = 140 ± 5.3°, ω = 178 ± 1.2°) appears in the cluster at r = 3. It is reasonable that overall propensities of branched hydrophobic amino acids such as Val and Ile are relatively high in this cluster. The clusters at r = 2 and r = 5 correspond to helix capping motifs. The former is the type IIb N-cap motif (31), having the consensus sequence of [Asp, Asn, Ser, Thr]-Pro-[Glu, Asp]-[Gln, Glu]. The latter is the type IV C-cap motif (31) (i.e., Schellman motif), having the consensus sequence of [Glu, Lys, Ala, Arg]-[Leu, Arg, Ala, His]-Gly. Among the many types of hairpin structures, the most frequent one appears in the cluster at r = 79, whose structure is the β-hairpin consisting of two β-strands and a two-residue loop. The loop corresponds to a type I' β-turn, which is an abnormal type in the β-turn statistics. This preference for an abnormal β-turn inside the most frequently found β-hairpin has already been revealed by Sibanda and Thornton in their statistical analysis (32). Accordingly, all characteristics of the clusters presented here are consistent with previous investigations of known structural motifs, indicating that our clustering method is effective in analyzing structural motifs and is able to extract numerous motifs simultaneously and exhaustively.
Popularity of local structures
What determines "the popularity" of local structures, their relative occurrence in the protein universe? To address this question, we analyzed four physicochemical properties of each cluster, namely, the number of hydrogen bonds, the structural dispersion in Cartesian coordinates, the compactness, and the sequence specificity, and compared these properties to those of pseudoclusters. Fig. 6 a shows that the number of backbone-backbone hydrogen bonds per segment, taking account of both intra- and intersegment hydrogen bonds, correlates with the rank of clusters. This indicates that the more hydrogen bonds the local structure contains, the more frequently it occurs. In contrast, neither the number of intrasegment hydrogen bonds nor that of intersegment hydrogen bonds shows a clear relationship with the rank (data not shown). Only the sum of hydrogen bonds has an obvious correlation with rank. These results imply that the distinction between intra- and intersegment bonds does not affect the cluster ranking, and further suggest that the individual foldability or "autonomy" of a local structure would not be a criterion of its popularity. Next, the backbone RMS deviations of assigned segments were computed as an indication of the structural dispersion within each cluster (Fig. 6 b). The RMS deviations ranged from 0.4 to 2.5 Å and tended to increase with the rank. (The average, ~1.2 Å, corresponds to the level of "good" quality in NMR structure determinations.) This correlation can be explained by postulating that the popular local structures have little structural fluctuation and represent deep local minima on the potential energy surface. Considering the results in Fig. 6, a and b, hydrogen bonds, regardless of whether they are intra- or intersegmental, are likely to reduce fluctuations in the local structure. In contrast to the above two properties, the radius of gyration of a segment RG, being characteristic of the compactness of the local structure, does not show a simple relationship to the rank (Fig. 6 c). The values of RG for the high-ranked clusters are displayed in a binary fashion around either 4.6 or 8.2 Å, whereas the RG values for low-ranked clusters are intermediate between these two values. Because the two values correspond to the sizes of the canonical α-helix and β-strand structures, respectively, this demonstrates that neither the type of secondary structure nor the compactness of the local structure affects the popularity. In contrast, the regular secondary structures, both α-helix and β-strand, occur more frequently than complicated structures, e.g., combinations of α-helices and β-strands, as well as others. Finally, the sequence specificity of local structures was evaluated by computing the Kullback-Leibler relative entropy I (25), based on the propensities of position-specific amino acids. Compared with pseudoclusters, all clusters show larger values of I, which indicates that the sequence specificity of each local structure is evidently higher than that of the same number of segments selected randomly (Fig. 6 d). In contrast, the relation between I and r is not simple. One thing we can see in Fig. 6 d is that the values of I for low-ranked clusters are considerably dispersed, whereas the values for the high-ranked clusters are rather close to the level of pseudoclusters. This result suggests that the degree of sequence specificity may also affect the popularity of local structures, and that a local structure that does not require particular amino acids tends to occur more frequently. In other words, local structures with high "designability" (33) may be preferred in the protein universe.
|
| DISCUSSION |
|---|
|
|
|---|
The structure distribution for long segments was formulated well by the modified Mandelbrot formula (Eq. 2 and Fig. 2). The exponent of the power law was approximately unity. For instance, β = 0.87–0.89 and 0.88–0.89 at L = 15 and 21, respectively. These values are almost the same as the comparable parameter obtained from theoretical folding simulations using a two-dimensional HP lattice model, in which β-values were 0.94 (L = 16) and 0.86 (L = 18) (39). Interestingly, a similar convergence was reported in linguistics, in which the parameter β of the distribution of English vocabulary decreases from 1.6 to 1.15 as a child grows, and reaches 1.0 in the books of a professional novelist (40). Similar distributions appear in other natural languages, such as French and Japanese. Conventionally, the structure of a protein molecule has been likened to a grammar (41). The results of this study imply that this resemblance is not just a metaphor and that a protein structure and a language structure probably share common rules and have a quantitative correlation.
The number of conformations of protein segments converged from 2.5^L–2.9^L at L = 9 to 1.5^L–1.7^L at L = 31 as the segment length increased. When the data in Fig. 3 were heuristically fitted to a simple hyperbolic function, although this function has no theoretical ground to characterize the decay, the extrapolated values were between 1.2^L and 1.5^L at L = 100. We therefore estimate that the structural diversity of proteins would reside within the range between 1.2^L and 1.7^L, depending on their chain length. Interestingly, Dill has reported similar values, 1.4^L or 1.7^L, as the upper limit of the number of conformations of globular states by using a three-dimensional lattice model (29). Recently, Kim and coworkers have also obtained an equivalent value, 1.6^L, from the analysis of short segments (L = 2–7) by the multidimensional scaling algorithm (42). It should be noted that these estimated numbers are comparable despite the difference in analytical method, and that all numbers are smaller than 2^L. This indicates that the number of protein structures is evidently smaller than the number of random combinations of two sets of dihedral angles corresponding to two typical secondary structures, i.e., α-helix and β-strand. On the other hand, the number of conformations of an unfolded protein has been estimated (1) to be ~8^L. Since it is reasonable to regard the structural diversity of an unfolded protein as equivalent to that of a random polypeptide, the structural degeneracy of a protein, defined as the ratio of the structure space of existent proteins to the vast protein universe, can be estimated to be (1/7)^L–(1/5)^L.
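The degeneracy estimate quoted above follows from dividing the bounds on protein structural diversity by the diversity of an unfolded chain. A worked version, assuming the bounds 1.2^L and 1.7^L and the unfolded-state value 8^L taken from the text, is:

```latex
\[
\frac{1.2^{L}}{8^{L}} = \left(\frac{1.2}{8}\right)^{L} = 0.15^{L} \approx \left(\frac{1}{6.7}\right)^{L},
\qquad
\frac{1.7^{L}}{8^{L}} = \left(\frac{1.7}{8}\right)^{L} \approx 0.21^{L} \approx \left(\frac{1}{4.7}\right)^{L}.
\]
```

Rounding the denominators to whole numbers gives the range of (1/7)^L to (1/5)^L stated in the text.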
The structural diversity appears to exhibit a boundary at 10–20 residues. With increasing L, both the effective degree of conformational freedom per residue and the structure entropy per residue, Sest/L, decrease gradually and then reach an almost constant level at L = 10–20 (Figs. 3 and 4), suggesting that 10- to 20-residue segments are already proteinlike and that their nature differs considerably from that of shorter segments. This can be related to several other types of research in protein science. In a fragment assembly method (43), which is one of the most accurate methods of structural prediction (44), nine-residue segments are empirically known to be suitable for producing excellent predictions. Also, in the fields of molecular evolution and of protein folding, a protein molecule is often considered as a hierarchical object composed of units smaller than the typical size of a domain (11–19). Our results may help to explain the basis for the minimum chain length of local structural units of proteins.
As for longer segments, a double logarithmic plot of the number of clusters versus the cluster size determined by the number of assigned segments appears linear (Fig. 7), showing that this relationship also follows a power law. The exponent of the power law was 2.2 for 21-residue segments. Similar relationships have been discovered recently between the numbers of folds and families (45), between the numbers of families and domains (45), and between the numbers of clusters of similar domains and domains per cluster (30). In those analyses, the corresponding exponents were reported as 2.5, 3.0 (or 1.9), and 2.5, respectively. These examples of power-law behavior between upper and lower categories indicate that there are recursive relationships in a protein molecule in the progression of fold-family-domain-segment-sequence. Furthermore, this hierarchical self-similarity suggests that a protein holds fractal characteristics in its structure. The fractal characteristics of protein structure have been suggested by early investigations, such as the temperature dependence of ESR spectra (46) and differential-geometric analysis (47). As to the reason for the fractal characteristics, Allen et al. proposed that it is associated with the requirement that these linear (one-dimensional) polymers having no branches should fold into a three-dimensional compact structure (46). The fractal characteristics may underlie not only the hierarchical organization of a protein molecule but also the diversity of protein structures.
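The exponent of such a power law can be estimated from the slope of the double logarithmic plot. The sketch below uses an ordinary least-squares line through (log size, log count) points; this simple estimator, the function name, and the synthetic data are assumptions for illustration and are not necessarily the procedure used for Fig. 7.

```python
import math

def fit_power_law(sizes, counts):
    """Least-squares line on log10-log10 axes; returns (exponent, prefactor)
    for the model count = prefactor * size ** (-exponent)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -slope, 10 ** intercept

# Synthetic check: number of clusters of each size drawn from an exact power law
sizes = list(range(1, 51))
counts = [1000.0 * s ** -2.2 for s in sizes]
exponent, prefactor = fit_power_law(sizes, counts)
```

With real cluster-size histograms the sparse tail is noisy, so binning or a maximum-likelihood estimator may give a more stable exponent than a plain log-log regression.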
|
The dependence of M1 on Dth showed a plateau, like a landing on a staircase, which is not found in the same analysis for pseudosegments (Fig. 5 a). This implies that the distribution of real segments is not like that of a random graph, and that local structures are distributed in a dense-sparse manner in the protein universe. Thus, the distribution of existent proteins in the protein universe should resemble that of galaxies in the real universe. This galaxy model allows protein structures to be interpreted as simple combinations of known local structures. The sparse regions between galaxies correspond to structures that are inappropriate for a protein. Consequently, by eliminating the sparse regions we can effectively reduce the number of candidate structures that must be considered without losing the actual diversity of proteins.
Two data sets were analyzed independently to validate statistical reliability in this study. The resulting distribution of local structures hardly changed between the two data sets. This fact demonstrates that the representative proteins listed in the two data sets were selected from the PDB without bias, and suggests that these data sets cover most of the local structures that can exist in a protein molecule. Accordingly, though it is expected that many proteins having a novel fold will be discovered by structural genomics projects, we expect that their structures can mostly be expressed by combinations of the local structures clarified here. Moreover, we believe that the distribution of local structures provided in this study does not differ significantly from the genuine distribution of the local structures of all natural proteins, including ones whose structures are still unknown.
Recently, Kim and co-workers have shown a three-dimensional map of the protein-fold space that helps us to understand a global feature of the protein structure universe (7). In this study, we analyzed the diversity of proteins and their distributions by focusing on the local structures of segments (L = 1–31). Consequently, we conclude that the local structures of proteins are distributed according to a power law and are localized in the protein universe. Very recently, Higo and co-workers reported on the conformational distribution of short segments through a different approach, a principal-component analysis using intrasegment Cα-Cα atomic distances, and discussed several structural motifs, including novel ones (48). Also, Kim and co-workers carried out a multidimensional scaling analysis of short segments (L = 2–7) using two dihedral angles, φ and ψ, and concluded that there is a dramatic reduction of conformational space by projecting an intrinsically multidimensional space of the protein universe on a three-dimensional map (42). Their conclusion is conceptually consistent with the galaxy model of this study. We think the next question is how the distribution of local structures quantitatively correlates with the distribution of local sequences. As one application, we recently succeeded in designing a small folded peptide consisting of only 10 amino acids according to an original strategy that was developed based on knowledge about the uneven distribution of local structures (49), which has been fully described in this study. We believe that a deep understanding of the correlation between the diversities of structure and sequence will encourage the advance of future studies on structure prediction and molecular evolution as well as protein design.
| SUPPLEMENTARY MATERIAL |
|---|
|
|
|---|
Submitted on October 26, 2005; accepted for publication May 9, 2006.
| REFERENCES |
|---|
|
|
|---|
2. Murzin, A. G., S. E. Brenner, T. Hubbard, and C. Chothia. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247:536–540.
3. Orengo, C. A., A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton. 1997. CATH: a hierarchic classification of protein domain structures. Structure. 5:1093–1108.
4. Holm, L., and C. Sander. 1993. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233:123–138.
5. Madej, T., J. F. Gibrat, and S. H. Bryant. 1995. Threading a database of protein cores. Proteins. 23:356–369.
6. Shindyalov, I. N., and P. E. Bourne. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11:739–747.
7. Hou, J., G. E. Sims, C. Zhang, and S. H. Kim. 2003. A global representation of the protein fold space. Proc. Natl. Acad. Sci. USA. 100:2386–2390.
8. Liu, X., K. Fan, and W. Wang. 2004. The number of protein folds and their distribution over families in nature. Proteins. 54:491–499.
9. Taylor, W. R. 2002. A periodic table for protein structures. Nature. 416:657–660.
10. Jaenicke, R. 1999. Stability and folding of domain proteins. Prog. Biophys. Mol. Biol. 71:155–241.
11. Gilbert, W. 1978. Why genes in pieces? Nature. 271:501.
12. Blake, C. C. 1978. Do genes-in-pieces imply proteins-in-pieces? Nature. 273:267.
13. Go, M. 1981. Correlation of DNA exonic regions with protein structural units in haemoglobin. Nature. 291:90–92.
14. Seidel, H. M., D. L. Pompliano, and J. R. Knowles. 1992. Exons as microgenes? Science. 257:1489–1490.
15. Karplus, M., and D. L. Weaver. 1976. Protein-folding dynamics. Nature. 260:404–406.
16. Baldwin, R. L., and G. D. Rose. 1999. Is protein folding hierarchic? I. Local structure and peptide folding. Trends Biochem. Sci. 24:26–33.
17. Iwakura, M., T. Nakamura, C. Yamane, and K. Maki. 2000. Systematic circular permutation of an entire protein reveals essential folding elements. Nat. Struct. Biol. 7:580–585.
18. Lupas, A. N., C. P. Ponting, and R. B. Russell. 2001. On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J. Struct. Biol. 134:191–203.
19. Rost, B. 2002. Did evolution leap to create the protein universe? Curr. Opin. Struct. Biol. 12:409–416.
20. Richards, J. A., and X. Jia. 1999. Remote sensing digital image analysis. Springer-Verlag, New York.
21. Hobohm, U., M. Scharf, R. Schneider, and C. Sander. 1992. Selection of representative protein data sets. Protein Sci. 1:409–417.
22. Wang, G., and R. L. Dunbrack, Jr. 2003. PISCES: a protein sequence culling server. Bioinformatics. 19:1589–1591.
23. Shannon, C. E. 1951. Prediction and entropy of printed English. Bell Syst. Tech. J. 30:51–64.
24. Kabsch, W., and C. Sander. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 22:2577–2637.
25. Kullback, S., and R. A. Leibler. 1951. On information and sufficiency. Ann. Math. Stat. 22:79–86.
26. Witten, I. H., and T. C. Bell. 1991. The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Trans. Inf. Theory. 37:1085–1094.
27. Zipf, G. K. 1949. Human behavior and the principle of least effort. Addison-Wesley, Cambridge, MA.
28. Mandelbrot, B. 1953. An information theory of the statistical structure of language. In Communication Theory. W. Jackson, editor. Academic Press, New York. 486–502.
29. Dill, K. A. 1985. Theory for the folding and stability of globular proteins. Biochemistry. 24:1501–1509.
30. Dokholyan, N. V., B. Shakhnovich, and E. I. Shakhnovich. 2002. Expanding protein universe and its origin from the biological Big Bang. Proc. Natl. Acad. Sci. USA. 99:14132–14136.
31. Aurora, R., and G. D. Rose. 1998. Helix capping. Protein Sci. 7:21–38.
32. Sibanda, B. L., and J. M. Thornton. 1985. Beta-hairpin families in globular proteins. Nature. 316:170–174.
33. Li, H., R. Helling, C. Tang, and N. Wingreen. 1996. Emergence of preferred structures in a simple model of protein folding. Science. 273:666–669.
34. Strait, B. J., and T. G. Dewey. 1996. The Shannon information entropy of protein sequences. Biophys. J. 71:148–155.
35. Luscombe, N. M., J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. 3:R00401–R00407.
36. Wuchty, S. 2001. Scale-free behavior in protein domain networks. Mol. Biol. Evol. 18:1694–1702.
37. Barabasi, A.-L. 2002. Linked: the new science of networks. Perseus Books, Cambridge, MA.
38. Czirok, A., R. N. Mantegna, S. Havlin, and H. E. Stanley. 1995. Correlations in binary sequences and a generalized Zipf analysis. Phys. Rev. E. 52:446–452.
39. Bornberg-Bauer, E. 1997. How are model protein structures distributed in sequence space? Biophys. J. 73:2393–2403.
40. Pierce, J. R. 1980. An introduction to information theory: symbols, signals and noise. Dover, New York.
41. 2002. Folding as grammar. Nat. Struct. Biol. 9:713.
42. Sims, G. E., I. G. Choi, and S. H. Kim. 2005. Protein conformational space in higher order φ-ψ maps. Proc. Natl. Acad. Sci. USA. 102:618–621.
43. Simons, K. T., C. Kooperberg, E. Huang, and D. Baker. 1997. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268:209–225.
44. Kinch, L. N., J. O. Wrabl, S. S. Krishna, I. Majumdar, R. I. Sadreyev, Y. Qi, J. Pei, H. Cheng, and N. V. Grishin. 2003. CASP5 assessment of fold recognition target predictions. Proteins. 53(Suppl. 6):395–409.
45. Koonin, E. V., Y. I. Wolf, and G. P. Karev. 2002. The structure of the protein universe and genome evolution. Nature. 420:218–223.
46. Allen, J. P., J. T. Colvin, D. G. Stinson, C. P. Flynn, and H. J. Stapleton. 1982. Protein conformation from electron spin relaxation data. Biophys. J. 38:299–310.
47. Isogai, Y., and T. Itoh. 1984. Fractal analysis of tertiary structure of protein molecules. J. Phys. Soc. Japan. 53:2162–2171.
48. Ikeda, K., K. Tomii, T. Yokomizo, D. Mitomo, K. Maruyama, S. Suzuki, and J. Higo. 2005. Visualization of conformational distribution of short to medium size segments in globular proteins and identification of local structural motifs. Protein Sci. 14:1253–1265.
49. Honda, S., K. Yamasaki, Y. Sawada, and H. Morii. 2004. 10 residue folded peptide designed by segment statistics. Structure. 12:1507–1518.