| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |

* Graduate Group in Biophysics, and
Department of Pharmaceutical Chemistry, University of California, San Francisco, California
Correspondence: Address reprint requests to Dr. Irwin D. Kuntz, Dept. of Pharmaceutical Chemistry, University of California at San Francisco, San Francisco, CA 94143-0446. Tel.: 415-476-1937; Fax: 415-502-1411; E-mail: kuntz{at}cgl.ucsf.edu.
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
|
|
|---|
Sequence alignment is an integral part of comparative modeling protocols. Aside from ab initio methods (11
), theoretical structure prediction is generally approached in two steps:
A standard way to gain insight into complex problems is through SEMs. In the protein-folding field, these models use simplified representations of sequences and structures to mimic sequence and structure interactions in real systems. Thus, self-avoiding two-dimensional lattice walks and simplified alphabets have long been used to evaluate and understand the principles of protein folding (13
,14
). The ability to exhaustively enumerate all states of the system affords the opportunity to describe the system's behavior unambiguously, and it can provide a clear path relationship between assumptions and consequences. Thus, SEMs are well suited for formulating and evaluating general concepts: a task that may be much more difficult with real-world examples because of their heavy parameter dependence and need for approximations (15
).
In this article, we combine the use of simplified systems with information theory to derive the costs of alignment procedures, scoring matrices, and gap penalties of idealized models. We then consider the applicability of the insights gained to the current approaches to sequence alignment. As noted above, our modeling choices are chosen to illuminate the underlying issues. We consider force fields in the companion article (10
).
| METHODS |
|---|
|
|
|---|
Of course, it is not feasible to write out all sequences for all proteins and nucleic acids. Nor can we advance a comprehensive model for the evolutionary and structural constraints that give rise to the sequences that form the current pan-genomic database. Rather, our strategy will be to uncover general properties by making use of model systems and simplified alphabets (14
,16
). However, we are also interested in exploring the implications of such models for the real world of sequences and conformations. This relationship is not a formal part of information theory, and will involve additional assumptions or hypotheses, the truth of which must be established by other methods. For example, it is straightforward to evaluate the informational consequences of the proposition that the known sequences are a random subset of all possible sequences; this proposal can be directly tested statistically, but information theory, alone, cannot determine its validity.
Shannon information of a set W
The information required to select an individual entity from a set of W objects is defined as
![]() | (1) |
![]() | (2) |
As mentioned previously, clustering can lead to a change in information. We relate Eqs. 1 and 2 to yield the information gain/loss of a clustering procedure as
![]() | (3) |
Sequence alignment
Overview
Sequences of proteins or nucleic acids of unknown structure and function are sources of information through association with other sequences whose functions/structures are already known. The most widely used associative process is alignment. Alignment algorithms can be divided into two categories, global and local. A global alignment (17
) looks for the best overall similarity among sequences, whereas a local alignment (18
) searches for similar sub-sequences between two proteins. Both of these algorithms make use of a variety of scoring matrices and gap penalties (19
24
). Sequence-alignment problems are underdetermined, having multiple optimal solutions depending on the parameters used. Thus far, there has not been a quantitative analysis of the parameter dependence, one reason being the absence of a standard comparison metric. With an information theoretic approach, we are able to formalize the effects of parameters such as sequence length, alphabet size, etc. We consider the sets of sequences of length N, drawn from an alphabet of A characters. Assume that the characters are used with equal frequency (effects of character correlation can be readily included at a later stage). With this simplifying assumption, each sequence has equal weight and there are AN unique sequences. The information content of the set is simply Nlog2A. Alignment procedures require the definition of a template of length T < N. The template may contain gapsthat is, the string for the template may contain one or more positions that match any character. Alternatively, the template may be considered continuous and the sequences with which it is being compared can contain gaps. We ask how many sequences of an exhaustive list match a specific template. Most generally, because there is nothing of special interest for any given template, we are interested in the information content averaged over all templates of a certain type.
We begin with the case of gapless pairwise alignments and then move to multigapped alignments. We will use both exhaustive and stochastic data sets, along with simple alignment models, to provide insight into the informational issues associated with sequence comparisons.
Statistics from alignments are gathered under two scenarios:
Gapless alignments
For an A-letter alphabet, the total number of possible N-letter sequences is AN,
![]() | (4) |
![]() | (5) |
For single-occurrence, ungapped alignments, we have an alternative approach using 1), the contiguous string; and 2), the standard formula for the probability of failure to match, pF. Given the probability of occurrence of the template in a single sequence, pT = (1/A)T, and the number of independent attempts (K+1), pF = (1 pT)K+1. The probability of a hit for a sequence, pH, is then (1 pF) and the total number of hits for the set is AN x pH,
![]() | (6) |
However, Eq. 6 provides useful values for I for the full range of K for single occurrence of templates (see Results and Discussion, below).
Gapped alignments
For more general gap distributions, in which all templates of length T = N K are aligned against a probe of length N, we need to consider the combinatorial arrangement of gaps of varying length. For gaps of minimum-length one, there will be C (N, K) = N!/K!(NK)! ways to arrange the gaps in an N-long sequence. However, if we require the minimum-gap size, Gmin, to be greater than one, then the effective length of the sequence is reduced to Neffective = N K x Gmin + K. There are AK sequences for each arrangement:
![]() |
We have also found a formulation leading to an exact solution for the number of gapped matches for single occurrences of the template. The number of hits to match a given template of length T, where K = N T, against an exhaustive set of sequences becomes
![]() | (8) |
![]() | (9) |
Gap penalties
The formulas above quantify the amount of information associated with successful alignments when an exhaustive basis set of all possible sequences is available. They also can be used to set bounds on gap penalty values (see Results and Discussion).
Gap penalties can be derived by examining the length distributions of gaps in systems where structural alignment is possible. This approach is based on a general affine model of gap penalties (25
) and uses a geometric distribution to assess the probability distribution of gaps, yielding, in the Qian and Goldstein treatment (26
), the formula for gap initiation (
I) and gap extension (
E) penalties of
![]() | (10) |
is the half-decay length of the gap length distribution. The values for Pg and
are determined in a similar way to Qian and Goldstein (26
![]() | (11) |
)/[1exp(1/
)].
|
I) and extension(
E) penalties.
Search algorithm methods
First-occurrence alignments
For every sequence, S, in the set, given a template T, we look for the first occurrence of the symbol in the first position of the template. Looking forward, we search for the first instance of the symbol in the second position of the template and so forth, until the last position in T. If a symbol is not observed in S in order of appearance in T, the search is terminated. Indices of each hit in the sequence are tabulated to determine the length of the gap among instances of each symbol present in the template.
Multiple-occurrence alignments
Fig. 2 shows how the occurrences of a template T are sought in a sequence, S, consisting of an alphabet of size A. The final list contains all the occurrences of T in S by specifying the indices of the symbols in S. The positional indices for each occurrence are used to determine the distribution of gap lengths.
|
A harder question is the relationship of the observed sequences to the exhaustive set. Many hypotheses can be put forward about the mechanisms of evolution and the types of structural constraints imposed upon both nucleic acids and proteins. We do not propose in this article to select among them. Instead, we provide simple examples to illustrate how the mapping from exhaustive sequences to sequence subsets changes the information content of alignment operations and hence changes the values of gap penalties. These simple hypotheses are:
To examine the information content of a random subset of the exhaustive sequences, we generated sets of 10,000100,000 random sequences of lengths N = 10, 20, 30 for A = 2. These were scanned with templates of various lengths and gap lengths. The number of hits was recorded with each of the sequences as a starting point and the probabilities of clustering were calculated. The information content for each alignment procedure was tabulated.
To generate a simple evolutionary model, we used the constraint that L of the N positions would not vary. This assumption produces a subset of sequences that are in exact correspondence to sequences from the exhaustive list for N' where N' = N L. Equations 5 9 can then be applied directly to this subset.
Correlation of sequence alignment and conformational resolution
Of course, one random model and one simple evolutionary model just begin to explore the sequence constraints operating on the natural sequences. Presumably, one of the critical limits is that most of the experimental sequences arise from sequence subsets that provide stable three-dimensional (tertiary) structures for some range of physical variables. We have not attempted to construct such a model in this article, but others have approached the problem (29
,30
).
In a seminal article, Chothia and Lesk (31
) provided the first quantitative relationship between sequence identity and structural similarity. In recent work, this relationship has been revisited in great detail (32
). For our purposes, we use the methods above and those of Sullivan and Kuntz (33
) to compare the information content of sequence and structural alignments, as follows. As a model of real proteins, we choose the backbone conformations for 100-mer compact polyalanine chains, a 20-letter alphabet, and a multiple-occurrence gapped alignment model. For a given homology level, we then compare the information of sequence and structural alignments. For our example, we select a similarity level of 90%. We use Eqs. 2, 3, and 7 to determine the information from sequence alignments. From Chothia and Lesk, a 90% identity yields a root-mean-square deviation (RMSD) of
0.5 Å. Therefore, we ask: What is the conformational probability of 100-mer compact polyalanine models falling within 0.5 Å RMSD, as derived by Sullivan and Kuntz, to determine the information required for a corresponding structural alignment?
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
|
|
|
|
|
Our basic findings are:
|
|
|
|
|
The empirical gap penalties currently in use are obtained from training sets on homologous proteins. Our approach in this article has been to explore simple, exhaustive treatments of sequence alignment with a much broader reference state. We show that these efforts lead directly to a priori gap penalties that depend on models of the protein and nucleic-acid sequence universe. Our formulas suggest alphabet-size and sequence-length dependencies that are not included in current methods. They lead directly to two practical suggestions:
Most of our results are at the low end of reported range of gap penalties. One reasonable explanation is that the set of known sequences is significantly nonrandom because of some combination of evolutionary and structural constraints; for example, reduced gap probabilities inside secondary structure elements, which would lead to higher-than-random gap penalties for gaps in loops. Although we cannot say that gap penalties based on SEMs will lead to better alignments than current methods, we hope that exploring the underlying relationships in simple systems will lead to improved understanding of the basic principles.
At a higher level, we can use the machinery set up above to ask how the information content of sequence alignment compares to the information content of structural alignments. To do this we draw on the basic formulas derived above. We also use the work of Chothia and Lesk (31
), which establishes the relationship between sequence identity and structural variation, and the article of Sullivan and Kuntz (36
), which gives log odds probabilities of structural variation. For this purpose, we treat a specific example of a compact 100 polyalanine backbone conformations because the Chothia-Lesk relation applies to native proteins. We ask what is the information content of a 90% identity match using a 20-letter, equal frequency, alphabet. From Eq. 7 we get 87 bits of required information (IM) for a 90% identity alignment. The full protein requires 432 bits (IS) (Eqs. 1 and 2), so the information gain from a 90% alignment is 345 bits (Eq. 3). A correction for the known frequency of use of the amino acids is
17 bits, so that a 90% homology match using realistic frequencies contains something approaching 3.5 bits/residue of information. From Chothia and Lesk, as well as later work (32
), we see that 90% homology implies a structural variance of
0.5 Å. Using the data from the cumulative distribution function of Sullivan and Kuntz (36
) and Eq. 2 of this article, we can estimate how much information is required to achieve an RMSD of 0.5 Å for a stochastic population of compact polyalanine chains. We get
2.42.5 bits/residue required to select this RMSD distribution from a stochastic set of compact chains. This calculation indicates that, roughly speaking, high-end sequence alignment combined with homology modeling could approach the quality of direct structural measurements for determining backbone geometries. It also implies that the designability hypothesis (i.e., many sequences per structure) derived from lattice models (37
) can also be supported from information content assessments of off-lattice conformational estimates.
In the companion article following, we extend these calculations to include comparison with force fields as well, showing how the use of information theory allows direct comparison of quite diverse techniques.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
This work was supported by a grant from the National Science Foundation (No. CHE-0118481), R. Kip Guy, Principal Investigator; a National Institutes of Health grant (No. RR019864), C. Pancerella, Principal Investigator; and a University of California President's Dissertation Year Fellowship (to T.A.).
Submitted on October 12, 2004; accepted for publication August 15, 2005.
| REFERENCES |
|---|
|
|
|---|
2. Stroud, R. M., and E. B. Fauman. 1995. Significance of structural changes in proteins: expected errors in refined protein structures. Protein Sci. 4:23922404.[Abstract]
3. Brunger, A. T. 1992. Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature. 355:472475.[CrossRef]
4. Berger, B., J. Kleinberg, and T. Leighton. 1999. Reconstructing a three-dimensional model with arbitrary errors. J. ACM. 46:212235.[CrossRef]
5. Levitt, M., and M. Gerstein. 1998. A unified statistical framework for sequence comparison and structure comparison. Proc. Natl. Acad. Sci. USA. 95:59135920.
6. Park, B., and M. Levitt. 1996. Energy functions that discriminate x-ray and near-native folds from well-constructed decoys. J. Mol. Biol. 258:367392.[CrossRef][Medline]
7. Carothers, J. M., S. C. Oestreich, J. H. Davis, and J. W. Szostak. 2004. Informational complexity and functional activity of RNA structures. J. Am. Chem. Soc. 126:51305137.[CrossRef][Medline]
8. Sullivan, D. C., T. Aynechi, V. A. Voelz, and I. D. Kuntz. 2003. Information content of molecular structures. Biophys. J. 85:174190.
9. Shannon, C. E. 1948. A mathematical theory of communication. Bell Sys. Tech. J. 27:379423,623656.
10. Aynechi, T., and I. D. Kuntz. 2005. An information theoretic approach to macromolecular modeling: II. Force fields. Biophys. J. 89:30083016.
11. Bonneau, R., and D. Baker. 2001. Ab initio protein structure prediction: progress and prospects. Annu. Rev. Biophys. Biomol. Struct. 30:173189.[CrossRef][Medline]
12. Vingron, M., and M. S. Waterman. 1994. Sequence alignment and penalty choice. Review of concepts, case studies and implications. J. Mol. Biol. 235:112.[CrossRef][Medline]
13. Dill, K. A., S. Bromberg, K. Yue, K. M. Fiebig, D. P. Yee, P. D. Thomas, and H. S. Chan. 1995. Principles of protein foldinga perspective from simple exact models. Protein Sci. 4:561602.[Abstract]
14. Wroe, R., E. Bornberg-Bauer, and H. S. Chan. 2005. Comparing folding codes in simple heteropolymer models of protein evolutionary landscape: robustness of the superfunnel paradigm. Biophys. J. 88:118131.
15. Habeck, M., M. Nilges, and W. Rieping. 2005. Replica-exchange Monte Carlo scheme for Bayesian data analysis. Phys. Rev. Lett. 94:018105.[CrossRef][Medline]
16. Solis, A. D., and S. Rackovsky. 2000. Optimized representations and maximal information in proteins. Proteins. 38:149164.[CrossRef][Medline]
17. Needleman, S. B., and C. D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443453.[CrossRef][Medline]
18. Smith, T. F., and M. S. Waterman. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195197.[CrossRef][Medline]
19. Apostolico, A., and R. Giancarlo. 1998. Sequence alignment in molecular biology. J. Comput. Biol. 5:173196.[Medline]
20. Qian, B., and R. A. Goldstein. 2002. Optimization of a new score function for the generation of accurate alignments. Proteins. 48:605610.[CrossRef][Medline]
21. Altschul, S. F. 1998. Generalized affine gap costs for protein sequence alignment. Proteins. 32:8896.[CrossRef][Medline]
22. Koretke, K. K., Z. Luthey-Schulten, and P. G. Wolynes. 1996. Self-consistently optimized statistical mechanical energy functions for sequence structure alignment. Protein Sci. 5:10431059.[Abstract]
23. Lesk, A. M., M. Levitt, and C. Chothia. 1986. Alignment of the amino acid sequences of distantly related proteins using variable gap penalties. Protein Eng. 1:7778.
24. Benner, S. A., M. A. Cohen, and G. H. Gonnet. 1993. Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J. Mol. Biol. 229:10651082.[CrossRef][Medline]
25. Zachariah, M. A., G. E. Crooks, S. R. Holbrook, and S. E. Brenner. 2005. A generalized affine gap model significantly improves protein sequence alignment accuracy. Proteins. 58:329338.[CrossRef][Medline]
26. Qian, B., and R. A. Goldstein. 2001. Distribution of index lengths. Proteins. 45:102104.[CrossRef][Medline]
27. Strait, B., and T. Dewey. 1996. The Shannon information entropy of protein sequences. Biophys. J. 71:148155.
28. Cline, M. S., K. Karplus, R. H. Lathrop, T. F. Smith, R. G. Rogers, Jr., and D. Haussler. 2002. Information-theoretic dissection of pairwise contact potentials. Proteins. 49:714.[CrossRef][Medline]
29. Helling, R., H. Li, R. Melin, J. Miller, N. Wingreen, C. Zeng, and C. Tang. 2001. The designability of protein structures. J. Mol. Graph. Model. 19:157167.[CrossRef][Medline]
30. Lau, K. F., and K. A. Dill. 1990. Theory for protein mutability and biogenesis. Proc. Natl. Acad. Sci. USA. 87:638642.
31. Chothia, C., and A. M. Lesk. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5:823826.[Medline]
32. Gan, H. H., R. A. Perlow, S. Roy, J. Ko, M. Wu, J. Huang, S. Yan, A. Nicoletta, J. Vafai, D. Sun, L. Wang, J. E. Noah, S. Pasquali, and T. Schlick. 2002. Analysis of protein sequence/structure similarity relationships. Biophys. J. 83:27812791.
33. Sullivan, D. C., and I. D. Kuntz. 2001. Conformation spaces of proteins. Proteins. 42:495511.[CrossRef][Medline]
34. Irback, A., and E. Sandelin. 2000. On hydrophobicity correlations in protein chains. Biophys. J. 79:22522258.
35. Cui, Y., W. H. Wong, E. Bornberg-Bauer, and H. S. Chan. 2002. Recombinatoric exploration of novel folded structures: a heteropolymer-based model of protein evolutionary landscapes. Proc. Natl. Acad. Sci. USA. 99:809814.
36. Sullivan, D. C., and I. D. Kuntz. 2004. Distributions in protein conformation space: implications for structure prediction and entropy. Biophys. J. 87:113120.
37. Li, H., C. Tang, and N. S. Wingreen. 2002. Designability of protein structures: a lattice-model study using the Miyazawa-Jernigan matrix. Proteins. 49:403412.[CrossRef][Medline]
38. Henikoff, S., and J. G. Henikoff. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA. 89:1091510919.
39. Dayhoff, M. O. 1978. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring, MD. 345352.
40. Gonnet, G. H., M. A. Cohen, and S. A. Benner. 1992. Exhaustive matching of the entire protein sequence database. Science. 256:14431445.
41. Overington, J., D. Donnelly, M. S. Johnson, A. Sali, and T. L. Blundell. 1992. Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci. 1:216226.[Abstract]
42. Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275282.
43. Blake, J. D., and F. E. Cohen. 2001. Pairwise sequence alignment below the twilight zone. J. Mol. Biol. 307:721735.[CrossRef][Medline]
44. Doolittle, R. F. 1986. Of Urfs and Orfs: A Primer on How to Analyze Derived Amino Acid Sequences. University Science Books, Mill Valley, CA.
45. Mallick, P., D. Rice, and D. Eisenberg. DAPS: database of distant aligned protein structures. http://www.doe-mbi.ucla.edu/
parag/DAPS/,2001.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |