help button home button Biophys. J.
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Aynechi, T.
Right arrow Articles by Kuntz, I. D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Aynechi, T.
Right arrow Articles by Kuntz, I. D.
Biophysical Journal 89:2998-3007 (2005)
© 2005 The Biophysical Society

An Information Theoretic Approach to Macromolecular Modeling: I. Sequence Alignments

Tiba Aynechi * and Irwin D. Kuntz {dagger}

* Graduate Group in Biophysics, and {dagger} Department of Pharmaceutical Chemistry, University of California, San Francisco, California

Correspondence: Address reprint requests to Dr. Irwin D. Kuntz, Dept. of Pharmaceutical Chemistry, University of California at San Francisco, San Francisco, CA 94143-0446. Tel.: 415-476-1937; Fax: 415-502-1411; E-mail: kuntz{at}cgl.ucsf.edu.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We are interested in applying the principles of information theory to structural biology calculations. In this article, we explore the information content of an important computational procedure: sequence alignment. Using a reference state developed from exhaustive sequences, we measure alignment statistics and evaluate gap penalties based on first-principle considerations and gap distributions. We show that there are different gap penalties for different alphabet sizes and that the gap penalties can depend on the length of the sequences being aligned. In a companion article, we examine the information content of molecular force fields.


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Structural biology now has the challenge of providing structural and functional information on a genomic scale. Current methods combine different experimental and computational procedures to deduce the structure and function of biomacromolecules. A partial list includes sequence analysis, crystallography, magnetic resonance, spectroscopy, homology modeling, and molecular dynamics. However, despite the quantitative nature of such undertakings, there is no unifying model of information content and error analysis for the field as a whole. Although there have been important specialized forays (1Go–7Go), there is a need to seek a broader approach that would permit the evaluation and comparison of such methods. A further related concern is the additivity of information when different techniques are combined. Previously, we demonstrated (8Go) that the basic tenets of information theory (9Go) can be used to quantify the information content of distance constraints. In this and a companion article (10Go), we apply the same general principles to simple exact models (SEMs) to draw inferences about two important tools of computational biology, sequence analysis, and force fields.

Sequence alignment is an integral part of comparative modeling protocols. Aside from ab initio methods (11Go), theoretical structure prediction is generally approached in two steps:

  1. Given an amino acid sequence, find an appropriate structural template (using homology modeling and/or threading).
  2. Refine the structural model to produce an energetically minimized or best-scoring conformation.
The first step requires sequence-alignment algorithms, which rely heavily on the use of empirical parameters such as gap penalties and scoring matrices (12Go). Because sequence space is poorly characterized, it is difficult to either evaluate or improve overall performance except in the context of specific training sets. What is needed is a unified picture of the fundamental issues.

A standard way to gain insight into complex problems is through SEMs. In the protein-folding field, these models use simplified representations of sequences and structures to mimic sequence and structure interactions in real systems. Thus, self-avoiding two-dimensional lattice walks and simplified alphabets have long been used to evaluate and understand the principles of protein folding (13Go,14Go). The ability to exhaustively enumerate all states of the system affords the opportunity to describe the system's behavior unambiguously, and it can provide a clear path relationship between assumptions and consequences. Thus, SEMs are well suited for formulating and evaluating general concepts: a task that may be much more difficult with real-world examples because of their heavy parameter dependence and need for approximations (15Go).

In this article, we combine the use of simplified systems with information theory to derive the costs of alignment procedures, scoring matrices, and gap penalties of idealized models. We then consider the applicability of the insights gained to the current approaches to sequence alignment. As noted above, our modeling choices are chosen to illuminate the underlying issues. We consider force fields in the companion article (10Go).


    METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Overview
Our basic approach is to explore a simple exact model where it is possible to write out all occurrences of the set of interest (i.e., all possible sequences) and ask what the informational consequences are of performing an operation that combines some of the objects. The information required to select one object from a set of W objects is log2W (9Go). The normal use of sequence-alignment procedures is considered to increase the information associated with a given probe sequence. That is, one queries a database of sequences and assigns properties (structure, function) to the probe sequence based on the statistically valid matches that are found. This information increase arises because a single sequence is placed into a cluster smaller than the full set of sequences from which it was indistinguishable before the alignment procedure. However, we can equally well treat alignment as a clustering procedure in which a number of sequences are grouped together as indistinguishable. Such clustering reduces the effective number of distinguishable objects compared to the full set of unique sequences. From this point of view, information is lost because a number of sequences that were treated as distinct from one another are now considered the same (i.e., similar sequence, function, or structure). In this context, gap penalties are a direct reflection of the price that must be paid for the information loss.

Of course, it is not feasible to write out all sequences for all proteins and nucleic acids. Nor can we advance a comprehensive model for the evolutionary and structural constraints that give rise to the sequences that form the current pan-genomic database. Rather, our strategy will be to uncover general properties by making use of model systems and simplified alphabets (14Go,16Go). However, we are also interested in exploring the implications of such models for the real world of sequences and conformations. This relationship is not a formal part of information theory, and will involve additional assumptions or hypotheses, the truth of which must be established by other methods. For example, it is straightforward to evaluate the informational consequences of the proposition that the known sequences are a random subset of all possible sequences; this proposal can be directly tested statistically, but information theory, alone, cannot determine its validity.

Shannon information of a set W
The information required to select an individual entity from a set of W objects is defined as

(1)
IS is referred to as the source information (9Go). Given a metric set, M, that partitions the objects into subsets, the information content can be measured in bits using Shannon's formulation (9Go),

(2)
in which pk is the population of cluster k expressed as a fraction of the ensemble, summed over all clusters. These clusters are subsets of the population that are indistinguishable under particular assumptions or constraints.

As mentioned previously, clustering can lead to a change in information. We relate Eqs. 1 and 2 to yield the information gain/loss of a clustering procedure as

(3)

Sequence alignment
Overview
Sequences of proteins or nucleic acids of unknown structure and function are sources of information through association with other sequences whose functions/structures are already known. The most widely used associative process is alignment. Alignment algorithms can be divided into two categories, global and local. A global alignment (17Go) looks for the best overall similarity among sequences, whereas a local alignment (18Go) searches for similar sub-sequences between two proteins. Both of these algorithms make use of a variety of scoring matrices and gap penalties (19Go–24Go). Sequence-alignment problems are underdetermined, having multiple optimal solutions depending on the parameters used. Thus far, there has not been a quantitative analysis of the parameter dependence, one reason being the absence of a standard comparison metric. With an information theoretic approach, we are able to formalize the effects of parameters such as sequence length, alphabet size, etc. We consider the sets of sequences of length N, drawn from an alphabet of A characters. Assume that the characters are used with equal frequency (effects of character correlation can be readily included at a later stage). With this simplifying assumption, each sequence has equal weight and there are AN unique sequences. The information content of the set is simply Nlog2A. Alignment procedures require the definition of a template of length T < N. The template may contain gaps—that is, the string for the template may contain one or more positions that match any character. Alternatively, the template may be considered continuous and the sequences with which it is being compared can contain gaps. We ask how many sequences of an exhaustive list match a specific template. Most generally, because there is nothing of special interest for any given template, we are interested in the information content averaged over all templates of a certain type.

We begin with the case of gapless pairwise alignments and then move to multigapped alignments. We will use both exhaustive and stochastic data sets, along with simple alignment models, to provide insight into the informational issues associated with sequence comparisons.

Statistics from alignments are gathered under two scenarios:

  1. For every sequence in the data set, a single (first) occurrence of the template, T, is sufficient for assignment.
  2. All possible occurrences of a template are sought in each sequence of the data set (multiple-occurrence model).

Gapless alignments
For an A-letter alphabet, the total number of possible N-letter sequences is AN,

(4)
In the simplest case, we consider those template sequences of length T whose elements are found in contiguous positions in probe sequences of length N. Defining K = N–T, the templates can be anchored in K+1 positions each with AK possible matches, leading to an estimate of (K+1)AK sequences if there are no duplicate sequences. Consequently, the information required to distinguish among the ungapped matches in an exhaustive set is

(5)
This formula counts exactly all occurrences of the template in complete (multiple occurrence) alignments. For a two-letter alphabet consisting of zeroes and ones, the templates are of the form 01, 001, 0001, .... Templates of higher symmetry, e.g., 000 or 010, have somewhat fewer hits (data not shown). Using the asymmetric templates gives the maximum number of hits, which also corresponds to the numerical results from the formulas.

For single-occurrence, ungapped alignments, we have an alternative approach using 1), the contiguous string; and 2), the standard formula for the probability of failure to match, pF. Given the probability of occurrence of the template in a single sequence, pT = (1/A)T, and the number of independent attempts (K+1), pF = (1 – pT)K+1. The probability of a hit for a sequence, pH, is then (1 – pF) and the total number of hits for the set is AN x pH,

(6)
This formula assumes that multiple occurrences of a template occur with an equal probability, pT, in a given sequence. However, once a template appears once in a sequence, it fixes the positions it occupies and the probabilities of any subsequent overlapping templates will no longer be independent. As a result, this formula may either underestimate or overestimate the hit count depending on the template type. In the case of the templates used here (01, 0001,...), the formula underestimates. This effect is lessened as the template/sequence length ratio becomes larger, because of the diminishing possibility of overlaps.

However, Eq. 6 provides useful values for I for the full range of K for single occurrence of templates (see Results and Discussion, below).

Gapped alignments
For more general gap distributions, in which all templates of length T = N – K are aligned against a probe of length N, we need to consider the combinatorial arrangement of gaps of varying length. For gaps of minimum-length one, there will be C (N, K) = N!/K!(NK)! ways to arrange the gaps in an N-long sequence. However, if we require the minimum-gap size, Gmin, to be greater than one, then the effective length of the sequence is reduced to Neffective = NK x Gmin + K. There are AK sequences for each arrangement:

Results for Gmin = 1 are exact for complete alignments (see Results and Discussion).

We have also found a formulation leading to an exact solution for the number of gapped matches for single occurrences of the template. The number of hits to match a given template of length T, where K = NT, against an exhaustive set of sequences becomes

(8)
This equation (discovered empirically from the counting data) provides exact counts over the complete range of N, T. When converting to bits of information, the right-hand side of Eq. 8 generally cannot be reduced to a simpler form; however, when A > 2 and T < 95% of N, the highest order term sufficiently dominates so that the summation is no longer needed. Under these circumstances, the information is, to a good approximation,

(9)

Gap penalties
The formulas above quantify the amount of information associated with successful alignments when an exhaustive basis set of all possible sequences is available. They also can be used to set bounds on gap penalty values (see Results and Discussion).

Gap penalties can be derived by examining the length distributions of gaps in systems where structural alignment is possible. This approach is based on a general affine model of gap penalties (25Go) and uses a geometric distribution to assess the probability distribution of gaps, yielding, in the Qian and Goldstein treatment (26Go), the formula for gap initiation ({gamma}I) and gap extension ({gamma}E) penalties of

(10)
Here, Pg is the probability of opening a gap, and {lambda} is the half-decay length of the gap length distribution. The values for Pg and {lambda} are determined in a similar way to Qian and Goldstein (26Go). For a given sequence-length and template, we plot the distribution of gap lengths versus the probability for the observed hits (as an example, see Fig. 1). We then fit the data to an exponential of the form

(11)
where n is length of the gap and B is defined as B = Pg x exp(–1/{lambda})/[1–exp(–1/{lambda})].



View larger version (20K):
[in this window]
[in a new window]
 
FIGURE 1  Calculation of {lambda} and Pg from gap distributions. Sample distribution of gaps for the first-occurrence model, using a stochastic set of 50-mers with a six-letter alphabet, using a 10-mer template. The data is fit (solid line) to the form p(n) = B x exp(–n/{lambda}), for this case {lambda} = 3.833, B = 0.30298. According to the formulation of Qian and Goldstein (20Go), B = Pg x exp(–1/{lambda})/[1–exp(–1/{lambda})]. We calculate Pg by substituting {lambda} back into the expression for B to obtain gap initiation and extension penalties according to Eq. 10.

 
To compare our exhaustive reference-state distributions to previously determined values, we use our counting experiments to measure the distribution of gap lengths for sets of sequences and templates of varying length. We then use the Qian and Goldstein equation (10Go) to calculate gap initiation ({gamma}I) and extension({gamma}E) penalties.

Search algorithm methods
First-occurrence alignments
For every sequence, S, in the set, given a template T, we look for the first occurrence of the symbol in the first position of the template. Looking forward, we search for the first instance of the symbol in the second position of the template and so forth, until the last position in T. If a symbol is not observed in S in order of appearance in T, the search is terminated. Indices of each hit in the sequence are tabulated to determine the length of the gap among instances of each symbol present in the template.

Multiple-occurrence alignments
Fig. 2 shows how the occurrences of a template T are sought in a sequence, S, consisting of an alphabet of size A. The final list contains all the occurrences of T in S by specifying the indices of the symbols in S. The positional indices for each occurrence are used to determine the distribution of gap lengths.



View larger version (34K):
[in this window]
[in a new window]
 
FIGURE 2  Search algorithm for finding all occurrences of a template, T, in a sequence S, using alphabet A.

 
Exhaustive versus incomplete sequence sets
Mapping
We turn to the question of how to compare results from the exhaustive list of sequences with those generated from a (sub)set of observed sequences. There are several issues. First, the set of observed sequences is not fixed but is continually updated with new sequences being added and old sequences being modified or even deleted. For our purposes, we will ignore such issues and simply take a snapshot of the existing data. A second concern is that the observed sequences show unequal utilization of the characters. Such variable weightings were part of Shannon's initial formulation and Eq. 2 yields a single correction term equal to 0.12 bits/amino acid, based on the nonuniform composition of amino acids in real proteins (27Go). Higher-order terms dealing with joint probability of multiple characters can also be considered as corrections to the simple assumption of equal frequencies (28Go). In the same way, scoring matrices with partial weights for alternative characters reduce the effective alphabet below the limit set by equal utilization of all characters. A correction term can be generated for any scoring matrix of interest.

A harder question is the relationship of the observed sequences to the exhaustive set. Many hypotheses can be put forward about the mechanisms of evolution and the types of structural constraints imposed upon both nucleic acids and proteins. We do not propose in this article to select among them. Instead, we provide simple examples to illustrate how the mapping from exhaustive sequences to sequence subsets changes the information content of alignment operations and hence changes the values of gap penalties. These simple hypotheses are:

  1. The observed sequences are a random subset of the exhaustive sequences.
  2. The observed sequences are a particular evolutionary subset of the exhaustive sequences.
Again, our purpose is not to espouse these models, but to show how Eqs. 59 are modified in each case.

To examine the information content of a random subset of the exhaustive sequences, we generated sets of 10,000–100,000 random sequences of lengths N = 10, 20, 30 for A = 2. These were scanned with templates of various lengths and gap lengths. The number of hits was recorded with each of the sequences as a starting point and the probabilities of clustering were calculated. The information content for each alignment procedure was tabulated.

To generate a simple evolutionary model, we used the constraint that L of the N positions would not vary. This assumption produces a subset of sequences that are in exact correspondence to sequences from the exhaustive list for N' where N' = NL. Equations 5 9 can then be applied directly to this subset.

Correlation of sequence alignment and conformational resolution
Of course, one random model and one simple evolutionary model just begin to explore the sequence constraints operating on the natural sequences. Presumably, one of the critical limits is that most of the experimental sequences arise from sequence subsets that provide stable three-dimensional (tertiary) structures for some range of physical variables. We have not attempted to construct such a model in this article, but others have approached the problem (29Go,30Go).

In a seminal article, Chothia and Lesk (31Go) provided the first quantitative relationship between sequence identity and structural similarity. In recent work, this relationship has been revisited in great detail (32Go). For our purposes, we use the methods above and those of Sullivan and Kuntz (33Go) to compare the information content of sequence and structural alignments, as follows. As a model of real proteins, we choose the backbone conformations for 100-mer compact polyalanine chains, a 20-letter alphabet, and a multiple-occurrence gapped alignment model. For a given homology level, we then compare the information of sequence and structural alignments. For our example, we select a similarity level of 90%. We use Eqs. 2, 3, and 7 to determine the information from sequence alignments. From Chothia and Lesk, a 90% identity yields a root-mean-square deviation (RMSD) of ~0.5 Å. Therefore, we ask: What is the conformational probability of 100-mer compact polyalanine models falling within 0.5 Å RMSD, as derived by Sullivan and Kuntz, to determine the information required for a corresponding structural alignment?


    RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have studied two alignment models, each treating ungapped and gapped templates. We have both analytic formulas and statistical results for the information. In addition, we have collected statistics on gap frequencies and the probability distribution of gap lengths. As noted earlier, Eqs. 5 and 6 describe the ungapped data exactly for multiple hits (Table 1) and within an average of 3% for single-occurrence hits (Table 2), respectively. Tables 3 and 4 show that Eqs. 7 and 8 provide an exact numerical result for gapped alignments for multiple and first-occurrence models.


View this table:
[in this window]
[in a new window]
 
TABLE 1  Gapless alignments for exhaustive sequence sets—multiple-occurrences as described by Eq. 5, verified numerically from the counting data

 

View this table:
[in this window]
[in a new window]
 
TABLE 2  Gapless alignments for exhaustive sequence sets—first occurrence as described by Eq. 6, verified numerically from the counting data

 

View this table:
[in this window]
[in a new window]
 
TABLE 3  Gapped alignments for exhaustive sequence sets—multiple-occurrences as described by Eq. 7, verified numerically from the counting data

 

View this table:
[in this window]
[in a new window]
 
TABLE 4  Gapped alignments for exhaustive sequence sets—first-occurrence model as described by Eq. 8, verified numerically from the counting data

 
One of our primary concerns is the implication of these equations for gap penalties. We can get estimates of these penalties by examining the equations directly, or we can calculate the distribution of gap lengths. Equations 59 contain terms sensitive to K (the total length of all gaps), as well as terms that depend on the size of the alphabet, A, and the length of the sequence, N. These formulas cannot be separated cleanly into gap initiation terms and gap extension terms. However, they are generally consistent with a gap penalty that costs information at the rate of log2A per unit of gap length (K). The full loss (initiation + extension) at K = 1 is log2(A x N). For N = 100, such a penalty would be equivalent to –6.0 for A = 4 and –7.6 for A = 20 in the units normally used for sequence-alignment programs (i.e., ln A). These values are model-dependent. We also note that, for equivalent coding, the nucleic acid model, A = 4, would have an N of 300, yielding a penalty term of –7.1. These results are in reasonable agreement with the range of empirical gap initiation penalties reported by Qian and Goldstein (20Go) (see Fig. 3).



View larger version (46K):
[in this window]
[in a new window]
 
FIGURE 3  Distribution of gap initiation and extension penalties. Medium hashed bars designate gap initiation; solid bars, gap extension. Bars labeled Info Theory represent values derived from our gap distributions using Eq. 10. Dense hash bars indicate gap initiation + gap extension approximated using Eqs. 59. Data for the following were taken from Qian and Goldstein (20Go): BLOSUM62, BLOSUM30 (38Go); PAM250, PAM350, PAM500 (39Go); GCB (40Go); STR (41Go); JTT(42Go); BC0030 (43Go); OPTIMA (44Go); D-BL25(45Go); and STROMA (20Go).

 
We can also examine the probability distribution of gap lengths (26Go). We gathered these data either from short exhaustive binary sequences or from stochastic samples of longer sequences with larger alphabets. There are two ways that we can count gap frequencies and gap lengths (see Methods, above). First, for the equations given above, we have used a first-occurrence model in which the gap-length data are taken from the initial successful match of a template to a sequence. Alternatively, we can identify all matches of a template with a specific sequence, and for each match, tabulate the gap-length information (multiple-occurrence) model. This model appears to be closer to the empirical data reported by Qian and Goldstein (26Go).

Our basic findings are:

  1. Most of the gap-length distributions can be approximated by an exponential, but those arising from larger alphabets and longer templates clearly have a more complex character. The distributions can be numerically fit as multiple exponentials similar to those found by Qian and Goldstein for sequence alignments of proteins of known structures. They can also be fit with polynomial functions. It is not obvious if these expanded functions carry any physical significance.
  2. For the first-occurrence model, the exponential decay increases strongly with alphabet size (Table 5). However, for the multiple-occurrence model, the exponential decay is independent of alphabet size, although it increases with N and decreases with T (Table 6). If we use the single-exponential approximation and the treatment of Qian and Goldstein (see Eq. 10), we get the range of gap penalties shown in Fig. 3. Our values are consistently on the low end of the empirical range.
To continue the comparison with the values in the literature (20Go), we return to the relationship between the set of observed sequences and the exhaustive reference state based on the two scenarios described earlier. Random models are easily constructed and tested. The stochastically derived probabilities for the alignments of random sequences show no surprises and are equal to those from the exhaustive set of sequences within the expected statistical variation (Table 7). Evolutionary models for the observed sequences can also be constructed. Two explicit models would be: 1), use of a full alphabet for a subset of the sequence positions with the other positions fixed; and 2), restricted alphabets at all sequence positions. In the former case, we would expect Eqs. 59 to apply directly, but with a reduced chain length (see Methods); in the latter, Eq. 2 can be used to compensate for the unequal probabilities of each character. To test the first evolutionary model numerically, we took the exhaustive set of fully variable 15-mers embedded in 20-mers with the first five L-positions invariant. The results (Fig. 4) for first-occurrence gapped hits closely correspond to the exhaustive 15-mer data, suggesting that Eqs. 8 and 9 are good approximations for sequences generated by evolutionary relationships. However, when we compute the multiple-occurrence gap-length distributions for the same data set, the situation is more complicated. The distribution functions are intermediate between the 15-mer and 20-mer data, with the distributions closer to the 20-mer results (Fig. 5). Others have also studied the evolutionary relationships among protein and model sequences using simple models. For example, Irback and Sandelin (34Go) show that real sequences deviate from random sequences in the distribution of hydrophobic residues in the chains, whereas Cui et al. (35Go) show that crossovers and nonhomologous combinations are favored in the evolution of low energy states. However, neither study explores information content for their systems.


View this table:
[in this window]
[in a new window]
 
TABLE 5  Gap distributions and gap penalties for the first-occurrence model

 

View this table:
[in this window]
[in a new window]
 
TABLE 6  Gap distributions and gap penalties for the multiple-occurrence model

 

View this table:
[in this window]
[in a new window]
 
TABLE 7  Data comparing the hit probabilities from exhaustive and stochastic sequence sets for A = 2, N = 20

 


View larger version (25K):
[in this window]
[in a new window]
 
FIGURE 4  Average number of gapped hits for all possible 5-mer templates in 32 related 20-mer evolutionary subsets (N = 20, L = 5) using the single-occurrence model. Average hits for the 15-mer and 20-mer exhaustive sets are shown on the axis ends.

 


View larger version (15K):
[in this window]
[in a new window]
 
FIGURE 5  Gap-length distributions in multiple-occurrence alignments for a 20-mer evolutionary subset (N = 20, L = 5). All possible 5-mer binary templates are aligned against the sequence set resulting in 32 different distributions represented with black solid lines. For comparison we also show the behavior of the 5-mer templates in 20-mer (–+–) and 15-mer (–{nabla}–) exhaustive sets. Some of the distributions from the evolutionary set are almost identical to the distributions from the 15-mer and 20-mer exhaustive sets.

 
The other question raised above is what type of gap simulation best captures normal alignment procedures, as for example, in the Needleman-Wunsch algorithm. The essential issue is that real-world data are drawn from a highly heterogeneous sequence set. The sequences and templates are of variable lengths, and the alphabets, although nominally of four or 20 letters, have unequal utilization of the letters in a sequence/structure-dependent manner. Furthermore, the results depend on whether first-occurrence or multiple-occurrence statistics are used. Thus, it seems unlikely that there is a best set of gap penalties for all alignment problems.

The empirical gap penalties currently in use are obtained from training sets on homologous proteins. Our approach in this article has been to explore simple, exhaustive treatments of sequence alignment with a much broader reference state. We show that these efforts lead directly to a priori gap penalties that depend on models of the protein and nucleic-acid sequence universe. Our formulas suggest alphabet-size and sequence-length dependencies that are not included in current methods. They lead directly to two practical suggestions:

  1. Gap penalties should differ for nucleic acid versus amino-acid sequence alignments.
  2. It would be useful to generate sequence-length-dependent penalty corrections.
We also have examined gap-occurrence probabilities, finding that they (approximately) follow a geometric distribution.

Most of our results are at the low end of reported range of gap penalties. One reasonable explanation is that the set of known sequences is significantly nonrandom because of some combination of evolutionary and structural constraints; for example, reduced gap probabilities inside secondary structure elements, which would lead to higher-than-random gap penalties for gaps in loops. Although we cannot say that gap penalties based on SEMs will lead to better alignments than current methods, we hope that exploring the underlying relationships in simple systems will lead to improved understanding of the basic principles.

At a higher level, we can use the machinery set up above to ask how the information content of sequence alignment compares to the information content of structural alignments. To do this we draw on the basic formulas derived above. We also use the work of Chothia and Lesk (31Go), which establishes the relationship between sequence identity and structural variation, and the article of Sullivan and Kuntz (36Go), which gives log odds probabilities of structural variation. For this purpose, we treat a specific example of a compact 100 polyalanine backbone conformations because the Chothia-Lesk relation applies to native proteins. We ask what is the information content of a 90% identity match using a 20-letter, equal frequency, alphabet. From Eq. 7 we get 87 bits of required information (IM) for a 90% identity alignment. The full protein requires 432 bits (IS) (Eqs. 1 and 2), so the information gain from a 90% alignment is 345 bits (Eq. 3). A correction for the known frequency of use of the amino acids is ~–17 bits, so that a 90% homology match using realistic frequencies contains something approaching 3.5 bits/residue of information. From Chothia and Lesk, as well as later work (32Go), we see that 90% homology implies a structural variance of ~0.5 Å. Using the data from the cumulative distribution function of Sullivan and Kuntz (36Go) and Eq. 2 of this article, we can estimate how much information is required to achieve an RMSD of 0.5 Å for a stochastic population of compact polyalanine chains. We get ~2.4–2.5 bits/residue required to select this RMSD distribution from a stochastic set of compact chains. This calculation indicates that, roughly speaking, high-end sequence alignment combined with homology modeling could approach the quality of direct structural measurements for determining backbone geometries. It also implies that the designability hypothesis (i.e., many sequences per structure) derived from lattice models (37Go) can also be supported from information content assessments of off-lattice conformational estimates.

In the companion article following, we extend these calculations to include comparison with force fields as well, showing how the use of information theory allows direct comparison of quite diverse techniques.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We are grateful for helpful discussions with Scott Pegg and Kevin Masukawa.

This work was supported by a grant from the National Science Foundation (No. CHE-0118481), R. Kip Guy, Principal Investigator; a National Institutes of Health grant (No. RR019864), C. Pancerella, Principal Investigator; and a University of California President's Dissertation Year Fellowship (to T.A.).

Submitted on October 12, 2004; accepted for publication August 15, 2005.


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
1. Luzzati, V. 1952. Statistical treatment of errors in the determination of crystalline structures. Acta Crystallogr. 5:802–810.[CrossRef]

2. Stroud, R. M., and E. B. Fauman. 1995. Significance of structural changes in proteins: expected errors in refined protein structures. Protein Sci. 4:2392–2404.[Abstract]

3. Brunger, A. T. 1992. Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature. 355:472–475.[CrossRef]

4. Berger, B., J. Kleinberg, and T. Leighton. 1999. Reconstructing a three-dimensional model with arbitrary errors. J. ACM. 46:212–235.[CrossRef]

5. Levitt, M., and M. Gerstein. 1998. A unified statistical framework for sequence comparison and structure comparison. Proc. Natl. Acad. Sci. USA. 95:5913–5920.[Abstract/Free Full Text]

6. Park, B., and M. Levitt. 1996. Energy functions that discriminate x-ray and near-native folds from well-constructed decoys. J. Mol. Biol. 258:367–392.[CrossRef][Medline]

7. Carothers, J. M., S. C. Oestreich, J. H. Davis, and J. W. Szostak. 2004. Informational complexity and functional activity of RNA structures. J. Am. Chem. Soc. 126:5130–5137.[CrossRef][Medline]

8. Sullivan, D. C., T. Aynechi, V. A. Voelz, and I. D. Kuntz. 2003. Information content of molecular structures. Biophys. J. 85:174–190.[Abstract/Free Full Text]

9. Shannon, C. E. 1948. A mathematical theory of communication. Bell Sys. Tech. J. 27:379–423,623–656.

10. Aynechi, T., and I. D. Kuntz. 2005. An information theoretic approach to macromolecular modeling: II. Force fields. Biophys. J. 89:3008–3016.[Abstract/Free Full Text]

11. Bonneau, R., and D. Baker. 2001. Ab initio protein structure prediction: progress and prospects. Annu. Rev. Biophys. Biomol. Struct. 30:173–189.[CrossRef][Medline]

12. Vingron, M., and M. S. Waterman. 1994. Sequence alignment and penalty choice. Review of concepts, case studies and implications. J. Mol. Biol. 235:1–12.[CrossRef][Medline]

13. Dill, K. A., S. Bromberg, K. Yue, K. M. Fiebig, D. P. Yee, P. D. Thomas, and H. S. Chan. 1995. Principles of protein folding—a perspective from simple exact models. Protein Sci. 4:561–602.[Abstract]

14. Wroe, R., E. Bornberg-Bauer, and H. S. Chan. 2005. Comparing folding codes in simple heteropolymer models of protein evolutionary landscape: robustness of the superfunnel paradigm. Biophys. J. 88:118–131.[Abstract/Free Full Text]

15. Habeck, M., M. Nilges, and W. Rieping. 2005. Replica-exchange Monte Carlo scheme for Bayesian data analysis. Phys. Rev. Lett. 94:018105.[CrossRef][Medline]

16. Solis, A. D., and S. Rackovsky. 2000. Optimized representations and maximal information in proteins. Proteins. 38:149–164.[CrossRef][Medline]

17. Needleman, S. B., and C. D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443–453.[CrossRef][Medline]

18. Smith, T. F., and M. S. Waterman. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195–197.[CrossRef][Medline]

19. Apostolico, A., and R. Giancarlo. 1998. Sequence alignment in molecular biology. J. Comput. Biol. 5:173–196.[Medline]

20. Qian, B., and R. A. Goldstein. 2002. Optimization of a new score function for the generation of accurate alignments. Proteins. 48:605–610.[CrossRef][Medline]

21. Altschul, S. F. 1998. Generalized affine gap costs for protein sequence alignment. Proteins. 32:88–96.[CrossRef][Medline]

22. Koretke, K. K., Z. Luthey-Schulten, and P. G. Wolynes. 1996. Self-consistently optimized statistical mechanical energy functions for sequence structure alignment. Protein Sci. 5:1043–1059.[Abstract]

23. Lesk, A. M., M. Levitt, and C. Chothia. 1986. Alignment of the amino acid sequences of distantly related proteins using variable gap penalties. Protein Eng. 1:77–78.[Free Full Text]

24. Benner, S. A., M. A. Cohen, and G. H. Gonnet. 1993. Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J. Mol. Biol. 229:1065–1082.[CrossRef][Medline]

25. Zachariah, M. A., G. E. Crooks, S. R. Holbrook, and S. E. Brenner. 2005. A generalized affine gap model significantly improves protein sequence alignment accuracy. Proteins. 58:329–338.[CrossRef][Medline]

26. Qian, B., and R. A. Goldstein. 2001. Distribution of index lengths. Proteins. 45:102–104.[CrossRef][Medline]

27. Strait, B., and T. Dewey. 1996. The Shannon information entropy of protein sequences. Biophys. J. 71:148–155.[Abstract/Free Full Text]

28. Cline, M. S., K. Karplus, R. H. Lathrop, T. F. Smith, R. G. Rogers, Jr., and D. Haussler. 2002. Information-theoretic dissection of pairwise contact potentials. Proteins. 49:7–14.[CrossRef][Medline]

29. Helling, R., H. Li, R. Melin, J. Miller, N. Wingreen, C. Zeng, and C. Tang. 2001. The designability of protein structures. J. Mol. Graph. Model. 19:157–167.[CrossRef][Medline]

30. Lau, K. F., and K. A. Dill. 1990. Theory for protein mutability and biogenesis. Proc. Natl. Acad. Sci. USA. 87:638–642.[Abstract/Free Full Text]

31. Chothia, C., and A. M. Lesk. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5:823–826.[Medline]

32. Gan, H. H., R. A. Perlow, S. Roy, J. Ko, M. Wu, J. Huang, S. Yan, A. Nicoletta, J. Vafai, D. Sun, L. Wang, J. E. Noah, S. Pasquali, and T. Schlick. 2002. Analysis of protein sequence/structure similarity relationships. Biophys. J. 83:2781–2791.[Abstract/Free Full Text]

33. Sullivan, D. C., and I. D. Kuntz. 2001. Conformation spaces of proteins. Proteins. 42:495–511.[CrossRef][Medline]

34. Irback, A., and E. Sandelin. 2000. On hydrophobicity correlations in protein chains. Biophys. J. 79:2252–2258.[Abstract/Free Full Text]

35. Cui, Y., W. H. Wong, E. Bornberg-Bauer, and H. S. Chan. 2002. Recombinatoric exploration of novel folded structures: a heteropolymer-based model of protein evolutionary landscapes. Proc. Natl. Acad. Sci. USA. 99:809–814.[Abstract/Free Full Text]

36. Sullivan, D. C., and I. D. Kuntz. 2004. Distributions in protein conformation space: implications for structure prediction and entropy. Biophys. J. 87:113–120.[Abstract/Free Full Text]

37. Li, H., C. Tang, and N. S. Wingreen. 2002. Designability of protein structures: a lattice-model study using the Miyazawa-Jernigan matrix. Proteins. 49:403–412.[CrossRef][Medline]

38. Henikoff, S., and J. G. Henikoff. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA. 89:10915–10919.[Abstract/Free Full Text]

39. Dayhoff, M. O. 1978. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring, MD. 345–352.

40. Gonnet, G. H., M. A. Cohen, and S. A. Benner. 1992. Exhaustive matching of the entire protein sequence database. Science. 256:1443–1445.[Abstract/Free Full Text]

41. Overington, J., D. Donnelly, M. S. Johnson, A. Sali, and T. L. Blundell. 1992. Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci. 1:216–226.[Abstract]

42. Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275–282.[Abstract/Free Full Text]

43. Blake, J. D., and F. E. Cohen. 2001. Pairwise sequence alignment below the twilight zone. J. Mol. Biol. 307:721–735.[CrossRef][Medline]

44. Doolittle, R. F. 1986. Of Urfs and Orfs: A Primer on How to Analyze Derived Amino Acid Sequences. University Science Books, Mill Valley, CA.

45. Mallick, P., D. Rice, and D. Eisenberg. DAPS: database of distant aligned protein structures. http://www.doe-mbi.ucla.edu/~parag/DAPS/,2001.





This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Aynechi, T.
Right arrow Articles by Kuntz, I. D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Aynechi, T.
Right arrow Articles by Kuntz, I. D.


HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
Copyright © 2005 by the Biophysical Society.