help button home button Biophys. J.
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Giuliani, A.
Right arrow Articles by Colosimo, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Giuliani, A.
Right arrow Articles by Colosimo, A.

Biophys J, January 2000, p. 136-149, Vol. 78, No. 1

Nonlinear Methods in the Analysis of Protein Sequences: A Case Study in Rubredoxins

Alessandro Giuliani,* Romualdo Benigni,* Paolo Sirabella,dagger Joseph P. Zbilut,Dagger and Alfredo Colosimodagger

 *TCE Laboratory, Istituto Superiore di Sanitá, 00161 Roma, Italy;  dagger Department of Biochemical Sciences, University of Roma "La Sapienza," 00185 Roma, Italy; and  Dagger Department of Molecular Biophysics and Physiology, Rush University, Chicago, Illinois 60612 USA

    ABSTRACT
TOP
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
RESULTS
DISCUSSION
APPENDIX
REFERENCES

Two computational methods widely used in time series analysis were applied to protein sequences, and their ability to derive structural information not directly accessible through classical sequence comparisons methods was assessed. The primary structures of 19 rubredoxins of both mesophilic and thermophilic bacteria, coded with hydrophobicity values of amino acid residues, were considered as time series and were analyzed by 1) recurrence quantification analysis and 2) spectral analysis of the sequence major eigenfunctions. The results of the two methods agreed to a large extent and generated a classification consistent with known 3D structural characteristics of the studied proteins. This classification separated in a clearcut manner a thermophilic protein from mesophilic proteins. The classification of primary structures given by the two dynamical methods was demonstrated to be basically different from classification stemming from classical sequence homology metrics. Moreover, on a more detailed scale, the method was able to discriminate between thermophilic and mesophilic proteins from a set of chimeric sequences generated from the mixing of a mesophilic (Rubr Clopa) and a thermophilic (Rubr Pyrfu) protein. Overall, our results point to a new way of looking at protein sequence comparisons.

    INTRODUCTION
TOP
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
RESULTS
DISCUSSION
APPENDIX
REFERENCES

The relationship of the physiological role of proteins with their primary structure is a crucial issue in molecular biology. It is well known that both the 3D structure and the physiological role of proteins is heavily dependent upon the particular linear arrangements of amino acid residues along polypeptide chains, or primary structures (Anfinsen, 1973; Dayhoff et al., 1978; Sweet and Eisenberg, 1983). Despite the recent progress in the field, fully documented in the series of Critical Assessment of Techniques of Proteins Structure Prediction (CASP) meetings (Murzin and Patthy, 1999), attempts to derive general rules in predicting 3D structure and physiological features based on protein sequences cannot be considered fully satisfactory as yet (Micheletti et al., 1998). Thus, we decided to tackle this problem with a local approach: instead of looking for general rules, we tried to develop a methodology able to derive local structure/sequence relationship models, with the ultimate goal of predicting physiological properties by means of sequence information within properly selected sets of proteins. This approach is similar to one used in medicinal chemistry (Hansch, 1993; Martin, 1981), where quantitative models for predicting pharmacological properties of organic molecules from their physicochemical and structural properties have been routinely used for the past three decades (Hansch et al., 1962; Hansch and Leo, 1995). The success of quantitative structure activity relationship (QSAR) models has been demonstrated (Martin, 1981; Hansch and Leo, 1995; Hansch et al., 1996) to be closely linked to the use of strictly congeneric molecules, i.e., of the same class.

From an operational point of view, a QSAR procedure is based on the correct selection of two components in a prediction model: 1) a meaningful set of "x" variables (regressors of the model, descriptors spanning the physicochemical space in which the molecules are located); and 2) an adequate statistical technique to quantitatively tackle the problem of both physicochemical space description and biological activity prediction (Martin, 1981; Franke, 1984). In the present case, the above elements are 1) hydrophobicity, as the physicochemical descriptor of amino acid residues, and 2) time series analysis as a general strategy to describe the proteins' primary structures. [We note that from a practical standpoint, spatially ordered series are equivalent to time-ordered series for the purpose of analysis; see (Zbilut et al., 1998a)].

The relevance of hydrophobicity for protein folding is well known (Anfinsen, 1973; Li et al., 1997; Micheletti et al., 1998; Sweet and Eisenberg, 1983). Additionally, hydrophobicity is the only physicochemical feature that displays a nonrandom ordering along protein sequences (Weiss and Herzel, 1998; von Heijne, 1982), and may thus be considered the best candidate for application of time series analysis methods. The amino acid residues along protein chains were coded in terms of their hydrophobicity, expressed as log P, P being the partition coefficient between octanol and water (Franke, 1984). Thus, our analysis focused on the hydrophobicity series with the idea of using hydrophobicity as the "semantics" attached to the residues' ordering along a chain. [The semantic character of a physicochemical property should reflect both energetic interchange with elements present in the environment (e.g., water molecules) and entropic rearrangements induced in them.] An additional goal was to check whether our approach was able to generate different, complementary information for protein classification, with respect to standard sequence comparison methods (Pearson and Lipman, 1988; Wilbur and Lipman, 1983; Thompson et al., 1997).

With respect to computational aspects, the ability of recurrence quantification analysis (RQA) to deal with a sequence/function relationship problem has been recently demonstrated (Zbilut et al., 1998b), while almost at the same time, Mandell and his co-workers (1997; 1998; Selz et al., 1998) reported that singular value decomposition (SVD) in concert with spectral analysis might be able to provide useful structural 3D information from sequence data. Thus, we combined the two approaches to obtain a global representation of hydrophobicity patterns along a sequence. These two main methods were supplemented with three other relatively simple descriptors, namely 1) standard deviation (SD), 2) absolute value of the Pearson correlation coefficient between adjacent residues (R), and 3) algorithmic complexity of the series (Kaspar and Schuster, 1987) as estimated by the Lempel-Ziv algorithm (LZ).

The entire set of descriptors was filtered by principal component analysis (PCA) (Harman, 1976) in order to obtain a set of orthogonal axes on which to project the studied sequences (dynamical space). [We recognize the essential mathematical equivalence between SVD and PCA. In the present context, we use the term PCA to distinguish this step in the algorithm from the SVD plus spectral analysis step.] The chosen "congeneric series" of proteins consists of 19 rubredoxins (Sieker et al., 1994), and corresponds to all the bacterial rubredoxins whose primary structures were known at the time of our analysis. The biological feature to be predicted is the exceptional stability of the rubredoxin from Pyrococcus furiosus, a thermophile bacterium living at very high temperatures.

Finally, the thermostability of rubredoxins was also investigated at a much finer level of detail by using our approach to evaluate the results by Eidsness et al. (1997), who synthesized six chimeric proteins originating from a thermophilic sequence and a mesophilic one. These authors observed the splitting of the six chimeric proteins into a thermophilic subset and a mesophilic subset, although a specific mechanistic explanation for this behavior could not be found (Eidsness et al., 1997). Even under such stringent sensitivity requirements, the predictive abilities of our approach were successful.

    MATERIALS AND METHODS
TOP
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
RESULTS
DISCUSSION
APPENDIX
REFERENCES

The biochemical problem

Rubredoxins are probably the simplest members of the ubiquitous and huge family of redox metalloenzymes (Sieker et al., 1994) and consist of a relatively short polypeptide chain (~53 AA) endowed with a prosthetic group in the form of a ferrous/ferric ion tetrahedrically bound to four cysteine (Cys) residues. Even though their exact metabolic role in anaerobic cells has not yet been fully clarified, their structural features are fairly well known, and in Fig. 1 A the primary structures of 19 bacterial rubredoxins are reported: they are quite similar, and >20% of the residues are strictly conserved, among which are the four Cys residues of the active site and the five aromatic residues that constitute the hydrophobic core of the proteins. Such a high level of homology, however, does not find a match in the huge spectrum of thermal stability of the bacterial strains from which they are extracted. In particular, it has been impossible up to now to find a rationale, on the basis of the primary as well as of the tertiary structures, for the fact that the rubredoxin from Pyrococcus furiosus (Rubr Pyrfu), which lives normally at 90°C, has a half-time of thermal denaturation of 400 h at 92°C, as compared to 6 h of Clostridium pasteurianum (Rubr Clopa) rubredoxin, which is the most similar to it in terms of 3D structure. In the same figure the phylogenetic tree of rubredoxins corresponding to their best alignment and the structure of the chimeric rubredoxins obtained by Eidsness et al. (1997) starting from Rubr Clopa and Rubr Pyrfu are also reported.



View larger version (64K):
[in this window]
[in a new window]
 
FIGURE 1   Primary structures of rubredoxins used in this work. (A) The best alignment of the primary structures of 19 rubredoxins as provided by the Clustal X program (see text). The reported sequences are all the known bacterial rubredoxins at the time of the completion of the present work (Sept. 1998) and derive from the Swiss Prot data base (see also Table 2). (B) The phylogenetic tree corresponding to the alignment in (A) as given by the neighbor joining method of Saitou and Nei (1987) and (Page, 1996). (C) The sequence of the six chimeric rubredoxins obtained by Eidsness et al. (1997) starting from Rubr Pyrfu and Rubr Clopa. The correspondence between the nomenclature in the original paper and in this one (from chim1 to chim6) is the following: Cp15|Pf, [M1  ,K2A, P15E]Cp, Cp15|Pf47|Cp, Pf15|Cp47|Pf, [  1M]Pf, Pf47|Cp. The last column in (C) reports the spectroscopic lifetime of each chimeric structure at 92°C estimated from 490 nm absorbance changes by Eidsness et al. (1997).

Data analysis

Dynamical methods

Each sequence was coded in terms of the amino acid hydrophobicities (Franke, 1984). The numerical discrete series corresponding to the protein sequence was then submitted to a 4D embedding by the method of delays (Broomhead and King, 1986). The embedding procedure consists in building an n-columns matrix (in our case n = 4) out of the original linear array, by shifting the series by a fixed lag. For example, given the series 10, 11, 21, 32, 41, 35, 40, 19... the corresponding 4D embedding space at lag of 1 (the discrete character of amino acid sequences dictates this choice) is:
<AR><R><C>10</C><C>11</C><C>21</C><C>32</C></R><R><C>11</C><C>21</C><C>32</C><C>41</C></R><R><C>21</C><C>32</C><C>41</C><C>35</C></R><R><C>32</C><C>41</C><C>35</C><C>40</C></R><R><C>41</C><C>35</C><C>40</C><C>19</C></R><R><C>35</C><C>40</C><C>19</C><C>·</C></R><R><C>40</C><C>19</C><C>·</C><C>·</C></R><R><C>19</C><C>·</C><C>·</C><C>·</C></R></AR>
The rows of the embedding matrix (EM) correspond to subsequent windows of length 4 (embedding dimension) along the sequence. The choice of the embedding dimension was dictated by a balance between the need for having a window large enough to keep track of between-residues interactions and on relying---at the same time---on a sufficiently long series (Broomhead and King, 1986; Webber and Zbilut, 1994). Notice that the last n values are eliminated from the analysis as an obvious consequence of shifting the series for the embedding. Moreover, the four-residues window was demonstrated (Strait and Dewey, 1996) through the application of a formalism derived from information theory, to constitute an upper limit for the information content of protein sequences. [We note that Kumosinski and Liebman (1994) have previously explored the use of a somewhat related methodology.] After the common step of building an EM, RQA and SVD diverge. While RQA, being based on the EM rows, assumes a local viewpoint over the time series (Giuliani et al., 1998; Zbilut et al., 1998a) SVD, being based on EM columns and dealing with average regularities, provides a global view of the series (Broomhead and King, 1986; Mandell et al., 1997). The former and the latter method are then particularly sensitive (Trulla et al., 1996), and relatively insensitive, respectively, to small perturbations and provide a complementary view of the autocorrelation structure of the time series (Zbilut et al., 1998a).

RQA-based descriptors. RQA was first introduced in physics by Eckmann et al. (1987) as a purely graphical technique (see the Appendix). Subsequently, Zbilut and Webber (1992) enhanced the technique by defining five nonlinear quantitative descriptors of the recurrence plot that were found to be diagnostically useful in the quantitative assessment of time series structure in fields ranging from molecular dynamics to physiology (Giuliani and Manetti, 1996; Webber and Zbilut, 1994; Faure and Korn, 1997).

The RQA descriptors used in this work are:
1. MDIST. Average Euclidean distance between the rows of EM;
2. REC (percent recurrence). This measure quantifies the fraction of the plot filled by recurrent points. It corresponds to the fraction of recurrent pairs over all the possible pairs of epochs or, equivalently, to the fraction of pairwise distances below the chosen radius among all the computed distances;
3. DET (percent determinism). This is the percentage of sequential recurrent points that form diagonal line structures in the distance matrix. DET corresponds to the amount of patches of similar hydrophobic/hydrophilic characteristics along the sequence;
4. ENT (entropy).The entropy is defined here in terms of the Shannon-Weaver formula for information entropy computed over the distribution of length of the lines of recurrent points and measures the richness of deterministic structures of the series;
5. MAXLINE (maximal line).This index is simply the length (in terms of consecutive points) of the longest recurrent line in the plot, and is inversely related to the largest positive Ljapunov exponent.

In Fig. 2 the recurrence plots of four rubredoxin sequences are shown. In addition to the above-mentioned descriptors, the distribution of recurrent points in the plot was quantified by recurrence displacement histograms (see Fig. 3), which report the average recurrence of the plot as a function of the displacement from the identity line (main diagonal). These histograms highlight the presence of regularities (quasi-periodic structures) in the distribution of recurrences along the series.



View larger version (30K):
[in this window]
[in a new window]
 
FIGURE 2   Recurrence plots of four rubredoxin sequences. The top line reports the recurrence plots of two thermophilic proteins (Rubr Pyrfu and chim6, see text for details) while in the bottom line the plots of two mesophilic proteins are shown (Rubr Clopa and chim2). The axes of the plots correspond to the residues' numbering along the polypeptide chain. Each time a recurrence is scored between a residue pair, the corresponding location is darkened. The darkened main diagonal corresponds to the trivial recurrence between each residue with itself. The plot derives its symmetric character from the symmetry of the distance operator. The different recurrence patterns between thermophilic and mesophilic sequences is common to all the analyzed structures (see text for details).



View larger version (18K):
[in this window]
[in a new window]
 
FIGURE 3   Recurrence displacement histograms of Rubr Pyrfu and Rubr Clopa. The average recurrence of the sequence is displayed as a function of the displacement from the main diagonal of the recurrence plot. Given that the main diagonal of the recurrence plot corresponds to the identity in time (i.e., sequence position), the displacement histograms can be equated to a sort of local autocorrelation integral function. The kurtosis of recurrence displacement was used to quantitatively discriminate mesophilic and thermophilic structures.

SVD-based sequence descriptors. SVD is a well known technique of time series analysis (Broomhead and King, 1986) and we refer to other papers for its implementation in the case of hydrophobicity plots (Mandell et al., 1997, 1998; Selz et al., 1998). Here it is sufficient to say that the technique involves the computation of the first three eigenfunctions (principal components) of the EM and the subsequent projection of the original series on the component space. The projection is then submitted to a complex pole maximum entropy spectral analysis.

In order to derive quantitative descriptors of these spectra, oblique principal components analysis (OPC, VARCLUS procedure in SAS) (Anderberg, 1973; Harman, 1976) was applied to the digitized SVD spectra of the 19 sequences under study (see the Appendix). OPC subdivided the 19 spectra into four clusters. Consequently, for each sequence, four coefficients (SP1-SP4) were computed, which correspond to the Pearson correlation coefficients of the respective spectrum with the four clusters that provide a global description of spectral shape. SP1-SP4 were used to parameterize the spectral information.

Other sequence descriptors. The SD of the residue hydrophobicities, the absolute value of the Pearson correlation coefficient (R) of the hydrophobicity of adjacent residues, and the LZ parameter (Kaspar and Schuster, 1987) of the sequences constituted the last three descriptors of the sequence space (Table 1). SD measures the relative heterogeneity of the amino acid hydrophobicity, and is a purely statistical (shuffling-resistant, order independent) index; in contrast, R is order-dependent and estimates the short-range regularities in the sequence. The LZ index is an easily calculable estimate of the algorithmic complexity of a symbolic sequence (Kaspar and Schuster, 1987), and is strictly dependent upon the ordering of amino acids in the primary structure (see the Appendix).


                              
View this table:
[in this window]
[in a new window]
 
TABLE 1   Dynamical variables used in this work

Sequence homology methods

The final output of any quantitative sequence comparison method is a distance matrix whose elements contain the estimated divergence between the sequences in the corresponding row and column (Pearson and Lipman, 1988). A popular graphical representation of such matrices, especially useful to visualize evolutionary relationships, is in the form of dendrograms (Sneath and Sokal, 1973). The distance matrices relative to symbolically coded primary structures used in this work have been generated through the Clustal X (Thompson et al., 1997) program. The standard alignment algorithm in Clustal X can be enriched by two optional refinements: the NoGap (NG) and the multiple substitution (MS) options. The former does not allow any gap in the global alignment: this guarantees that "like" is always compared to "like," although it may exclude much of the information in the global alignment if many gaps are present. The latter (MS) is particularly useful for largely divergent sequences. In such a case, in fact, it becomes increasingly likely (as a result of equally time-spaced random events) that more than one substitution will have happened at the same site (Saitou and Nei, 1987; Higgins and Sharp, 1989). In any case, the strong correlation we observed among all the three sequence homology metrics in our data set (HOMOL1-HOMOL3 in Table 6), points to a substantial stability and algorithm-independence of the sequence alignments for rubredoxins.

    RESULTS
TOP
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
RESULTS
DISCUSSION
APPENDIX
REFERENCES

The RQA and SVD methods generated 13 variables (Table 1) describing the protein hydrophobicity sequences from a dynamical perspective. In order to reduce the number of variables to a more manageable size, and to identify the effective (orthogonal) axes spanning the space under study, the data set was analyzed by PCA. In order to check for consistency and robustness of the dynamical descriptors, two separate PCAs were performed using 1) all 13 variables and 2) a "reduced" set lacking the SP1-SP4 variables.

The first four principal components extracted from the complete set of variables were used to define a suitable space in which the structure-function relationships of interest could be identified. The correlation between classical sequence comparison and the dynamical methods was performed by a simple Pearson coefficient between the distance matrices corresponding to 1) each of the three investigated sequence homology metrics, and 2) the 4D space of the major principal components extracted from the dynamical descriptors. The data matrix for the global dynamical characterization of the 19 sequences is reported in Table 2.


                              
View this table:
[in this window]
[in a new window]
 
TABLE 2   Dynamical descriptors of rubredoxins used in this work

The set of variables shown in Table 2 without the SP1-SP4 variables (thus mainly based on RQA), and submitted to PCA, produced a four-component solution explaining 92% of the total variability: such a compression is due to the high correlation among descriptors. The factor loading matrix, containing the correlation coefficients between original variables and the four components, is reported in Table 3. Sketching an interpretation of the components is made possible by considering which original variables obtain the larger loadings on each component (see the legend to the table).


                              
View this table:
[in this window]
[in a new window]
 
TABLE 3   Factor loadings of the reduced set of dynamical variables

The SP1-SP4 variables, excluded from the above analysis, deal with the classification of the sequences into four well separated families of spectra (Fig. 4) and, from a computational point of view, are based on a completely different approach (see the Appendix). Moreover, since FD provides a poor representation of the spectral shapes, including SP1-SP4 in the PCA analysis implies the use of brand-new information. This is true even from a purely statistical point of view, given that SP1-SP4 are "almost" mutually orthogonal (see Methods). The "spectral shape" information has approximately the same dimensionality (four) as the PC1r, ... , PC4r space. Thus its addition to the reduced set of variables could, in principle, strongly perturb its PCA solution. If, however, the PCA solution of the complete set of variables remains substantially the same, this suggests an internal consistency of the dynamical descriptors.



View larger version (25K):
[in this window]
[in a new window]
 
FIGURE 4   Representative elements of rubredoxins' spectral classes generated by OPC analysis. The figure reports in arbitrary units (Arb) the SVD spectra of the most representative (highest correlation coefficient) elements of the spectral clusters generated by OPC analysis over rubredoxins' primary structures. The correlation coefficients (SP1-SP4) between each spectrum and the four clusters were used as synthetic quantitative descriptors of SVD analysis of sequences together with the dominant frequency (FD).

Table 4 shows that PCA of the reduced set of variables produces a practically equivalent solution as compared to the complete set. The latter solution explains 86% of total variability and shows a one-to-one correspondence (Pearson r) with the components of the reduced set (Table 5). Hence, the new set of components (PC#f) can be faithfully used as dynamical descriptors of the protein data set (for the meaning of the components see the legend to Table 4).


                              
View this table:
[in this window]
[in a new window]
 
TABLE 4   Factor loadings of the full set of dynamical variables


                              
View this table:
[in this window]
[in a new window]
 
TABLE 5   Correlation matrix between the reduced (PC#r) and the full set (PC#f) of dynamical components

The relationship between the dynamical description of proteins and their sequence homology description has been checked by correlating two sets of "between-sequences" distance matrices. One set includes three distance matrices (HOMOL1, HOMOL2, HOMOL3) generated by three different sequence homology metrics; the other set includes just one member, i.e., the Euclidean distance matrix between the same 19 sequences in the 4D PC1f/PC4f space (DYNAM). The four distance matrices are directly comparable by means of a Pearson coefficient (Table 6). The three sequence-matching algorithms are completely equivalent (Pearson r approx  1), while the dynamical description is proven to be practically independent from them (Pearson r approx  0.22).


                              
View this table:
[in this window]
[in a new window]
 
TABLE 6   Correlation between sequence homology and dynamical variables distance matrices of rubredoxins

The internal consistency of the dynamical description of the hydrophobicity profiles provides a solid basis for the attempt to recognize known structural and/or functional features inside the "dynamical" space. The most relevant feature to look for is, in the present case, the high thermal stability of Rubr Pyrfu rubredoxin. Looking at the rubredoxins' location in the PC2f-PC3f plane, shown in Fig. 5, Rubr Pyrfu rubredoxin is located at an extreme of both the second (PC2f) and the third (PC3f) axes of the dynamical space, whereas such a peripheral location does not find a counterpart in the sequence alignment metrics (Fig. 1 B) where Rubr Pyrfu rubredoxin is located well inside the mesophilic sequence space.



View larger version (11K):
[in this window]
[in a new window]
 
FIGURE 5   Distribution of rubredoxins' sequences in a principal components' space. Notice the "unique" character of Rubr Pyrfu and the structural resemblance between Rubr Clopa and Rubr Pyrfu which are relatively close in the PC2f-PC3f plane.

Since Rubr Clopa rubredoxin is the nearest neighbor of Rubr Pyrfu rubredoxin in the PC2f-PC3f plane, the 3D structural similarities between Rubr Pyrfu and Rubr Clopa rubredoxins is recognized by the dynamical description (Fig. 5), whereas the sequence alignment metrics fail in this respect. The modes of the hydrophobicity spectrum were interpreted (Selz et al., 1998) as secondary and supersecondary structural features of the protein molecule; from a purely computational point of view, the modes correspond to peaks in the autocorrelation structure of the series. In the RQA perspective, the presence of peaks of autocorrelation at well-defined scales can be appreciated by plotting the amount of recurrence (i.e., the number of recurrent pairs) on the relative displacement along the chain of all the residue pairs. The recurrences' displacement histograms of Rubr Pyrfu and Rubr Clopa reported in Fig. 3 provide a much closer look at the recurrence structure of the two sequences as compared to REC or DET. We relied on these histograms to tackle the "high resolution" part of our analysis; i.e., the discrimination between mesophilic and thermophilic structures in the space spanned by Rubr Clopa, Rubr Pyrfu and the six chimeras (Eidsness et al., 1997). This problem is on a much more detailed scale than the previous one, since all the sequences are located in a small fraction of the component space (Fig. 6). Thus we needed a finer look at the recurrence plots in order to solve it.



View larger version (10K):
[in this window]
[in a new window]
 
FIGURE 6   Distribution of native and chimeric rubredoxins in a principal components' space. The chimeric sequences were added as external elements (test set) to the PC2f-PC3f plane spanned by the 19 rubredoxins (training set). As expected, the chimeric sequences (+), labeled as in Fig. 1 C, occupy a small portion of the space corresponding to the area between Rubr Pyrfu and Rubr Clopa (their parental sequences).

A simple inspection of Fig. 2 reveals a macroscopic difference between "thermophilic" and "mesophilic" recurrence plots: while thermophilic hydrophobicity series display a regular distribution of recurrences along lines parallel to the main diagonal, reminiscent of a quasi-periodic distribution of similar hydrophobicity patterns; mesophilic series display a dense clustered distribution of recurrences. Such a difference is present in all the examined structures and, in order to quantify it, we computed the kurtosis of the recurrence displacement distribution for all the sequences. The results, reported in Fig. 7, show that the thermophilic proteins have a low kurtosis corresponding to a quasi-normal distribution of recurrences (a normal distribution has a kurtosis equal to 3) as compared to the relatively higher kurtosis of mesophilic structure corresponding to a peaked (or clustered) distribution of recurrences. Thus the kurtosis of recurrence displacement distributions is a simple index allowing for a clearcut separation of the two stability behaviors on a quantitative, completely data-driven, basis.



View larger version (15K):
[in this window]
[in a new window]
 
FIGURE 7   Kurtosis of the recurrences' distribution of mesophilic and thermophilic rubredoxins' recurrence plots. Thermophilic proteins display a uniformly lower kurtosis than mesophilic ones; given that kurtosis is an index of the relative "peaked" character of a distribution, this result points to a more concentrated location of recurrences in mesophilic proteins with respect to thermophilic ones. The line marks the expected kurtosis for a Gaussian distribution.

From a structural point of view, this means that the thermophilic sequences have a "multiple scale" correlation structure made up of both short and long range correlations as compared to the single-scale correlation (just one peak of recurrences) in mesophilic sequences. The 3D structural counterpart of such a pattern are shown in Fig. 8 and are further discussed below.



View larger version (44K):
[in this window]
[in a new window]
 
FIGURE 8   Location of recurrent fragments over tertiary structures in Pyrfu and Clopa rubredoxins. The figure can be considered the structural counterpart of Fig. 2 and reports the 3D structure of both Rubr Pyrfu (A) and Rubr Clopa (B). The colors identify in the 3D structure of both Rubr Pyrfu and Rubr Clopa the long-range recurrences forming the deterministic lines of the recurrence plots. It is apparent how very similar REC, DET values (3.37, 53.49, and 3.84; 51.06 for Rubr Clopa and Rubr Pyrfu, respectively) go hand-in-hand with a completely different distribution of recurrences in the 3D space: widespread for Rubr Pyrfu, concentrated for Rubr Clopa.

    DISCUSSION
TOP
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
RESULTS
DISCUSSION
APPENDIX
REFERENCES

The main result of the present study is that consideration of the physicochemical semantics (hydrophobicity profiles) of protein primary structures, together with the computation of order-dependent dynamical descriptors, generates completely novel information with respect to classical homology analysis, and indicates new exploration pathways for studying sequence/function relationships of proteins. This has been demonstrated by showing that 1) at difference with classical best-alignment methods, a classification of 19 bacterial rubredoxin primary structures is able to unequivocally single out the only thermophilic element of the set; and that 2) thermal sensitivity can be modeled, on a finer scale of six chimeric sequences derived from a thermophilic and a mesophilic protein, by the analysis of recurrence plots of the corresponding hydrophobicity profiles.

It is worth noting the different assumptions at the basis of sequence alignment and dynamical metrics for the classification of protein sequences. Sequence alignment metrics generate a "phylogenetic space" (Kimura, 1983; Saitou and Nei, 1987) in which proteins are compared in terms of the number and location of mutations needed to go from one structure to another and, in so doing, implicitly measure the underlying evolutionary process. On the contrary, the dynamical approach compares sequences in terms of their resemblance in the global ordering of hydrophobicity values along a chain that 1) emphasizes a physiological, synchronous view; and 2) implies that a similar kind of ordering can be achieved even by a relatively diverse use of amino acid residues in different proteins. These features may explain the statistical independence of the two metrics in the considered data set.

The premise at the basis of the dynamical approach is that the primary structure "encodes" the folding features of proteins and that the "folding code" can be reconstructed by order-dependent descriptors of polypeptide sequences (Mandell et al., 1997; Selz et al., 1998). In this respect, the dynamical approach can be considered a statistical "mean field" approximation, which, through the agency of time series analysis methods, phenomenologically captures an "equilibrium" folding state. This assumes that the individual amino acid residues are part of an interacting system with long-range correlations, which results in scaling and critical behavior. [Some evidence for such a view are demonstrated by Manetti et al. (1999).] In such a frame, hydrophobicity has been considered the variable of election for energy minimization, due to the vast amount of experimental and theoretical evidence (Weiss and Herzel, 1998; Li et al., 1997; von Heijne, 1982; Sweet and Eisenberg, 1983) pointing to it as the most crucial parameter in determining protein 3D structures.

Focusing on global properties of a sequence (dynamical approach), as opposed to local features (sequence homol-ogy), changes the point of view on protein sequence/function relationships. The holistic character of the dynamical approach forces the analysis to the congeneric series level; i.e., at the level of proteins having the same kind of activity and comparatively similar structures. In this common frame of sequence/activity relationships, the dynamical approach can be very useful (Zbilut et al., 1998b) to model the modulation of such relationships. The need for studying such homogeneous classes is a direct consequence of the fact that the same values of dynamical descriptors can be reached by completely different sequences pertaining to unrelated proteins, which is another important resemblance between the proposed approach and medicinal chemistry QSAR studies. Even in this field, in fact, the global values of physicochemical descriptors used to model and predict biological activity of organic molecules [e.g., octanol/water partition coefficient, highest occupied molecular orbital (HOMO), etc.] can assume the same value for very different structures (Franke, 1984) and constrains the analysis within a particular congeneric series of molecules with the same basic "skeleton" and different substituents (Franke, 1984; Martin, 1981; Hansch, 1993). The congeneric series paradigm allows, however, for a nonequivocal definition of the biological activity to be modeled: all the considered organic molecules are potentially active, since all of them possess the basic determinants of that biological activity and only differ from each other in comparatively minor elements (substituents) modulating the activity. This allows for the use of observed differences in the physicochemical descriptors to model such modulation (Franke, 1984; Hansch and Leo, 1995; Martin, 1981).

In the present case we adopted the same paradigm, taking as a congeneric series the rubredoxin family, within which we modeled the thermophilic character. The problem of predicting the thermal stability of proteins could be posed in many general ways, although to our knowledge no general determinant of thermal stability, independent from a specific context, has yet been identified (Adams and Keltzin, 1996). As a matter of fact, the congeneric series paradigm showed itself to be amazingly successful even in the difficult case of the chimeric protein set, which spanned a much more limited portion of the sequence space as compared to the rubredoxin classifications (Fig. 6), and needed a much more detailed dynamical description. This was obtained by shifting from an average description of recurrence plots based upon synthetic indexes, to a more detailed description of the plots' shape (distribution of recurrences) which only required a change in the quantification of the same basic dynamical description (recurrence plots), without shifting to a brand new parameterization. More importantly, the description maintained its "holistic" character by taking into account the entire sequence. Fig. 8 provides some help in exploiting the bulk of structural information embedded in the recurrence plots of Rubr Clopa and Rubr Pyrfu (Fig. 2), and emphasizes the different distribution of recurrent fragments, in their 3D representations, over essentially identical backbones. In the Rubr Clopa case, the concentration of recurrence lines in the same area of the recurrence plot (Fig. 2) is paralleled by a marked concentration of the long-range recurrences that mainly occur between two well-defined locations on the polypeptide chain (Fig. 8). In the Rubr Pyrfu case, there is no preferentially populated area in the recurrence plot, and Fig. 8 shows that recurrent fragments are widespread over the whole backbone. The two situations can be viewed as a different spatial distribution of the same amount of REC (3.84 and 3.37 in Rubr Pyrfu and Rubr Clopa, respectively) both in 1 and in 3D spaces. In either case, however, it is impossible to identify one (or a few) localized residues specifically responsible for the different thermal stability of the two proteins.

If it is yet difficult to infer in general terms a well-defined set of stabilizing interactions from a recurrence distribution of hydrophobicity along primary structures, it is worth noting that the information conveyed by recurrence patterns well exceeds the simple observation of repetitive chemical motifs. In the specific case of rubredoxins, in fact, the four strictly conserved short fragments centered on cysteins in positions 6, 9, 39, and 41, which bind the prosthetic groups, are all included into the deterministic fraction of recurrences both in Rubr Clopa and in Rubr Pyrfu (see Fig. 8); only in Rubr Pyrfu, however, similar deterministic patterns not immediately perceivable from periodic chemical identity of residues can be found in completely different regions.

In any case, our conclusions are in agreement with the work of Eidsness et al. (1997), who state:

"... Since our results do not identify a few dominant localized interactions, we suggest that the extraordinary thermostability of Rubr Pyrfu may involve a precise optimal alignment of a large number of residues, whose network of interactions are very sensitive to small structural changes dictated by the context of the sequence."

This statement corresponds to the peculiar quasi-periodic pattern in the recurrence plot of Rubr Pyrfu (Fig. 2), which can be, in principle, destroyed by a few mutations interrupting the lines of determinism, but cannot be considered as a "local" feature (i.e., specific amino acids responsible for thermostability do not exist), since it spans the entire sequence in the form of long range correlations. On completely different, and exclusively methodological, grounds, it is worth noting that time series analysis methods, given their holistic character, pose no constraints for the relative length of the sequences to be compared. Finally, besides the possibility of deriving useful sequence/function relationships for specific situations, the independence between dynamical and sequence homology metrics opens the way to the exploration of larger data bases, looking for functional similarities among proteins by a method complementary to sequence homology analyses.

    APPENDIX
TOP
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
RESULTS
DISCUSSION
APPENDIX
REFERENCES

Computing dynamical descriptors of hydrophobicity sequences

RQA descriptors

RQA is based on the computation of a distance matrix, DM, between any possible pair-combination of rows (epochs) of an EM. The distance matrix is then colored, darkening the pixels located at specific (i, j) coordinates, corresponding to distance values between ith and jth rows (epochs) lower than a predefined radius (for details see Giuliani and Manetti, 1996; Webber and Zbilut, 1994)].

The features of the distance function make the plot symmetric (DMi, j = DMj, i) and with a darkened main diagonal corresponding to the identity line (DMi, j = 0 when j = i) (Fig. 2). The darkened (recurrent) points single out recurrences within the series and the plot can be considered as a global picture of the autocorrelation structure of the system (Webber and Zbilut, 1994; Giuliani et al., 1998). In other words, the recurrence plot visualizes the distance matrix between the epochs (rows) of the embedding matrix and consequently the autocorrelation present in the signal at any possible scale. Since, in fact, distances are computed for all the possible pairs of epochs, the elements close to the main diagonal of the plot correspond to short-range correlations (the diagonal marks the identity in time), while long-range correlations correspond to points distant from the main diagonal. Besides the global impression given by the graphic appearance of the plot (Fig. 2), the indexes developed by Zbilut and Webber (1992; Webber and Zbilut, 1994; Trulla et al., 1996) allow for a quantitative description of the recurrence structure of the plot (see Methods). The computation of such indexes implies the setting of a recurrence threshold (radius) that in our case was set to 3 and a line length (minimum number of subsequent recurrent points to be considered as deterministic) that in our case was set to 3.

SVD-based descriptors

SVD spectral analysis (Mandell et al., 1997; Broomhead and King, 1986) is based on the computation of a correlation matrix (CM) among the columns of an EM. The first three eigenvectors of the CM are then computed by SVD and used to estimate the original series S (primary structure) by a linear least-squares fit. It is worth noting that the eigenvectors are extracted in order of explained variance, so that the first vectors are more representative of the correlated, signal-like part of the information contained in S (Broomhead and King, 1986). The estimated series, S', maintains all the general features of S filtered by noise, and are subsequently submitted to a maximum entropy, complex poles, power spectrum analysis (Selz et al., 1998). The result is parameterized in terms of the dominant frequency (FD) and used as a whole (with a 1000-point sampling). In order to derive quantitative synthetic parameters of the proteins' whole spectra, the 19 1000-point-long arrays corresponding to the digitized spectra were analyzed by means of oblique principal component analysis (OPC) (Anderberg, 1973; Harman, 1976). OPC analysis attempts to divide a set of variables (in our case digitized spectra) into nonoverlapping clusters such that each cluster can be considered as essentially unidimensional. The clusters are formed with the goal to include in each of them variables that are both highly intercorrelated (maximal variance explained by the cluster first component), and as much as possible independent from those included in the other clusters (minimal correlation between clusters). The procedure consists in a stepwise increase of the clusters' number until a user-specified criterion, involving either the percentage of variation accounted for by the global solution or the between-clusters relative independence, is fulfilled. In our case OPC generated a four-cluster partition of the spectra. The correlation coefficients between the original spectra and the center of mass of the clusters (SP1-SP4) were then used to quantitatively parameterize spectral information for the subsequent analysis.

LZ parameter

LZ transforms the representation of a numerical sequence into a binary format, substituting 1 for the higher-than-median values and 0 otherwise. This binary sequence is then analyzed trying to generate any subsequent configuration of 1's and 0's from the previous one using just two operators: copy and insert acting on the initial sequence. Starting from an initial random sequence, Sr, the procedure progressively reconstructs any predefined series: the number of instructions (copy plus insert operations) needed to produce the series, normalized by the number of instructions needed to generate the corresponding random sequence, constitutes the LZ index (Kaspar and Schuster, 1987).

Software

RQA was performed by using the original Webber and Zbilut programs that can be freely downloaded from http://homepages.luc.edu/~cwebber.

SVD analysis was performed by CDA (Chaos Data Analyzer) software by J.C. Sprott of the University of Wisconsin and George Rowlands of the University of Warwick. The same program was used for the computation of the LZ index and of the r value between adjacent residues.

All statistical routines were performed by SAS (SAS Institute Inc., NY, 1990), Version 6.2 (1998) for Unix systems.

    ACKNOWLEDGMENTS

This work has been partly supported by Italian M.U.R.S.T. (40% and 60%) grants to A. Colosimo and C.N.R. (Grant CTB CNR 96.03746.CT114-INV557)

    FOOTNOTES

Received for publication 31 December 1998 and in final form 20 October 1999.

Address reprint requests to Dr. Alfredo Colosimo, Department of Biochemical Sciences, University of Roma "La Sapienza," 00185 Roma, Italy. Tel.: 39-06-499-10957; Fax: 39-06-444-0062; E-mail: a.colosimo{at}caspur.it.

    REFERENCES
TOP
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
RESULTS
DISCUSSION
APPENDIX
REFERENCES

Biophys J, January 2000, p. 136-149, Vol. 78, No. 1
© 2000 by the Biophysical Society   0006-3495/00/01/136/14  $2.00



This article has been cited by other articles:


Home page
Biophys. JHome page
K. A. Selz, A. J. Mandell, M. F. Shlesinger, V. Arcuragi, and M. J. Owens
Designing Human m1 Muscarinic Receptor-Targeted Hydrophobic Eigenmode Matched Peptides as Functional Modulators
Biophys. J., March 1, 2004; 86(3): 1308 - 1331.
[Abstract] [Full Text] [PDF]


Home page
Protein Eng Des SelHome page
A. Giuliani, P. Sirabella, R. Benigni, and A. Colosimo
Mapping protein sequence spaces by recurrence quantification analysis: a case study on chimeric structures
Protein Eng. Des. Sel., October 1, 2000; 13(10): 671 - 678.
[Abstract] [Full Text] [PDF]


Home page
Protein Eng Des SelHome page
J. P. Zbilut, C. L. Webber Jr, A. Colosimo, and A. Giuliani
The role of hydrophobicity patterns in prion folding as revealed by recurrence quantification analysis of primary structure
Protein Eng. Des. Sel., February 1, 2000; 13(2): 99 - 104.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Giuliani, A.
Right arrow Articles by Colosimo, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Giuliani, A.