help button home button Biophys. J.
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Klepeis, J. L.
Right arrow Articles by Floudas, C. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Klepeis, J. L.
Right arrow Articles by Floudas, C. A.
Biophysical Journal 85:2119-2146 (2003)
© 2003 The Biophysical Society

ASTRO-FOLD: A Combinatorial and Global Optimization Framework for Ab Initio Prediction of Three-Dimensional Structures of Proteins from the Amino Acid Sequence

J. L. Klepeis and C. A. Floudas

Department of Chemical Engineering, Princeton University, Princeton, New Jersey

Correspondence: Address reprint requests to C. A. Floudas, Dept. of Chemical Engineering, Princeton University, Princeton, NJ 08544-5263. Tel.: 609-258-4595; Fax: 609-258-0211; E-mail: floudas{at}titan.princeton.edu.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 {alpha}-HELIX PREDICTION
 ß-SHEET PREDICTION
 RESTRAINTS AND LOOP MODELING
 TERTIARY STRUCTURE PREDICTION
 COMPUTATIONAL COMPLEXITY
 COMPUTATIONAL STUDIES
 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
The field of computational biology has been revolutionized by recent advances in genomics. The completion of a number of genome projects, including that of the human genome, has paved the way toward a variety of challenges and opportunities in bioinformatics and biological systems engineering. One of the first challenges has been the determination of the structures of proteins encoded by the individual genes. This problem, which represents the progression from sequence to structure (genomics to structural genomics), has been widely known as the structure-prediction-in-protein-folding problem. We present the development and application of ASTRO-FOLD, a novel and complete approach for the ab initio prediction of protein structures given only the amino acid sequences of the proteins. The approach exhibits many novel components and the merits of its application are examined for a suite of protein systems, including a number of targets from several critical-assessment-of-structure-prediction experiments.


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 {alpha}-HELIX PREDICTION
 ß-SHEET PREDICTION
 RESTRAINTS AND LOOP MODELING
 TERTIARY STRUCTURE PREDICTION
 COMPUTATIONAL COMPLEXITY
 COMPUTATIONAL STUDIES
 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Structure prediction of proteins from their amino acid sequences is regarded as a holy grail in the computational chemistry, molecular, and structural biology communities. The basic premise, according to the thermodynamic hypothesis, is that the native structure of a protein in a given environment corresponds to the global minimum free energy of the system. Despite pioneering contributions and decades of effort, the ab initio prediction of the folded structure of a protein remains a very challenging problem. The challenge is a result of the complex relationships between both the accurate modeling and sufficient conformational sampling of these protein systems.

To avoid the difficult task of full ab initio structure prediction of proteins, database-driven methods have received considerable attention. Database-driven methods differ fundamentally from physics-based ab initio approaches in that they utilize knowledge-based information from structural databases. For purposes of critical assessment of structure prediction (i.e., CASP; CASP meetings held at Asilomar, in California, every two years), existing approaches for protein structure prediction are commonly classified as 1), comparative modeling; 2), fold recognition; or 3), ab initio methods. Although these general classifications exist, the distinction for many protein structure prediction approaches has become blurred. This is especially true for the so-called ab initio approaches since several of such denoted ab initio approaches actually rely on structural and statistical databases. In addition, these methods typically build upon or borrow concepts from other techniques, and only a handful of approaches can truly be classified as ab initio according to a full physiochemical denotation.

To understand this important distinction, it is useful to examine the application of several approaches to the prediction of a three-dimensional structure for a target sequence. A complementary classification scheme, which parallels the classification of prediction approaches, may involve the following identification of targets as discussed at the recent CASP5 meeting: 1), template refinement for methods in which the homology and sequence alignments are not difficult and the main goal becomes positioning of side chains; 2), template modeling, which must handle difficulties in both the detection of correct homology, alignment of sequence, and generation of correct template; and 3), free modeling, which is a general class for targets for which no discernible template exists. The classification of targets in this way facilitates the discussion of methods for determining accurate protein structures.

Database information is used most directly in the application of traditional comparative modeling methods for template refinement. Generally, classic comparative modeling is applied when the similarity between the target and parent structures is extensive and the problem becomes the refinement of an easily (relatively) identifiable template. Comparative modeling methods identify database templates by employing BLAST or PSI-BLAST searches against sequence databases (Altschul et al., 1997Go). When the homology between the parent and target sequences is high, the sequence-structure alignments can be derived directly from the PSI-BLAST results (with possibly some manual adjustments around insertion and deletion regions). Other multiple sequence alignment methods can also be used to examine the confidence of the sequence structure alignments (Notredame et al., 2000Go; Thompson et al., 1994Go). It is interesting to note that the fundamental problems of template refinement, such as loop closure and side-chain refinement, remain a challenge (Fiser et al., 2000Go; Tosatto et al., 2002Go; Xiang et al., 2002Go). However, many comparative modeling and fold recognition methods neglect these details and employ well-known algorithms for model generation (Sali and Blundell, 1993Go) and side-chain rebuilding (Bower et al., 1997Go). Classic methods for comparative modeling remain largely unchanged and have been applied with success to template refinement targets at multiple rounds of the CASP competition (Venclovas, 2001Go). An extensive review of comparative modeling can be found elsewhere (Moult et al., 2001Go).

As the name suggests, fold recognition approaches attempt to address the difficulties with remote detection of the correct fold for a target sequence. Such template modeling is necessary when the structural template of a protein in the database has very subtle or no obvious sequence relation to the target protein. The ability to effectively apply a fold recognition technique relies on the fact that structure is more evolutionary conserved than sequence. However, for distantly related proteins, the amount of similar structure may be relatively small. Thus, successful prediction of the target structure hinges not only on the alignment of regions of similar structure but also on the prediction of dissimilar regions. By extending this template modeling philosophy it may be possible to identify a composite structure from the fragments of different templates, which can then be used in combination to approximate the target structure. Of course this is where the distinction between fold recognition and so-called ab initio methods becomes vague because, for targets requiring template modeling, virtually all ab initio methods utilize information from sequence and structural databases. For the limit in which the correct fold of the target protein is a new fold, fold recognition methods are often complemented with or replaced by certain statistical techniques and thereby evolve into an ab initio counterpart.

The first step in locating remote low sequence identity structural homologs from the protein structure databases usually involves a check for evolutionary-related sequences using methods such as PSI-BLAST (Altschul et al., 1997Go). Multiple profiles from sequences with the same fold can be used to compute local and global alignments with a position-specific scoring matrix. The use of such sequence profiles aims at identifying the most evolutionarily distant homologs. However, sequence considerations alone will fail to link possible structural homologs with no common sequence patterning. Threading, an extension of the classic fold recognition approaches, attempts to identify such linkages by focusing on the possibility of shared structural motifs in the absence of detectable sequence homology.

The utilization of structural information is a significant feature of many fold recognition approaches. One particular strategy involves the inclusion of secondary structure predictions into fold recognition algorithms. An important observation supporting this technique is that pairwise secondary structure similarity can exceed 80% for certain pairs of sequences exhibiting <10% sequence similarity. Of course, the success of these types of approaches then also relies on the accuracy of secondary structure prediction for the target sequence. Given that recent methods for secondary structure prediction have reached 75% accuracy (although not comparably nor consistently for helix and strand predictions; see Cuff et al., 1998Go; Jones, 1999bGo; Rost and Sander, 1994Go), the use of predicted secondary structure for template modeling has become a significant component of successful fold recognition approaches. The form of this information can be as a string identifying the three-state prediction of the target sequence (DiFrancesco et al., 1997Go; Koretke et al., 1999Go), or as a map of the segments of {alpha}-helical and ß-strands (Russell et al., 1996Go). Initial methods have relied on single secondary structure predictions for the target sequence; however, the finding that the combination of different predictions can create an exactly matched target has led to the development of methods for template alignment and modeling based on composite secondary structure predictions (An and Friesner, 2002Go).

Regardless of the incorporation of composite structural information for template alignment and modeling in fold recognition approaches, there remain problems with defining boundaries and insertions for the predicted models. For example, two proteins can share quite similar secondary structure motifs but differ dramatically in their overall three-dimensional structure. This is especially problematic when composite models are built, and is in part a consequence of the difference in length of the aligned sequence segments. In particular, a small sequence or segment may match very well within an overall longer template sequence, although the topology of the template may be very different and more complex due to the larger size of the overall template domain. This can be handled by penalizing sequence length differences, but will only be effective when combined with accurate domain parsing (Contreras-Moreira and Bates, 2002Go). A related problem is the correct identification of sequence insertions, which may represent domain boundaries or topological differences in the target sequence. Overall, the extent of success for fold recognition approaches is characterized by several features that belie some of the limitations of these techniques. As both the LIVEBENCH and CASP5 results demonstrate, the most successful predictions are often based on consensus prediction servers (metaservers) that attempt to select the best model according to the a ranking of independent fold recognition methods (Lundstrom et al., 2001Go). Beyond that, there are often concerted efforts and needs for human intervention to manually adjust the results of the template alignments and models generated by these database methods. These observations hint at the limits under which the database-dependent approaches operate. Other analyses of some of these methods can be found elsewhere (Moult et al., 2001Go; Murzin, 2001Go).

The blurry transition from fold recognition to ab initio approaches using structural databases can be understood by closer examination of the features of certain ab initio approaches. By definition, when considering new folds, ab initio approaches must not require the databases to possess proteins with global structural similarity to the new fold topology. Composite fold recognition approaches may also identify new fold topologies, although the designs of fold recognition techniques are better suited for matching of longer sequence fragments with modest insertion modeling. In practice, many ab initio approaches using databases represent the confluence of insertion modeling on a larger scale with fold recognition on a smaller scale. With this in mind, these approaches build template models through the extraction of fragments from general or tailored databases. This extraction process may involve the use of fragments with sequence similarity to fragments of the target sequence, or the use of fragments with structural similarity (secondary structure) to the predicted structure of fragments of the target sequence.

Fragment-based ab initio approaches have become extremely popular methods for exploiting database information. For small fragments the dependences on local conformational preferences are exploited, and the buildup of fragments has been implemented through a Monte Carlo procedure with a scoring function based on the Bayesian probability of sequence and structure matches (Simons et al., 1997Go, 1999Go). These ideas have been enhanced through the incorporation of secondary structure predictions as well as the addition of terms to favor the assembly of ß-sheets (Bonneau et al., 2001Go). Another novel method uses the hierarchical application of multiple sequence alignment and threading to produce template fragments from which starting lattice models are built (Skolnick and Kihara, 2001Go; Skolnick et al., 2001Go). In this case, the hierarchical component reflects the obvious link between fold recognition and ab initio modeling. Several other outstanding methods employ predicted secondary structure, and thereby focus on the assembly of these predefined elements of structure, which are themselves obtained by predictions of methods based on databases (Eyrich et al., 1999aGo,bGo; Xia et al., 2000Go). One such method illustrates the utility of the deterministic {alpha}BB global optimization method for prediction of tertiary structure models (Eyrich et al., 1999aGo; Standley et al., 1999Go).

Discussion of these approaches also highlights another point, which is that the examination of the importance of certain features among ab initio methods using databases is difficult because of the inherent variability with which these methods depend on the database information (as related to sequence and structure similarity to proteins in these databases). On the other hand, the physics-based ab initio approaches lend themselves to, and even demand for, identical application for all types of target sequences. It is only under these conditions that the success and general application of an ab initio approach can be accurately assessed.

From the physiochemical point of view, there is a clear distinction between a true ab initio approach and an ab initio approach relying on sequence and structural databases. In the case of a true ab initio approach, the fundamental and driving principles for understanding protein folding rely upon Anfinsen's observation that the native tertiary structure of a protein corresponds to the conformation which minimizes the free energy of the system. This free energy of the protein depends upon the different interactions within the protein system—ionic interactions, nonbonded interactions, hydrogen bonding, hydrophobic interactions, steric and torsional effects, protein-solvent interactions, and entropic effects. Each energetic effect can be modeled mathematically, using fundamental knowledge of electrostatics and physical chemistry. As a result, the free energy of a protein can be expressed as a function of the positions of the atoms making up that protein. The native conformation of the protein then corresponds to the set of atomic locations providing the minimum possible value of the free energy function. Mathematically, ab initio protein folding is treated as a global optimization problem—a problem in which the goal is to locate the values of a variable set (in this case, the locations of the atoms in the protein) that describe the minimum possible value of a certain function (in this case, the free energy function).

A series of pioneering ab initio methods have been based on the hierarchical prediction of protein structures using detailed physics-based models (Liwo et al., 1998Go, 1999Go, 1997aGo,bGo; Pillardy et al., 2001Go). These approaches begin with reduced models of an all-atom force field, and subsequent conversion and refinement of the coarse model to an all-atom model. Recent work has involved the inclusion of multibody terms to improve the modeling of ß-sheet formation. Unlike the database-driven methods, these physical (ab initio) models avoid the difficulties of coordinating insertions and deletions, as well as the associated unreliabilities in template alignment and side-chain positioning. One major advantage is that ab initio methods tend to be generic in how the physical processes of protein folding are modeled. This generality allows for not only the application of an ab initio approach to the structure prediction of any protein sequence, but may also lead to the understanding of the pathways that lead to the folded structure.

In this article a method true to the physiochemical perspective of ab initio structure prediction is presented. Underlying the structure prediction framework is the reconciliation of two competing views of protein folding. First, the predominance of local interactions in the fast formation of helical segments is used as a basis in detailed free energy calculations for subsequences of the overall target sequence. These calculations are used to identify initiation and termination sites of helices. In the second stage a model of hydrophobicity is employed in the simultaneous identification of ß-strands and prediction of a ß-sheet topology through the solution of a combinatorial optimization problem to maximize hydrophobic contacts. After deriving constraints on the system based on the previous stages and after additional calculations for loop segments, an overall three-dimensional structure is predicted using a combination of deterministic global optimization, stochastic optimization, and torsion-angle dynamics. This approach, known as ASTRO-FOLD, represents a combinatorial and global optimization framework for the ab initio prediction of three-dimensional structures of proteins. An overall schematic of the approach is provided in Fig. 1. The next four sections outline the stages of the overall approach, which are then followed by two sections describing the results for several benchmark systems including a variety of targets from recent CASP experiments.



View larger version (34K):
[in this window]
[in a new window]
 
FIGURE 1  Overall schematic of ASTRO-FOLD approach for three-dimensional structure prediction of proteins.

 

    {alpha}-HELIX PREDICTION
 TOP
 ABSTRACT
 INTRODUCTION
 {alpha}-HELIX PREDICTION
 ß-SHEET PREDICTION
 RESTRAINTS AND LOOP MODELING
 TERTIARY STRUCTURE PREDICTION
 COMPUTATIONAL COMPLEXITY
 COMPUTATIONAL STUDIES
 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
As a first step in the ASTRO-FOLD prediction framework, the principles of hierarchical folding are used to develop a method for the prediction of {alpha}-helices in protein systems (Klepeis and Floudas, 2002Go). The suitability of this method for {alpha}-helix determination is based on observations that nativelike segments of helical secondary structure form rapidly. The ability for helices to fold rapidly suggests that {alpha}-helix formation can occur during the earliest stages of protein folding. Such a mechanism for the helix-coil transition is based on local interactions which induce nucleation and propagation of the helix (Honig and Yang, 1995Go).

To isolate the prediction of {alpha}-helical elements to only local interactions, the overall protein is typically segmented into overlapping pentapeptide segments. In principle, longer peptide segments could be employed. As a minimum, a length of five residues is chosen because of the ability to observe the nucleation of a three-residue helix core within the fragment, which allows for the formation of a stabilizing backbone hydrogen bond. For a protein with N residues, the decomposition of the overall protein sequence into five residue segments corresponds to the analyses of a total of N-4 pentapeptides. For larger oligopeptides, such as heptapeptides or nonapeptides, the result would be either N-6 or N-8 subsequences, respectively. Theoretically, any fragments less than the total length of the target sequence could be simulated, although in practice the largest length should be the longest observed length for contiguous helices.

The helical propensity for an individual oligopeptide is determined through rigorous probability calculations using detailed atomistic level modeling. In the current implementation, the atomistic level modeling is based on the ECEPP/3 semiempirical force-field model (Némethy et al., 1992Go). For this force field, it is assumed that the covalent bond lengths and bond angles are fixed at their equilibrium values, so that the conformation is only a function of the independent torsional angles of the system. The total force field energy, Eforcefield, is calculated as the sum of electrostatic, nonbonded, hydrogen-bonded, and torsional contributions. The main energy contributions (electrostatic, nonbonded, hydrogen-bonded) are computed as the sum of terms for each atom pair (i,j) whose interatomic distance is a function of at least one dihedral angle.

With the amino-acid subsequences defined and the energy model selected, the ability to predict the preferred (or native) conformation for a given peptide translates into a global optimization problem in which the goal is to identify the global minimum free energy of the system. A wide variety of methods exist for tackling this problem, although generic guarantees for finding a global minimum energy conformation, even for pentapeptides, are not available. A deterministically-based global optimization method has been identified for solving these problems, although other stochastic methods have also been shown to be efficient (Floudas et al., 1998Go).

An additional consideration must be made when evaluating the entropic contributions for these systems. These entropic effects are necessary for calculating free energies (Klepeis and Floudas, 1999Go), which are the true gauges of conformational stability at equilibrium. A number of methods based on conformational sampling have been developed to approximate these effects. In this work, information regarding metastable states of the system is used in conjunction with the harmonic approximation to determine the accessibility of a given metastable state (Klepeis and Floudas, 1999Go). An attractive consequence of this approach is the identification of a significant ensemble of low free energy conformations, including the global minimum free energy structure. Rather than rely on the prediction of a single conformer (i.e., the global minimum free energy), the ensemble can be used to calculate occupational probabilities for representative conformational state of the system.

The analysis of the free energy of these oligopeptides therefore requires efficient methods for locating not only the global minimum energy structure but also large numbers of low energy conformers. A variety of methods have been used to find such stationary points on potential energy surfaces. For example, periodic quenching during a Monte Carlo or molecular dynamics trajectory can be used to identify local minima (Stillinger and Weber, 1988Go). In this work two algorithms are advocated for generating low energy ensembles for oligopeptide sequences. The first is based on modifications of a deterministic branch-and-bound algorithm, {alpha}BB (Adjiman et al., 1998aGo,bGo; Floudas, 2000Go; Klepeis et al., 1998Go, 2002Go; Klepeis and Floudas, 1999Go). The second is based on the principles of conformation space annealing (CSA) (Lee et al., 2000Go, 1998Go, 1997Go; Lee and Scheraga, 1999Go; Ripoll et al., 1998Go), an efficient yet stochastic method that does not provide the deterministic guarantees of the {alpha}BB approach. The implementation of the CSA-based method involves the combination of genetic algorithms and simulated annealing protocols (Lee et al., 1997Go). The details of these two approaches will be given in a later section.

Once an ensemble of low energy conformations (along with the global minimum energy conformation) has been identified for each oligopeptide, the free energy of each unique conformer is evaluated at 298 K using the harmonic approximation for entropic effects, and these free energy values are used to calculate individual occupational probabilities for each metastable state. Clustering of these states is based on the classification of the backbone torsion angles of the core residues. Specifically, the probabilities of conformers exhibiting identical Zimmerman codes for the core residues are summed and ranked to provide an ordered list of conformational propensities. The first iteration of the helix prediction approach is used to identify {alpha}-helical clusters for neutral oligopeptides. When the probability of the {alpha}-helical cluster (AAA for core residues of a helical pentapeptide) is greater than ~85–90% for more than three consecutive sets of core residues, the marked oligopeptides are considered for further analysis.

For those subsequences including ionizable residues, {alpha}-helix propensities are refined according to detailed electrostatic and ionization energy calculations obtained through the solution of the Poisson-Boltzmann equation. Specifically, for the set of potential {alpha}-helical pentapeptides containing ionizable residues, probabilities are recalculated for a subset of conformers using a combination of the vacuum free energy calculations at 298 K and polarization and ionization free energies at pH 7. The final {alpha}-helix propensity for each residue are assigned according to the average AAA probability.


    ß-SHEET PREDICTION
 TOP
 ABSTRACT
 INTRODUCTION
 {alpha}-HELIX PREDICTION
 ß-SHEET PREDICTION
 RESTRAINTS AND LOOP MODELING
 TERTIARY STRUCTURE PREDICTION
 COMPUTATIONAL COMPLEXITY
 COMPUTATIONAL STUDIES
 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Once the locations of {alpha}-helices have been predicted, the remaining residues are further analyzed to simultaneously identify and predict the location of ß-strands and ß-sheets. The procedure relies on hydrophobic information and the prediction of tertiary hydrophobic contacts to identify parallel and antiparallel ß-sheets (Klepeis and Floudas, 2003bGo), as well as the location of disulfide contacts. In addition, the approach can identify a rank-ordered list of competitive ß-sheet arrangements.

The principal feature of the ß-sheet prediction is the modeling of the desolvation forces that govern hydrophobic collapse. The importance of the hydrophobic collapse, rather than just hydrogen bonding forces, in the formation of ß-sheets has received growing theoretical support. One controversy regarding the validity of this hypothesis extends from the debate over hierarchical folding. In the case of hierarchical folding, it is believed that the ß-sheet nucleates at the hairpin turn and proceeds to form through a zippering model that includes stabilization through hydrogen-bond formation (Munoz et al., 1997Go). The alternative view promotes a model in which ß-sheet formation is driven by the hydrophobic collapse and is independent of hydrogen-bond formation. Recent simulations have demonstrated and supported the dominant role of hydrophobic forces in driving ß-sheet formation (Bryant et al., 2000Go; Dinner et al., 1999Go; Pande and Rokhsar, 1999Go; Westerberg and Floudas, unpublished results).

The modeling of hydrophobic interactions between ß-strand residues is used to formulate several problems that are globally optimized to produce a rank-ordered list of ß-sheet arrangements with decreasing hydrophobic interaction energies. Each formulation produces an integer linear programming (ILP) problem in which the hydrophobic contact energy must be maximized.

The first formulation, a residue-to-residue contact problem, is based on iterative solution to effectively build an optimal set of hydrophobic contacts. The set of hydrophobic residues

in the target sequence is identified, and the relative positions of these residues are used to define a position-dependent parameter, P(i). Hydrophobicity parameters (Hi) are assigned to each residue based on experimentally derived free energy of transfer of amino acids from organic solvents to water (Karplus, 1997Go; Lesser and Rose, 1990Go; Radzicka and Wolfenden, 1988Go). The interaction energy for a potential hydrophobic contact is assumed to be additive, and the possible formations of these contacts are represented by binary variables, yij.

The objective function for the ILP formulation becomes

(1)

(2)
For cystine-to-cystine contacts, an additional energy contribution is based on the addition of the interaction energy for all hydrophobic residues between the potential disulfide bonding pair. The contribution is normalized based on the length of the intervening segment.

(3)
A number of constraints are included in the formulation to provide physically consistent arrangements.

(4)
The above constraint requires that at least one contact must form in which at least seven residues fall between residues i and j.

(5)
Here Ni represents the total number of possible contacts for hydrophobic residue i. Initially this value is set equal to 1 for all residues so that each residue may participate in only one contact.

The next set of constraints define the allowable ß-sheet configurations. For the case of antiparallel ß-sheet formation, symmetric nonintersecting loops must be enforced. The following constraints provide the necessary conditions for antiparallel ß-sheet formation:

(6)
The set of conditions implies that the sum of the contact position parameters must be equal and cannot be intersecting. In addition, the constraint is not included if a potential contact, either ij or kl, represents the formation of a disulfide bridge. In this way, disulfide bridge contacts are allowed to form nonsymmetric, intersecting loops. For the case of parallel ß-sheet formation, the contacts must involve symmetric intersecting loops, which provides a similar set of constraints.

Finally, when disulfide bridge formation is allowed, an inequality constraint can be used to set the maximum allowable number of cystine-to-cystine contacts, such that the parameter NSS represents an upper bound on the possible number of disulfide bridges.

The resulting ILP must be solved to global optimality to identify the set of contacts which maximize the hydrophobic interaction energy defined by the objective function. In general, MILP formulations belong to the class of NP complete problems (Floudas, 1995Go; Nemhausser and Wolsey, 1988Go), and available codes typically employ a branch-and-bound technique to find the optimal integer solution through the identification of a sequence of related LP relaxations.

The optimal set of hydrophobic residue-to-residue contacts is generated by solving the ILP formulation iteratively, with the identification of disulfide bonding pairs being included during the first iteration. For each subsequent iteration, the global optimal solution can be identified along with a rank-ordered list of possible antiparallel or parallel ß-sheet configurations.

Three alternative formulations rely on the identification of potential ß-strands, rather than just individual hydrophobic residues. These potential ß-strands can be used to solve residue-to-residue contact or strand-to-strand contact ILP problems, or a combination of the two. The benefit of these approaches is the ability to identify the full ß-sheet configuration simultaneously. Except for those simple systems that can be studied in detail using the residue-to-residue contact formulation, the strand-to-strand formulations represent the preferred technique for ß-sheet predictions.

First, a protocol to identify potential ß-strands is applied (Klepeis and Floudas, 2003bGo). This set of postulated strands represents a superstructure since the protocol typically leads to an overprediction of the true number of ß-strands. In other words, the solution of the hydrophobic contact formulation may exclude certain postulated ß-strands from the predicted ß-sheet topology. Once the potential strands have been identified, each strand is assigned a position-dependent parameter, Q(si). The parameter is equal to the strand number by counting sequentially from the N-terminus to the C-terminus of the sequence. Each strand is also described by a start and end residue whose positions are denoted as Pbeg(si) and Pend(si), respectively, and the number of hydrophobic residues within the strand is defined by NH(si). A hydrophobicity value is assigned to each strand (Ssi), according to the average hydrophobicity of the hydrophobic residues in that strand:

(7)
The objective function for the strand-to-strand ILP formulation becomes that of maximizing the hydrophobicity by identifying the activation of those binary wsi,sj variables representing a particular strand-to-strand contact. The hydrophobicity gained by a particular contact is assumed to be an additive combination of the Ssi and Ssj hydrophobicity values.

(8)

(9)
A number of constraints are included in the formulation to provide physically realistic strand arrangements.

(10)
Here NSsi represents the total number of possible contacts for strand si. In general, this value is set equal to 2 for all strands, although the proximity of helices may require a reduction of this value to 1. Another general constraint can be used to turn off certain disallowed strand-to-strand contacts.

(11)
A particular strand-to-strand contact is disallowed when the DSsi,sj parameter is set to zero for that combination.

Additional sets of constraints can be used to impose a maximum value on the number of consecutive strand-to-strand matches and to disallow more than one strand-to-strand match from one side of strand si. In addition, to maintain physically meaningful configurations the formation of ß-sheet topologies with double intersecting strand-to-strand contacts are also disallowed (Klepeis and Floudas, 2003bGo).

A second formulation uses these strand definitions to solve the full ß-sheet configuration problem. The objective function is based on the residue-to-residue contact energies, as given by Formulation 1. The set of constraints defined by Eqs. 4 and 5 are included in the formulation, and the constraints enforcing the formation of antisymmetric and symmetric loops are relaxed to include only individual strand pairings. The constraints included in the strand configuration problem are also enforced in this formulation. Finally, connections between strands and residue contacts are provided by the following set of constraints.

(12)
These constraints link the yij and wsi,sj variables and can be complemented by additional constraints that serve to enhance the performance of the solver through tightening of the feasible search space. The linked problem can be thought of as a bilevel optimization problem in which the inner problem represents the maximization of given strand-to-strand contact registration, while these solutions are then suitable for an outer optimization problem that maximizes the hydrophobicity of the overall strand-to-strand arrangement.

A third and final formulation combines the objective functions from both the residue-to-residue and strand-to-strand formulations. This allows both contact energies to influence the prediction of the ß-sheet configuration through the following objective function:

(13)
As this formulation combines both strand and residue contact terms, the constraint set is identical to that of the previous formulation. It is important to note that multiple global solutions are also possible and that these ß-sheet configurations may include both different strand-to-strand contacts as well as identical strand-to-strand contacts with varying strand-to-strand registrations. A full set of competitive solutions can be identified using integer cut constraints and iterative solution of the ILP formulations.


    RESTRAINTS AND LOOP MODELING
 TOP
 ABSTRACT
 INTRODUCTION
 {alpha}-HELIX PREDICTION
 ß-SHEET PREDICTION
 RESTRAINTS AND LOOP MODELING
 TERTIARY STRUCTURE PREDICTION
 COMPUTATIONAL COMPLEXITY
 COMPUTATIONAL STUDIES
 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Before progressing to the final stage of the ASTRO-FOLD approach, which involves the prediction of the tertiary structure of the full protein (Klepeis and Floudas, 2003aGo), the structure prediction problem is formulated based on the development of atomic distance and dihedral-angle restraints derived from the {alpha}-helix and ß-sheet prediction results. In its final form, this formulation requires the use of constrained nonlinear global optimization techniques. This problem is solved through a combination of the deterministically-based {alpha}BB global optimization approach, stochastic global optimization, and molecular dynamics in torsion-angle space (Klepeis and Floudas, 2003aGo).

First, dihedral-angle bounds are assigned according to the predicted structure class, {alpha}-helical, ß-strand, or unassigned, for each residue. The corresponding bounds on the values of the backbone torsion angles are given in Table 1. For {alpha}-helices, C{alpha}-C{alpha} distances can be restrained between each pair of i and i + 4 residues, which anticipates the formation of the {alpha}-helix hydrogen bond network. In a similar fashion, C{alpha}-C{alpha} restraints can be developed for residues in opposing strands of a ß-sheet fold, so that hydrogen bond formation between strands is enforced. The ß-strand restraints include both hydrophobic residues and intervening residues over the full extent of the matching strands in the ß-sheet. The corresponding upper and lower distance limits are given in Table 2.


View this table:
[in this window]
[in a new window]
 
TABLE 1  Dihedral angle bounds, lower and upper, for {alpha}-helix and ß-strand residues

 

View this table:
[in this window]
[in a new window]
 
TABLE 2  C{alpha}-C{alpha} distance bounds, lower and upper, for {alpha}-helix and ß-strand residues

 
Additional restraints can be generated through analysis of the unassigned loop residues in the protein sequence. Loops, those segments which connect elements of secondary structure in the protein fold, are often exposed or surfacial features of the protein structure. As a result, these segments can be important for defining differences in binding and activity characteristics for a fold family because functional variability is often related to the structural differences in the exposed regions.

Exploring the conformational space of a loop segment is a difficult undertaking given the large structural variability often observed in the loop regions of experimentally determined protein structures. For example, it is not unusual for loop fragments with the identical seven- or nine-residue sequence to exhibit highly dissimilar structures. These difficulties are compounded by the typically low sequence identities among the loop segments, which makes the application of comparative modeling techniques often inaccurate.

In this work, the prediction of loop conformations is treated in a manner similar to physics-based ab initio protein structure prediction (Klepeis and Floudas, 2003aGo). The goal of the approach is to aid in the ab initio treatment of the overall protein folding problem using only minimal information regarding the structure of the residues that flank the loop segment. Most importantly, an inherent assumption common to existing loop models—that is, the requirement of fixing the flanking and terminal loop residue positions—is not imposed.

Loop modeling follows the identification of secondary structure elements and ß-sheet topology from the first two stages of the ASTRO-FOLD approach (Klepeis and Floudas, unpublished data). In the absence of threading such predictions onto a structural template, these results merely define the sequence (and not structural) location of loop fragments between two consecutive segments of secondary structure. The applied optimization approaches aim at deriving additional restraints, such as distance and torsional restraints, through systematic analysis of detailed all-atom, free energy simulations. The first method relies on the ability of local conformational preferences to define loop conformations and, in this spirit, a series of free energy calculations are performed for the overlapping oligopeptides (including portions of the flanking units) defining the loop segment. Structural probabilities are built for the dihedral angle space and used to define reduced bounds for subsequent simulations of larger portions of the loop fragment. This methodology culminates in a free energy simulation of the entire loop fragment. A second approach begins by dissecting the distance space over larger segments of the loop fragment such that longer range loop interactions are included. Distance domains are enforced via nonconvex constraints in the torsion space and simulations are conducted for all combinations of the dissected domains. Consolidation of the simulation results is used to define appropriate distance bounds to be imposed during a simulation of the overall loop fragment. In their final stages, both approaches provide energy-based predictions from models of the entire loop segment, although the progression of each approach is based on the emphasis of different structural descriptors.

These approaches play an important role in restraining and focusing the conformational searches used in treating the overall three-dimensional structure prediction problem. In particular, these restraints take the form of reduced {phi}- and {psi}-domains as well as internal interatomic distance restraints for those residues connecting consecutive elements of secondary structure. The bounds are extracted from the set of low free energy conformers identified for oligopeptides representing these loop residue segments.


    TERTIARY STRUCTURE PREDICTION
 TOP
 ABSTRACT
 INTRODUCTION
 {alpha}-HELIX PREDICTION
 ß-SHEET PREDICTION
 RESTRAINTS AND LOOP MODELING
 TERTIARY STRUCTURE PREDICTION
 COMPUTATIONAL COMPLEXITY
 COMPUTATIONAL STUDIES
 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Once appropriate bounds on dihedral angles and interatomic distances have been determined, a combination of the deterministically-based {alpha}BB global optimization algorithm, stochastic global optimization, and molecular dynamics in torsion-angle space is used to solve a constrained tertiary structure prediction problem (Klepeis and Floudas, 2003aGo). The basic formulation begins as

(14)
Here the {phi} represents the variables used to describe protein conformations in the torsion-angle space, while {phi}L and {phi}U indicate the lower and upper bounds on these variables (which include both backbone and side-chain degrees of freedom). The energy function, Eforcefield({phi}), is based on the atomistic level ECEPP/3 force field. This detailed energy modeling greatly increases the complexity of the objective function, as does the transformation from Cartesian to internal coordinates. However, one advantage of working in dihedral angle space is the reduction in the dimensionality of the independent variable set.

To solve the formulation given by Eq. 14 powerful global optimization-based search techniques must be utilized. Although many such methods have been developed, the major limitation is that the majority of the methods depend strongly on heuristics and initial point selection. To circumvent such difficulties, deterministically-based global optimization approaches can be employed. One such powerful method, the {alpha}BB global optimization approach (Adjiman et al., 1996Go, 1997Go, 1998aGo,bGo; Androulakis et al., 1995Go), has been extended to identifying global minimum energy conformations of peptides. This particular branch-and-bound method provides guarantees of convergence to the global minimum of nonlinear optimization problems with twice-differentiable functions (Floudas, 1997Go, 2000Go). The application of the {alpha}BB to the minimization of potential energy functions was first introduced for microclusters (Maranas and Floudas, 1992Go, 1993Go), and small acyclic molecules (Maranas and Floudas, 1994aGo,bGo). The {alpha}BB approach has also been applied to general constrained optimization problems (Adjiman et al., 1996Go, 1998aGo,bGo; Androulakis et al., 1995Go). In more recent work, the algorithm has been shown to be successful for isolated peptide systems using the ECEPP/3 potential energy model (Androulakis et al., 1997Go; Maranas et al., 1996Go), and including several solvation models (Klepeis et al., 1998Go; Klepeis and Floudas, 1999Go). {alpha}BB-based global optimization techniques have also been applied to NMR-type structure prediction problems (Eyrich et al., 1999aGo; Klepeis et al., 1999Go; Standley et al., 1999Go).

The {alpha}BB global optimization approach effectively brackets the global minimum by developing converging sequences of lower and upper bounds. These bounds are refined by iteratively partitioning the initial domain. Upper bounds on the global minimum are obtained by local minimizations of the original nonconvex problem. Lower bounds belong to the set of solutions of the convex lower bounding problems, which are constructed by augmenting the objective and constraint functions through the addition of separable quadratic terms. The lower bounding formulation can be expressed in the following manner:

(15)
In this formulation, variable bounds are specific to the subdomain for which the lower bounding functions are constructed. Lforcefield refers to the convex representation of the objective function, as given by

(16)
The {alpha}-parameters represent nonnegative parameters which must be greater or equal to the negative one-half of the minimum eigenvalue of the Hessian of Eforcefield over the defined domain. Rigorous bounds on the {alpha}-parameters can be obtained through a variety of approaches (Adjiman et al., 1998aGo,bGo; Adjiman and Floudas, 1996Go; Hertz et al., 1999Go; Maranas and Floudas, 1994aGo). The overall effect of these terms is to overpower the nonconvexities of the original terms by adding the value of 2{alpha} to the eigenvalues of the Hessian of Eforcefield.

The same {alpha}BB principles can also be applied to more general formulations involving nonlinear constraint sets. Traditionally, restraints in the form of torsion-angle and interatomic-distance bounds (as derived in the previous stages) are formulated as unconstrained energy minimization problems. The lower and upper bounds on the torsion angles and interatomic distances are imposed through the use of weighted penalty terms that are minimized to zero while subsequently minimizing an overall energy objective. However, when reformulating these restraints as independent nonlinear constraints, both the Edihedral and Edistance penalty terms are removed from the target function, leaving only Eforcefield:

(17)
As before, i = 1,...,N{phi} corresponds to the set of dihedral angles, {phi}i, with and representing lower and upper bounds on these dihedral angles. In general, the lower and upper bounds for these variables reflect full rotation although the reduced bounds (derived from the previous stage of the ASTRO-FOLD approach) are equally suitable. are reference parameters for the NCON constraints. The set of constraints are completely general, and can represent either the full combination of distance restraints or smaller subsets of the defined distance restraints. The maximum and average violation for each structural element can be controlled separately through the use of individual constraints, while an overall constraint including all distance can also be enforced to limit the violations over the entire structure.

These constraints, through reduction of the feasible search space, help to correct any discrepancies in the energy model, as well as focus the efforts of the global optimization algorithm. However, the highly nonlinear form of the potential energy function, coupled with the nonconvexities of the constraints, substantially increases the difficulty in identifying low energy feasible points for the {alpha}BB approach. To alleviate these difficulties a relatively fast torsion-angle dynamics module is implemented as a preprocessing step to the local minimization of the upper bounding problem. As a result, the performance of the {alpha}BB approach is improved significantly through the rapid determination of good approximations to feasible low energy minima.

Once the ability to formulate and effectively solve the upper and lower bounding problems has been established, the next step is to modify these problems for the next iteration. This is accomplished by successively partitioning the initial domain into smaller subdomains. For the protein conformation problems, it has been found that an effective partitioning strategy involves bisecting the same variable dimension across all nodes at a given level. To ensure nondecreasing lower bounds, the hyper-rectangle to be bisected is chosen by selecting the region which contains the infimum of the minima of lower bounds. A nonincreasing sequence for the upper bound is found by solving the nonconvex problem locally and selecting it to be the minimum among all conformers in the upper bound list. If the single minimum of Lforcefield for any hyper-rectangle is greater than the current upper bound, the global minimum cannot exist within this region and the entire subdomain can be deleted from the list of searchable regions (fathoming step). The computational requirement of the {alpha}BB algorithm depends on the number of variables (global) on which branching occurs.

The use of the {alpha}BB method is also amenable to the integration of other stochastic or heuristic search techniques for enhancing and improving the identification of low energy conformations. In other words, the solution of the upper bounding problem (i.e., the original nonconvex problem) is not limited to the use of nonlinear local minimization techniques. Such methods are known as hybrid global optimization methods and the ultimate goal of these methods is to combine the beneficial features of two or more algorithms. In particular, novel classes of hybrid global optimization methods, termed alternating hybrids, have been recently introduced for application as a tool in treating the protein structure prediction problems (Klepeis et al., 2003aGo,bGo). These new optimization methods take the form of hybrids between the deterministic global optimization algorithm, the {alpha}BB (Adjiman et al., 1998aGo,bGo; Floudas, 2000Go; Klepeis et al., 1998Go, 1999Go, 2002Go; Klepeis and Floudas, 1999Go), and a stochastically-based method, conformational space annealing (CSA) (Lee et al., 1997Go, 1998Go, 2000Go; Lee and Scheraga, 1999Go; Ripoll et al., 1998Go). The {alpha}BB method, as a theoretically proven global optimization approach, exhibits consistency, as it guarantees convergence to the global minimum for twice-continuously differentiable constrained nonlinear programming problems, but can benefit from enhanced computational efficiency. On the other hand, the independent CSA algorithm is highly efficient, though the method lacks theoretical guarantees of convergence. Furthermore, both the {alpha}BB method and the CSA method are found to identify ensembles of low-energy conformers, an important feature for determining the true free energy minimum of the system.

The CSA algorithm itself is a hybrid global optimization algorithm that combines genetic and simulated annealing algorithms (Kirkpatrick et al., 1983Go). The fundamental precept of the CSA algorithm is to anneal within the conformation space to converge upon the global minimum energy conformer. Initially, the entire conformation space is accessible, but as the algorithm proceeds the search region collapses around the lowest energy conformers. The process for reducing the search space is based on the concepts of simulated annealing, while the search for low energy conformers is influenced by the ideas of genetic algorithms.

To implement the CSA, the search begins with a bank of Nbank conformers generated randomly throughout the torsion-angle space. The separation between these conformers is quantified according to their pairwise deviation in torsion-angle space,

(18)
where N{phi} is the number of torsion angles in the protein, and is the value of torsion angle k in conformer i. At the start of the algorithm each conformer in the bank represents a region of conformation space with radius Dcut,o and centered on the point. The value of Dcut,o is calculated to be the average deviation among all conformers:

(19)
where Dij is the deviation given by Eq. 18. The value of Dcut is annealed according to an exponential schedule to reduce Dcut,o to a small value after a set number of iterations.

As the value of Dcut is annealed the conformation space is also searched according to a set of heuristics. These heuristics involve the alteration of variable values in a seed conformation according to random and crossover-based criteria. The mutated conformers are subjected to local minimization and then rejected or inserted into the current bank of active conformers with the stipulation that the size of the bank remains unchanged. Several scenarios are possible following the local minimization of the mutated conformer. In all cases, if the energy is above the highest energy conformer in the bank, the conformer is rejected and requires no further analysis. Otherwise, the value for Dij is calculated for the combinations of the mutated conformers and those conformers in the bank. If the value of Dij is greater than the current value for Dcut for all conformers, then the conformers are inserted in the bank and the highest energy conformer is removed to maintain the size of the bank. If the conformer falls within a defined region, then the conformer can be used to redefine the region if the energy is lower than the conformer already describing this region.

At each iteration a set number of mutations are performed before further reducing the value of Dcut. Each set of mutations is performed for one seed conformation taken from the bank. The seed conformations are chosen so that each conformer is not selected more than once until all conformers in the bank have been selected at least once. This process is repeated for a set number of iterations. In total, four types of mutations are performed, including both random and crossover-based substitutions using different sets of independent and connected variables.

With regard to the {alpha}BB/CSA hybrids, the algorithm alternates between large blocks of {alpha}BB iterations and large blocks of CSA iterations. In other words, for the hybrid global optimization approach, the {alpha}BB and CSA portions of the algorithm are not integrated (that is, one iteration of {alpha}BB is not followed by one iteration of CSA), but rather the two sides of the hybrid take turns dominating the behavior of the algorithm. First, the {alpha}BB branch-and-bound tree is set up, and the {alpha}BB portion of the algorithm is run for Nbank iterations. At each iteration, one of the local minima of the potential energy function generated in solving the upper-bounding problem is stored in a queue. Once Nbank iterations are complete, the queue is emptied into the initial CSA bank. At this point, the {alpha}BB algorithm shuts down temporarily, and the CSA portion of the hybrid takes over. One conformation is withdrawn at random from the CSA bank to serve as the seed conformation, and the offspring generated from this conformation are subjected to local minimization and entered into the bank (if applicable). This process is repeated for NCSA iterations (with restrictions on the choice of a seed to ensure that every element in the bank is chosen once as a seed before any element is chosen a second time).

At this point, if the global optimum has not been located, the CSA portion of the algorithm shuts down temporarily, and control returns to the {alpha}BB portion. This proceeds through Nadd more iterations to produce Nadd more local minima. These minima are then added to the CSA bank, thus increasing its size by Nadd. Control then returns to the CSA portion of the algorithm, and the cycle repeats. Care is taken to ensure that all of the new minima added to the CSA bank are used as seed conformations at least once before any of the conformers that were already in the bank are again selected as seed conformations.


    COMPUTATIONAL COMPLEXITY
 TOP
 ABSTRACT
 INTRODUCTION
 {alpha}-HELIX PREDICTION
 ß-SHEET PREDICTION
 RESTRAINTS AND LOOP MODELING
 TERTIARY STRUCTURE PREDICTION
 COMPUTATIONAL COMPLEXITY
 COMPUTATIONAL STUDIES
 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
The application of the ASTRO-FOLD approach to the structure prediction of medium-sized proteins is made possible through the use of a distributed computing environment. Since the prediction approach is hierarchical in nature, the parallelized implementation is customized both to the type of problem being solved and the algorithm being employed at each stage of the approach.

First, the prediction of {alpha}-helices requires the decomposition of the full protein into smaller segments, and this approach is amenable to parallelization due to the independent nature of the analyses for the overlapping subsequences. The major computational expense for the {alpha}-helix prediction stage involves the calculation of accurate solvation and ionization energies for a subset of the overall oligopeptides (Klepeis and Floudas, 2002Go). The computational effort depends strongly on the number of times the Poisson-Boltzmann equation solver must be invoked, and is a function of the number of ionizable groups. In all cases, a finite difference solution to the Poisson-Boltzmann equation is implemented through the DELPHI package (Gilson and Honig, 1988Go; Honig and Nicholls, 1995Go). The calculation of solvation energy, namely the polarization energy of the neutral system, requires two calls to DELPHI, and is independent of the number of titratable groups in the system. For each ionizable group six additional DELPHI calls are required, which correspond to four reaction field calculations and two permanent dipole calculations. Two of the six calculations involve only single residue conformations, rather than the full protein system. When multiple titratable groups are present, four additional DELPHI calls must be made for each pair of ionizable groups. The computational effort in terms of the number of required DELPHI calls varies according to 2(N + 1)2, where N is the number of ionizable groups (Klepeis and Floudas, 2002Go). These calculations are performed in parallel using a variation on a ring-based architecture system; in other words, processors communicate with nearest neighbors without the need for a master processor. After evenly dividing the static workload among all processors, load balancing is achieved through nearest-neighbor communication between processors in the ring structure. Termination is detected once the appropriate token has completed a full cycle through all idle processors.

For a given oligopeptide, the set of DELPHI calls is performed for an ensemble of the lowest free energy conformers. In the typical case of 5000 oligopeptide conformers, the total CPU requirement is on the order of ~0.5 wallclock hour on a 128 parallel processor machine. However, the computational requirements are dependent on the specific size and charge distribution of the system. When considering systems with multiple titration sites, the computational cost increases significantly. For a two-titratable group oligopeptide, ~1.5 wallclock hours are needed on the same parallel machine, whereas ~3 wallclock hours are required for a system with three ionizable groups. These values can be used to estimate the total time to calculate free energies for oligopeptides of larger protein systems. For the in vacuo free energy calculations the total wallclock time will always be ~6 h as long as the number of processors exceeds the total number of oligopeptides, since each oligopeptide can then be processed sequentially. When considering the DELPHI calculations, although the number of calculations varies linearly, the actual computational time varies according to the number of residues with titratable side chains and their occurrence in the set of oligopeptides. For a 100-residue sequence with average composition, a total of two wallclock days on a 128 parallel processor is needed to complete the ab initio prediction of helices.

After the identification of helices using rigorous free energy calculations, a combinatorial optimization problem is solved as part of the second stage of the ASTRO-FOLD approach. In particular, this second stage, which involves the prediction of ß-sheet configurations, necessitates the optimization of several integer linear optimization, ILP, problems. In this work, a powerful software package, CPLEX (CPLEX, 1997Go), is used to identify globally optimal ILP solutions. The computational effort requires ~1–2 h on a single processor, while simultaneously providing a rank-ordered list of competitive ß-sheet topologies.

The final and computationally most expensive part of the ASTRO-FOLD approach is the solution of the constrained tertiary structure problem to produce a complete prediction of the three-dimensional structure of the target sequence. The nature of the alternating hybrid algorithm is also especially suitable to parallelization. Because the {alpha}BB and CSA elements of the algorithm are essentially totally separate, several plausible parallelization schemes present themselves, and these have been tested elsewhere (Klepeis et al., 2003aGo). The basic premise involves setting up two "master" nodes—an {alpha}BB master and a CSA master node. The slave nodes are then dedicated to one of these two masters—that is, a given slave node would perform either only {alpha}BB iterations, or only CSA iterations. Under this setup, while the CSA nodes are carrying out generation of trail conformations and bank updates, the {alpha}BB nodes could be working independently to solve enough lower-bounding problems to prepare for the next required update of the CSA bank. The parallel hybrid algorithm provides an efficient method for tackling the full structure prediction problem within the framework of the {alpha}BB deterministic global optimization approach.

Several factors affect the computational requirements for solving this constrained tertiary structure prediction problem. Most notable are the form of the energetic model, the form of the constraint functions, and the number of global variables for the system. For a system of ~100 residues, the tertiary structure prediction phase, solved using the parallel hybrid algorithm described above, requires three wall clock days of CPU time on a 128 parallel processor distributed computing environment.


    COMPUTATIONAL STUDIES
 TOP
 ABSTRACT
 INTRODUCTION
 {alpha}-HELIX PREDICTION
 ß-SHEET PREDICTION
 RESTRAINTS AND LOOP MODELING
 TERTIARY STRUCTURE PREDICTION
 COMPUTATIONAL COMPLEXITY
 COMPUTATIONAL STUDIES
 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
The ASTRO-FOLD methodology has been validated for several benchmark protein systems, as well as an ex post facto analysis for several CASP3 and CASP4 targets. A detailed analysis of blind CASP5 prediction results is presented in the subsection below. In addition, a significantly larger number of examples have been studied independently for both the {alpha}-helix (Klepeis and Floudas, 2002Go) and ß-sheet (Klepeis and Floudas, 2003bGo) stages of the approach. Tables 3 and 4 provide a comparison of predicted and experimental results for eight computational studies after complete application of the ASTRO-FOLD approach.


<