
Biophys J, December 2002, p. 3113-3125, Vol. 83, No. 6

Addition of Missing Loops and Domains to Protein Models by X-Ray Solution Scattering

Maxim V. Petoukhov,*† Nigel A. J. Eady,‡ Katherine A. Brown,‡ and Dmitri I. Svergun*§

*European Molecular Biology Laboratory, Hamburg Outstation, D-22603 Hamburg, Germany; †Physics Department, Moscow State University, 117234 Moscow, Russia; ‡Department of Biological Sciences, Centre for Molecular Microbiology and Infection, Imperial College of Science, Technology and Medicine, London SW7 2AY, United Kingdom; and §Institute of Crystallography, Russian Academy of Sciences, 117333 Moscow, Russia


    ABSTRACT

Inherent flexibility and conformational heterogeneity in proteins can often result in the absence of loops and even entire domains in structures determined by x-ray crystallographic or NMR methods. X-ray solution scattering offers the possibility of obtaining complementary information regarding the structures of these disordered protein regions. Methods are presented for adding missing loops or domains by fixing a known structure and building the unknown regions to fit the experimental scattering data obtained from the entire particle. Simulated annealing was used to minimize a scoring function containing the discrepancy between the experimental and calculated patterns and the relevant penalty terms. In low-resolution models where interface location between known and unknown parts is not available, a gas of dummy residues represents the missing domain. In high-resolution models where the interface is known, loops or domains are represented as interconnected chains (or ensembles of residues with spring forces between the Cα atoms), attached to known position(s) in the available structure. Native-like folds of missing fragments can be obtained by imposing residue-specific constraints. After validation in simulated examples, the methods have been applied to add missing loops or domains to several proteins where partial structures were available.


    INTRODUCTION

Protein function is related not only to the three-dimensional arrangement of polypeptide chains but also to their intrinsic mobility. Techniques such as x-ray crystallography and NMR can yield high-resolution information regarding the positions of individual atomic groups within a macromolecule, but flexible or disordered regions may appear to be absent. Such regions may be of significant functional importance and can include, for example, a loop in an enzyme active site, a receptor-binding motif, or an antigenic epitope. In large multi-domain proteins, inherent flexibility between domains can prevent successful crystallization, and in these cases crystallographic or NMR data may be limited to studies of individual domains produced by genetic or proteolytic methods. However, it is apparent that complementary approaches are required to analyze the structure of intact multi-domain proteins and assemblies, especially in view of recent initiatives aimed at large-scale expression and purification of proteins for subsequent structure determination (e.g., Edwards et al., 2000).

One such approach is small-angle x-ray scattering (SAXS) (Feigin and Svergun, 1987). This technique can yield structural information about macromolecules in solution with proteins from as small as 6 kDa (e.g., Sayers et al., 1999) to large macromolecular complexes such as the ribosome (Svergun and Nierhaus, 2000). SAXS patterns result from an average of the scattering from the entire ensemble of randomly oriented particles in the sample, and this lowers the resolution of the method. Nevertheless, in contrast to x-ray crystallographic analysis where flexible regions of a structure may result in poorly interpretable electron density, solution scattering patterns are sensitive to these disordered regions, yielding information about their average conformation. The SAXS method can thus provide (at low resolution) information complementary to that of crystallography and NMR. Solution scattering also permits one to construct models of multi-domain proteins and macromolecular complexes from high-resolution structures of individual domains or subunits. Rigid body modeling, successfully used by different groups (Ashton et al., 1997; Krueger et al., 1997; Svergun et al., 1997, 1998a, 2000), is an effective way to characterize complex structures. The methods to compute solution scattering patterns accurately from atomic models and rapidly evaluate scattering from complex particles are now well established (Svergun, 1994, 1995, 1998b). These methods coupled with three-dimensional display and manipulation programs allow interactive or automated searches of positional parameters to fit the experimental scattering from the complex (Konarev et al., 2001; Kozin and Svergun, 2000).

In cases where portions of a macromolecule or complex lack a three-dimensional structure description, alternative methods are required (beyond rigid-body refinement) to generate a model. For example, the known part of the structure (either high- or low-resolution model) can be fixed, and missing portions, such as the disordered loops or domains, can then be modeled to fit the experimental scattering data obtained from the intact particle. In the present paper, a recently proposed dummy-residues model (Svergun et al., 2001a) is further developed to construct the algorithms for complementing high- and low-resolution partial models of protein structures. Simulated annealing is used to minimize a scoring function containing the discrepancy between the experimental and calculated patterns and relevant penalty terms. Where applicable, information about the primary and secondary structure is used to restrain the model and to provide native-like conformations of the missing structural fragments.

After validation in simulated examples, the potential of this approach has been explored using three model systems. These methods have first been applied to develop models for small contiguous loops (~30-35 residues), which are absent in the crystal structures of a Drosophila motor protein (Kozielski et al., 1999) and the R2 protein of Escherichia coli ribonucleotide reductase (Logan et al., 1996). Second, reconstruction of an entire missing domain has been attempted using experimentally observed scattering data from a fusion protein. This fusion consists of Schistosoma japonicum glutathione S-transferase (GST) and E. coli dihydrofolate reductase (DHFR). Although a few examples exist of crystal structures of GST fused with relatively small fusion fragments (Lim et al., 1994; Ware et al., 1999; Zhang et al., 1998), proteins of interest are often isolated by proteolytic digestion of the linker region (Nagai and Thogersen, 1984) before structural analysis. Therefore, little is known about the conformation of the linker region or the structure of a globular protein fused to GST. Using SAXS and the reconstruction methods, new information regarding domain and linker orientations in this popular fusion system is presented. In combination, analysis of these model systems provides an insight into the possible scope of these reconstruction techniques, from small loops to multi-domain assemblies.


    MATERIALS AND METHODS

Dummy-residues approach

The scattering intensity I(s) from a dilute monodispersed solution of macromolecules is an isotropic function depending on the modulus of the scattering vector s = (s, Ω), where Ω is the solid angle in reciprocal space, s = (4π/λ) sin θ, λ is the wavelength, and 2θ is the scattering angle. The x-ray scattering intensity is proportional to the scattering from a single particle averaged over all orientations and can be expressed as:

$$I(s) = \left\langle \left| A_a(\mathbf{s}) - \rho_s A_s(\mathbf{s}) + \delta\rho_b A_b(\mathbf{s}) \right|^2 \right\rangle_{\Omega}, \tag{1}$$

where A_a(s), A_s(s), and A_b(s) are, respectively, the scattering amplitudes from the particle in vacuo, from the excluded volume, and from the hydration shell. The electron density of the bulk solvent, ρ_s, may differ from that of the hydration shell, ρ_b, yielding a nonzero contrast for the shell δρ_b = ρ_b − ρ_s (Svergun et al., 1995).

In the method (Svergun et al., 2001a), the protein structure is represented by an ensemble of dummy residues (DRs) centered at the positions of virtual Cα atoms. A simulated annealing (SA) procedure (Kirkpatrick et al., 1983) is used to find DR positions by fitting the experimental data and simultaneously providing a chain-compatible structure. This is achieved by minimizing a scoring function E(r) = χ² + Σ α_i P_i(r), where χ² is the discrepancy between the experimental and calculated scattering patterns and the penalties P_i(r) restrain the solution to ensure a chain-compatible arrangement of the DRs. The weights α_i are selected in such a way that the total penalty Σ α_i P_i(r) yields a significant contribution (~10-50%) to E(r) at the end of the minimization. It has been demonstrated that the DR representation adequately represents solution scattering patterns up to a resolution of 0.5 nm and that the method allows an ab initio restoration of domain structures of proteins (Svergun et al., 2001a). In the present paper, the DR approach is extended to build missing domains or loops around a known part of the protein structure. Depending on the information available, four models are considered, differing by the representation of the missing portion of the structure and by the set of constraints.

Computation of the scattering intensity

The scattering intensity I_mod(s) from a protein model consisting of N DRs positioned at r_i is calculated as described (Svergun et al., 2001a) using Debye's formula (Debye, 1915):

$$I_{\mathrm{mod}}(s) = \sum_{i=1}^{K} \sum_{j=1}^{K} g_i(s)\, g_j(s)\, \frac{\sin s r_{ij}}{s r_{ij}}, \tag{2}$$

where K = N + M and M is the number of dummy solvent atoms in the hydration shell of the particle, g_i(s) is the form factor of the ith residue or solvent atom, and r_ij = |r_i − r_j| is the distance between the ith and jth points. To generate the hydration shell of thickness Δr = 0.3 nm, the most distant residue is found along each direction of a quasi-uniform angular grid of M ≈ N vectors, and a solvent atom with the form factor g_i(s) = (4πr_i²/M)Δr δρ_b is placed 0.5 nm outside the protein. Following previously published methods (Svergun et al., 1995, 1998b), the contrast of the hydration shell is taken to be 30 e/nm³. Solvent-corrected spherically averaged scattering intensities from the amino acid residues are weighted according to their abundance in proteins, yielding an average residue form factor ⟨g(s)⟩ (Fig. 1 a). The DR form factor g_i(s) = ⟨g(s)⟩ is taken when using models that do not account for the primary structure of the protein. To account for the internal residue structure, a correction factor ⟨c(s)⟩ is introduced. More than 100 proteins with known structures were taken from the Protein Data Bank (PDB) (Bernstein et al., 1977), and the scattering intensities of their full-atom representations, I_full(s), were computed by the program CRYSOL (Svergun et al., 1995); the average ratio ⟨c(s)⟩ = ⟨I_full(s)/I_mod(s)⟩ was then evaluated over the ensemble (Fig. 1 b). As demonstrated in Svergun et al. (2001a), the function ⟨c(s)⟩I_mod(s) yields an adequate representation of the scattering pattern of a protein up to a resolution of 0.5 nm.
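As an illustration of Eq. 2, the double sum can be evaluated directly for a small set of scatterers. The sketch below is not the authors' implementation: the function name `debye_intensity` and its arguments are hypothetical, the form factors are assumed to have been evaluated at the given s beforehand, and no hydration shell or correction factor is generated.

```python
import math

def debye_intensity(coords, form_factors, s):
    """Debye sum I(s) = sum_i sum_j g_i g_j sin(s r_ij) / (s r_ij).

    coords       -- list of (x, y, z) positions in nm (residues and solvent atoms)
    form_factors -- list of g_i(s) values already evaluated at this s
    s            -- momentum transfer in nm^-1
    """
    total = 0.0
    k = len(coords)
    for i in range(k):
        # Self term: sin(x)/x tends to 1 as x tends to 0.
        total += form_factors[i] ** 2
        for j in range(i + 1, k):
            x = s * math.dist(coords[i], coords[j])
            sinc = math.sin(x) / x if x > 1e-12 else 1.0
            # Factor 2 accounts for both (i, j) and (j, i) in the double sum.
            total += 2.0 * form_factors[i] * form_factors[j] * sinc
    return total
```

The cost grows as K², which is acceptable for models of a few hundred dummy residues but is one reason practical programs may use faster evaluation schemes.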



FIGURE 1   (a) Form factors of the 20 amino acid residues (· · ·) and an average form factor (------); (b) Average correction factor ⟨c(s)⟩ for the intensity computation using individual (--- --- ---) and dummy residues (------).

For the DR models accounting for the primary structure, form factors of the individual residues (Fig. 1 a) are computed by averaging the form factors of different conformations in the PDB files. The correction factor calculated as described above for the case of DRs is presented in Fig. 1 b. Simulations performed on proteins with known structures demonstrated, not unexpectedly, that the use of individual residues yields an even better accuracy than the dummy residues.

The discrepancy χ² between the calculated curve and experimental data I_exp(s) measured at n points s_j, j = 1, ... n is computed as:

$$\chi^2 = \frac{1}{n-1} \sum_{j=1}^{n} \left[ \frac{\mu \langle c(s) \rangle I_{\mathrm{mod}}(s_j) - I_{\mathrm{exp}}(s_j)}{\sigma(s_j)} \right]^2, \tag{3}$$

where σ(s_j) are the experimental errors and µ is an overall scaling coefficient.
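A minimal sketch of Eq. 3 follows. The text does not state how µ is chosen; the code assumes the weighted least-squares value, which is a common choice but an assumption here. The function name `scale_and_chi2` is hypothetical, and `i_mod` is expected to hold the corrected model intensity ⟨c(s)⟩I_mod(s_j).

```python
def scale_and_chi2(i_mod, i_exp, sigma):
    """Return (mu, chi2) for Eq. 3.

    i_mod -- corrected model intensities <c(s)> * I_mod(s_j)
    i_exp -- experimental intensities I_exp(s_j)
    sigma -- experimental errors sigma(s_j)
    mu is taken as the weighted least-squares scale (an assumption of this sketch).
    """
    num = sum(m * e / sg ** 2 for m, e, sg in zip(i_mod, i_exp, sigma))
    den = sum((m / sg) ** 2 for m, sg in zip(i_mod, sigma))
    mu = num / den
    n = len(i_exp)
    chi2 = sum(((mu * m - e) / sg) ** 2
               for m, e, sg in zip(i_mod, i_exp, sigma)) / (n - 1)
    return mu, chi2
```

With this choice of µ, a model curve that differs from the data only by a constant factor gives χ² = 0.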

Simulated annealing protocol

For all the models described here, SA (Kirkpatrick et al., 1983) is used for global minimization of the scoring function. The main idea of this method is to perform random modifications of the system (i.e., of the current residue arrangement), always moving to configurations that decrease the scoring function E(r) but also occasionally moving to configurations that increase E(r). The probability of accepting the latter moves decreases in the course of the minimization (the system is cooled). At the beginning, the temperature is high and the changes are almost random, whereas at the end a configuration with (nearly) minimum E(r) is reached. The algorithm is implemented in its faster simulated quenching (Ingber, 1993; Press et al., 1992) version as follows. 1) The known part of the structure is loaded and moved to the origin and remains fixed during minimization. The rest of the structure is then generated depending on the model used. A value of the goal function E(r) is computed and a high starting temperature T_0 is selected. 2) A random modification (move from r to r') of the system is performed (specific ways of generation and modification of the system are considered below). 3) Positions of the solvent atoms accounting for the border solvent layer are updated if necessary and a difference ΔE = E(r') − E(r) is computed. If ΔE < 0, the move is accepted; if ΔE ≥ 0, the move is accepted with a probability exp(−ΔE/T). 4) Steps 2 and 3 are repeated a sufficient number of times N_T to equilibrate the system, and the temperature is then lowered (T' = ηT, η < 1). The system is cooled until no improvement in E(r) is observed.
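The four steps above can be sketched as a generic loop. This is an illustration under stated assumptions, not the code of the programs described here: the names `simulated_quenching`, `energy`, `propose`, `moves_per_temp`, and `max_stalls` are hypothetical, the default parameter values are arbitrary, and "no improvement" is interpreted as several consecutive temperature steps without a new best score.

```python
import math
import random

def simulated_quenching(state, energy, propose, t0=1.0, eta=0.9,
                        moves_per_temp=200, max_stalls=5, rng=None):
    """Minimize energy(state) with a geometric cooling schedule T' = eta * T.

    energy  -- callable returning the score E of a state
    propose -- callable (state, rng) -> new candidate state
    Stops after max_stalls consecutive temperatures without a new best score.
    """
    rng = rng or random.Random(0)
    e = energy(state)
    best_state, best_e = state, e
    t, stalls = t0, 0
    while stalls < max_stalls:
        improved = False
        for _ in range(moves_per_temp):
            candidate = propose(state, rng)
            delta = energy(candidate) - e
            # Downhill moves are always accepted; uphill moves are accepted
            # with probability exp(-delta / T).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                state, e = candidate, e + delta
                if e < best_e:
                    best_state, best_e = state, e
                    improved = True
        t *= eta  # cool the system
        stalls = 0 if improved else stalls + 1
    return best_state, best_e
```

For example, `simulated_quenching(20, lambda x: (x - 3) ** 2, lambda x, rng: x + rng.choice((-1, 1)))` walks an integer toward the minimum of a parabola. In a real application the state would be the residue arrangement and `propose` one of the model-specific moves described below.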

Types of dummy-residue models

Free dummy-residues model

This model is an extension of the original DR model (Svergun et al., 2001a) and can be used when the location of the interface between the known and missing portions of the structure is unknown. This usually takes place when a low-resolution model represents the known portion, although high-resolution models can also be used. The known part of the structure is fixed and the unknown part is represented as a gas of free DRs within a search volume (the latter is a sphere with a diameter equal to the maximum size Dmax of the entire particle). The numbers of residues in the fixed and variable parts (N0 and ND, respectively) are assumed to be available a priori, whereas the value of Dmax can be determined from the solution scattering pattern of the particle. The scoring function is:
$$E(\mathbf{r}) = RF^2 + \alpha_{\mathrm{dst}} P_{\mathrm{dst}} + \alpha^{1}_{\mathrm{con}} P^{1}_{\mathrm{con}} + \alpha^{2}_{\mathrm{con}} P^{2}_{\mathrm{con}} + \alpha_{\mathrm{gyr}} P_{\mathrm{gyr}}$$

$$RF^2 = (n-1)\,\chi^2 \left[ \sum_{j=1}^{n} \left( \frac{I_{\mathrm{exp}}(s_j)}{\sigma(s_j)} \right)^2 \right]^{-1} \tag{4}$$
Here and below, a normalized R-factor RF will enter the scoring function instead of the discrepancy to facilitate the choice of the SA parameters and the penalty weights.
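Substituting Eq. 3 into the definition of RF² in Eq. 4 makes the normalization explicit. The expression below is only a restatement of those two equations:

$$RF^2 = \frac{\sum_{j=1}^{n} \left[ \left( \mu \langle c(s) \rangle I_{\mathrm{mod}}(s_j) - I_{\mathrm{exp}}(s_j) \right) / \sigma(s_j) \right]^2}{\sum_{j=1}^{n} \left[ I_{\mathrm{exp}}(s_j) / \sigma(s_j) \right]^2}$$

The factor (n − 1) cancels, so RF² is a dimensionless relative misfit that does not scale with the number of data points or with the absolute size of the experimental errors. This is presumably what makes a single set of SA parameters and penalty weights usable across data sets.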

The first penalty ensuring a protein-like distribution of the nearest neighbors in the model has the form introduced in Svergun et al. (2001a):
$$P_{\mathrm{dst}} = \sum_{k} \left[ W(R_k) \left( N_{\mathrm{mod}}(R_k) - \langle N(R_k) \rangle \right) \right]^2, \tag{5}$$

where ⟨N(R_k)⟩ is a histogram of the average number of Cα atoms in a 0.1-nm-thick spherical shell surrounding a given Cα atom as a function of the shell radius for 0 < R_k < 1 nm observed for real proteins. N_mod(R_k) is such a histogram for the model, and the weights W(R_k) are inversely proportional to the variations of ⟨N(R_k)⟩ (Fig. 2 a). The summation in Eq. 5 is performed over the DRs in the variable portion of the model.
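A sketch of how the neighbor histogram and the penalty of Eq. 5 might be computed is given below. The reference histogram and the weights come from the analysis of real proteins (Fig. 2 a) and are not reproduced here, so they are passed in by the caller. The function names are hypothetical, and for simplicity the histogram is averaged over all residues supplied rather than over the variable portion only.

```python
import math

def neighbor_histogram(coords, shell=0.1, r_max=1.0):
    """Average number of neighbors per residue in consecutive spherical shells.

    coords -- list of (x, y, z) positions in nm
    shell  -- shell thickness in nm; r_max -- largest radius considered in nm
    """
    nbins = round(r_max / shell)
    counts = [0] * nbins
    for i, p in enumerate(coords):
        for j, q in enumerate(coords):
            if i != j:
                k = int(math.dist(p, q) / shell)
                if k < nbins:
                    counts[k] += 1
    return [c / len(coords) for c in counts]

def neighbor_penalty(model_hist, reference_hist, weights):
    """Eq. 5: sum over shells of [W * (N_mod - <N>)]^2."""
    return sum((w * (m - r)) ** 2
               for w, m, r in zip(weights, model_hist, reference_hist))
```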



FIGURE 2   Histograms of the average distributions of nearest neighbors (a) and of the Cα-Cα-Cα bond angles (b) in a polypeptide chain.

The second penalty requires the model to be interconnected so that each residue has at least one neighbor at a distance of 0.38 nm:
$$P_{\mathrm{con}} = \ln(N / N_l), \tag{6}$$
where Nl is the length of the longest interconnected fragment of the model. This penalty is applied twice: once for the entire structure and separately for the variable part.
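The longest interconnected fragment in Eq. 6 can be found with a standard graph traversal. The sketch below is an illustration only: the function name is hypothetical, and two residues are treated as linked when their separation is within a small tolerance `tol` of 0.38 nm, which is an assumption since the text does not specify a tolerance.

```python
import math
from collections import deque

def disconnectivity_penalty(coords, bond=0.38, tol=0.01):
    """Eq. 6: ln(N / N_l), N_l being the size of the largest linked fragment.

    Two residues count as linked if |distance - bond| <= tol (assumed rule).
    """
    n = len(coords)
    seen = [False] * n
    longest = 0
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search over the fragment containing `start`.
        seen[start] = True
        queue = deque([start])
        size = 0
        while queue:
            i = queue.popleft()
            size += 1
            for j in range(n):
                if not seen[j] and abs(math.dist(coords[i], coords[j]) - bond) <= tol:
                    seen[j] = True
                    queue.append(j)
        longest = max(longest, size)
    return math.log(n / longest)
```

The penalty is zero for a fully interconnected model and grows as residues detach from the main fragment.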

The third penalty restrains the space occupied by the variable part whose radius of gyration R_g can approximately be estimated as R_g^est ~ 3 N_D^(1/3). The penalty has the form:

$$P_{\mathrm{gyr}} = \left( \frac{R_g^{\mathrm{mod}} - R_g^{\mathrm{est}}}{R_g^{\mathrm{est}}} \right)^2, \tag{7}$$

where R_g^mod is the radius of gyration of the variable portion.
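Eq. 7 can be sketched as follows. The text gives the estimate 3 N_D^(1/3) without a unit; the helper `estimated_rg` assumes it is in ångström (that is, 0.3 nm times the cube root of N_D), which is an interpretation and not something stated in the source. All function names are hypothetical, and all residues are given equal weight.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of equally weighted points from their centroid."""
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in coords) / n)

def estimated_rg(n_residues, scale_nm=0.3):
    """R_g estimate scale * N_D^(1/3); scale_nm = 0.3 is an assumed unit conversion."""
    return scale_nm * n_residues ** (1.0 / 3.0)

def gyration_penalty(coords, rg_est):
    """Eq. 7: squared relative deviation of the model R_g from the estimate."""
    rg_mod = radius_of_gyration(coords)
    return ((rg_mod - rg_est) / rg_est) ** 2
```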

Initially, the missing DRs are randomly positioned inside the search volume but outside the fixed portion of the model. A single SA step involves relocation of a randomly selected residue to a point at a distance of 0.38 nm from another randomly selected residue in the variable portion. The penalties force the variable portion to condense to a compact chain-compatible model, and the procedure (implemented in the program CREDO) is best suited for generating low-resolution models of missing domains without using information about primary and secondary structures.

Dummy-residues model with spring forces between neighbors

This approach builds chains of DRs attached to given point(s) or residue(s) in the known part of the structure. In contrast to the previous model, it is explicitly required that the ith DR be separated by 0.38 nm from the (i + 1)th one. The scoring function is:
$$E(\mathbf{r}) = RF^2 + \alpha_{\mathrm{dst}} P_{\mathrm{dst}} + \alpha_{\mathrm{spr}} P_{\mathrm{spr}} + \alpha_{\mathrm{gyr}} P_{\mathrm{gyr}}, \tag{8}$$

where P_dst and P_gyr are the same penalties as in Eqs. 5 and 7, but instead of the disconnectivity penalty (P_con), spring potentials P_spr between the neighboring DRs are used:

$$P_{\mathrm{spr}}(\mathbf{r}) = \frac{1}{N D_{\max}^2} \sum_{i=1}^{N-1} \left( \left| \mathbf{r}(i+1) - \mathbf{r}(i) \right| - 0.38 \right)^2 \tag{9}$$

The first (and, if appropriate, the last) DR(s) in the variable part are required to contact the interface point(s) between the known and variable parts of the structure. The initial approximation of the variable part is randomly generated inside a sphere with radius of gyration R_g^est ~ 3 N_D^(1/3) centered at the interface point (or between the two interface points). The SA step involves moving a randomly selected residue to an arbitrary point at a distance of 0.38 nm from one of two adjacent residues. The variable part converges to a quasi-Cα chain attached to the given point(s) in the known structure. This algorithm (program CHADD) is useful for adding missing loops or terminal portions to high-resolution models but can also be used for missing domain restoration.
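The spring penalty of Eq. 9 and the associated move can be sketched as below. This is an illustration, not the CHADD code: the function names are hypothetical, and the new position is drawn uniformly on a sphere of radius 0.38 nm around the chosen neighbor by normalizing a Gaussian vector, which is one possible reading of "an arbitrary point".

```python
import math
import random

def spring_penalty(chain, d_max, bond=0.38):
    """Eq. 9: summed squared deviations of consecutive distances from `bond`,
    normalized by N * D_max^2. Coordinates and d_max are in nm."""
    n = len(chain)
    total = sum((math.dist(chain[i + 1], chain[i]) - bond) ** 2
                for i in range(n - 1))
    return total / (n * d_max ** 2)

def random_point_at_distance(center, rng, bond=0.38):
    """Random point on the sphere of radius `bond` around `center`."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:  # guard against a degenerate zero vector
            return tuple(center[k] + bond * v[k] / norm for k in range(3))
```

Moving residue i to `random_point_at_distance(chain[i - 1], rng)` restores the ideal distance to one neighbor, while the spring term penalizes the stretched distance to the other.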

Individual-residues model with spring forces between neighbors

This model is similar to the previous one but accounts for the primary structure of the protein. Not only is the scattering intensity computed using the individual form factors, but also residue-specific information is formulated as additional penalties to further restrain the solution and to generate native-like folds of the missing loop/domain. The scoring function has the form:
$$E(\mathbf{r}) = RF^2 + \alpha_{\mathrm{dst}} P_{\mathrm{dst}} + \alpha_{\mathrm{spr}} P_{\mathrm{spr}} + \alpha_{\mathrm{hyd}} P_{\mathrm{hyd}} + \alpha_{\mathrm{bur}} P_{\mathrm{bur}} + \alpha_{\mathrm{eng}} P_{\mathrm{eng}} + \alpha_{\mathrm{vol}} P_{\mathrm{vol}} + \alpha_{\mathrm{ang}} P_{\mathrm{ang}} + \alpha_{\mathrm{dih}} P_{\mathrm{dih}}, \tag{10}$$
where the penalties Pdst and Pspr are as discussed above and the additional terms contain the residue-specific information.

The two penalties accounting for the hydrophobicity of the residues are from Huang et al. (1995):
$$P_{\mathrm{hyd}} = -\frac{1}{n} \sum_{j} \left( H_j - C_j h_j / N_j \right) \tag{11}$$

$$P_{\mathrm{bur}} = -\frac{1}{n} \sum_{j} B_j \tag{12}$$

The sums (Eqs. 11 and 12) run over the hydrophobic residues in the entire model. The penalty of Eq. 11 promotes contacts between the hydrophobic residues. Here, n is the total number of hydrophobic residues; C_j and H_j are the numbers of all contacts and nonhydrophilic contacts of the jth residue, respectively (contacting distance is assumed to be 0.73 nm); and N_j and h_j are the total number and the number of nonhydrophilic residues, respectively, except for the (j − 1)th, jth, and (j + 1)th residues. The penalty of Eq. 12 forces the hydrophobic residues to be buried in the interior of the protein. Here, B_j is the number of all neighbors of the jth residue except for the (j − 2)th, (j − 1)th, (j + 1)th, and (j + 2)th residues (the neighboring distance equals 1 nm).

The penalty Peng uses knowledge-based potentials to minimize the empirical free energy of the model. The interaction potentials between residues in proteins can be computed from the analysis of the PDB structures (Miyazawa and Jernigan, 1999; Sippl, 1990; Thomas and Dill, 1996). The total energy of the model is calculated as the sum over all inter-residue contacts, and the penalty has the form:
$$P_{\mathrm{eng}} = \frac{1}{N} \sum_{i} \sum_{j<i-1} U_{ij}, \tag{13}$$

where the summation is performed over the residues separated by less than 0.73 nm and the potentials U_ij are tabulated in Miyazawa and Jernigan (1999) and Thomas and Dill (1996).

In keeping with the low resolution of the solution scattering data, the DR model describes the Cα backbone only, and the excluded volume effects between the backbone atoms due to the penalties P_dst and P_spr do not account for the side chains. To compensate for this, pseudo-Cβ atoms representing the side chains are introduced following the lolly-loop model (Aszodi et al., 1995). The direction of the ith Cα-Cβ vector depends on the positions of the (i − 1)th, ith, and (i + 1)th Cα atoms. The Cα-Cβ distance and the van der Waals radius r_βi of the pseudo-Cβ atom depend on the type of the ith residue. The additional excluded volume effect is taken into account by minimizing the averaged cross-volume of all spheres representing the Cα (van der Waals radius r_α = 0.19 nm) and pseudo-Cβ atoms:

$$P_{\mathrm{vol}} = \frac{1}{V} \sum_{i} \sum_{j} \left( V^{\alpha\beta}_{ij} + 0.5\, V^{\beta\beta}_{ij} \right), \tag{14}$$

where V is the total excluded volume of the protein, and V_ij^αβ and V_ij^ββ are the cross-volumes of the Cα or Cβ atom belonging to the ith residue with the Cβ atom belonging to the jth residue, respectively.

The two other penalties impose restrictions on the distribution of bond and dihedral angles of the model chain. It is well known (Irbaeck et al., 1997; Levitt, 1976) that the Cα-Cα-Cα bond angles in a protein backbone have a specific distribution. Fig. 2 b presents a histogram of the distribution of Cα-Cα-Cα angles ⟨F(γ_k)⟩ averaged over more than 100 protein structures deposited in the PDB. Similar to the neighbors penalty, P_dst (Eq. 5), the bond angle penalty is computed as:

$$P_{\mathrm{ang}} = \sum_{k} \left[ \frac{F_{\mathrm{mod}}(\gamma_k) - \langle F(\gamma_k) \rangle}{0.1 \max\left( \langle F(\gamma_k) \rangle,\ 0.02 \right)} \right]^2, \tag{15}$$

where F_mod(γ_k) is the histogram of the current model (bin step equals 5°).

Fig. 3 displays a histogram of the distribution of Cα-Cα-Cα-Cα dihedral angles versus Cα-Cα-Cα angles (quasi-Ramachandran plot) computed by averaging the distributions for the above PDB models. Following Kleywegt (1997), the histogram can be split into four areas: core (index = 1), additionally allowed (2), generously allowed (3), and disallowed (4). A plausible model should display bond angles and dihedrals concentrated in the core and additionally allowed regions. Each pair of Cα-Cα-Cα-Cα dihedral angles versus Cα-Cα-Cα bond angles in the model is attributed to a cell in the quasi-Ramachandran plot, and the sum:

$$P_{\mathrm{dih}} = \frac{1}{N} \sum_{i=2}^{N-2} \left( \mathrm{index}(i) - 1 \right)^2, \tag{16}$$

gives the penalty for improper dihedrals.
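A sketch of Eq. 16 is given below. The region map of Fig. 3 is not reproduced in the text, so it is represented by a caller-supplied function `region_index(bond_angle, dihedral)` returning 1 to 4; that function, like the other names here, is hypothetical. Pairing the bond angle at residue i with the dihedral defined by residues i − 1 to i + 2 is also an assumption, chosen because it matches the summation limits.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def bond_angle(p0, p1, p2):
    """Angle at p1 (degrees) between the directions to p0 and p2."""
    u, v = _sub(p0, p1), _sub(p2, p1)
    c = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four consecutive points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))

def dihedral_penalty(chain, region_index):
    """Eq. 16 with a caller-supplied region lookup (1 = core ... 4 = disallowed)."""
    n = len(chain)
    total = 0.0
    for i in range(1, n - 2):  # 0-based; corresponds to i = 2 .. N-2 in Eq. 16
        angle = bond_angle(chain[i - 1], chain[i], chain[i + 1])
        torsion = dihedral(chain[i - 1], chain[i], chain[i + 1], chain[i + 2])
        total += (region_index(angle, torsion) - 1) ** 2
    return total / n
```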



FIGURE 3   Distribution of the Cα backbone angles and dihedrals. The core area is shown in black, the additionally allowed regions in dark gray, the generously allowed in light gray, and the disallowed region in white. Sampling rates of the bond angles and of the dihedrals equal 5° and 10°, respectively.

The generation and modification of the model during SA are the same as in the previous section. The algorithm, implemented in the program GLOOPY, allows native-like configurations to be attributed to the missing loops or domains.

Folding of a model chain composed of individual residues

The most straightforward protein model consisting of Cα atoms is an interconnected polypeptide chain. This model does not require a connectivity constraint, and the secondary structure elements, if known, can be easily introduced. The chain model is less flexible than the gas of residues, and this increases the chances of being trapped in an incorrect conformation during minimization. As indicated in Svergun et al. (2001a), attempts at ab initio fitting of x-ray solution scattering data starting from a random-walk Cα chain led to a manifold of native-like models with different fold topologies. The chain model is, however, very useful as a means of restoring the conformation of shorter fragments such as missing loops.

The missing loop(s) are attached to the appropriate residue(s) in the known part of the structure, initially as random-walk chain(s) with a step of 0.38 nm between joints. If a specific portion of the loop is known to form an α-helix or β-sheet (e.g., from secondary structure prediction), an idealized secondary structure template of the appropriate length is inserted. The scoring function is the same as in Eq. 10 but without the spring potentials P_spr. Two types of moves, local and global, are used to modify the variable part of the model while maintaining the distances between the adjacent residues and preserving the secondary structure. In both cases, a residue is selected at random among those not belonging to the secondary structure elements. A local move involves random rotation of a residue around the axis drawn through its two neighbors. For a global move, made after every N_D local moves, a second residue in the variable domain is selected that also does not belong to the secondary structure elements. The part of the chain between the selected residues is rotated by an arbitrary angle around the axis drawn through these two residues.
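Both move types reduce to rotating one or more points about an axis through two fixed residues, which leaves every distance to those two residues unchanged. The sketch below uses Rodrigues' rotation formula; it illustrates the geometry only and is not taken from the CHARGE program. The function name and argument names are hypothetical.

```python
import math

def rotate_about_axis(point, a, b, angle):
    """Rotate `point` by `angle` (radians) about the line through a and b."""
    axis = [b[k] - a[k] for k in range(3)]
    norm = math.sqrt(sum(c * c for c in axis))
    k_hat = [c / norm for c in axis]          # unit vector along the axis
    v = [point[k] - a[k] for k in range(3)]   # point relative to the axis origin
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    k_cross_v = (k_hat[1] * v[2] - k_hat[2] * v[1],
                 k_hat[2] * v[0] - k_hat[0] * v[2],
                 k_hat[0] * v[1] - k_hat[1] * v[0])
    k_dot_v = sum(k_hat[k] * v[k] for k in range(3))
    # Rodrigues: v' = v cos(t) + (k x v) sin(t) + k (k . v)(1 - cos(t))
    rotated = [v[k] * cos_t + k_cross_v[k] * sin_t
               + k_hat[k] * k_dot_v * (1.0 - cos_t) for k in range(3)]
    return tuple(a[k] + rotated[k] for k in range(3))
```

For a local move, `a` and `b` would be the two neighbors of the selected residue. For a global move, the same function would be applied with one common angle to every residue between the two selected ones.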

The algorithm, implemented in the program CHARGE, is aimed at restoring the conformation of the missing loop(s) and is most useful if information about their secondary structure is available.

Materials

Oligonucleotides were synthesized by Sigma-Genosys (Pampisford, UK). Media reagents were from Merck (Lutterworth, UK). Isopropyl β-D-thiogalactoside was from Genesys (London, UK). Taq polymerase chain reaction (PCR) Ready-To-Go beads, precast native and SDS polyacrylamide gels, protein molecular weight standards, Coomassie Brilliant Blue, glutathione Sepharose 4B, and Factor Xa were from Amersham Pharmacia Biotech (St. Alban's, UK). DNA molecular weight standards were from Gibco BRL (Paisley, UK). DNA Qiaquick gel extraction and Miniprep kits were from Qiagen (Crawley, UK). Restriction enzymes and T4 DNA ligase were from New England Biolabs (Hitchin, UK). Protein Microcon, Centricon, and Centriprep devices were from Millipore (Watford, UK). Bradford assay reagent and disposable plastic columns were from Biorad (Hemel Hempstead, UK). All other chemicals were from Sigma-Aldrich (Poole, UK).

Construction of plasmid

Plasmid pGEX-DHFR is a pGEX-5X-1 derivative (Amersham). The plasmid encodes Schistosoma japonicum GST, a 10-residue C-terminal linker peptide (containing a protease cleavage site) and the Escherichia coli folA-encoded DHFR. The folA gene was PCR amplified from a genomic DNA preparation of E. coli K-12 cells using primer 1 (5'-GAGTGGATCCCTATCAGTCTGATTGCGGCG-3'), which contains a BamHI restriction site upstream of the second codon (ATC), and primer 2 (5'-CTATCTCGAGTTACCGCCGCTCCAGAAT-3'), which incorporates a unique XhoI restriction site downstream of the TAA stop codon. One cycle of 96°C for 5 min followed by 35 cycles of 96°C for 1 min, 50°C for 1 min, and 72°C for 1.5 min, linked to a final cycle of 72°C for 10 min, generated a 500-bp PCR fragment encoding the E. coli K-12 folA gene. This fragment was gel purified, digested with BamHI and XhoI, and ligated into the BamHI-XhoI restriction sites of the pGEX-5X-1 vector to produce plasmid pGEX-DHFR. Initial clones were obtained by heat-shock transformation into E. coli strain BL21-CODONPLUS(DE3)-RIL. The presence of the folA gene was confirmed by restriction digestion of the transformed construct and by DNA sequencing with an ABI/Perkin-Elmer 377 Automated Sequencer (Perkin-Elmer Applied Biosystems, Norwalk, CT) using the dideoxy method with BigDye Terminator Ready Reaction Kits (Perkin-Elmer).

Protein purification

GST and GST-DHFR proteins were purified as follows. One liter of 2XYT (1.6% (w/v) tryptone, 1.0% (w/v) yeast extract, and 0.5% (w/v) NaCl in distilled water) containing 100 µg/ml ampicillin was inoculated with 10 ml of an overnight culture of E. coli BL21-CODONPLUS(DE3)-RIL transformed with pGEX-DHFR. Cells were grown at 37°C with shaking until a cell density corresponding to an OD600 of 0.6 was reached. Isopropyl β-D-thiogalactoside was then added to a final concentration of 1 mM, and growth was allowed to continue for another 4 h. Centrifugation of the culture at 8000 × g for 20 min yielded a cell pellet that was resuspended in PBS (0.01 M KH2PO4/K2HPO4 buffer, 0.0027 M KCl, 0.137 M NaCl, pH 7.4; Sigma-Aldrich). Cells were lysed by sonication with three 30-s bursts at full power, and insoluble material was removed by centrifugation at 12,000 × g for 45 min.

Six milliliters of 50% (w/v) glutathione Sepharose 4B (Amersham) in PBS was added to the cell lysate supernatant (typically, 15 ml), which was then incubated at 4°C for 1.5 h with rotation. The material was transferred into a plastic column (Biorad) and washed seven times with 10 ml of PBS. GST, produced from pGEX-5X-1, was eluted by resuspending the glutathione Sepharose in 5 ml of 10 mM glutathione in 50 mM Tris-HCl, pH 8.0, and collecting the flow through from the column after incubation for 10 min at room temperature. This was repeated twice more to retrieve all the GST protein. GST-DHFR, produced from pGEX-DHFR, was eluted intact from the glutathione Sepharose as above for GST. Alternatively, cleavage of the linker between DHFR and GST was attempted by resuspending the glutathione Sepharose in 6 ml of PBS and incubating with 200 µl of 100 µg/ml Factor Xa (Amersham) at 4°C for 24 h with rotation. Protein cleaved from the bound GST was initially collected as flow-through from the column. Subsequent washing of the column with 6 ml of PBS retrieved any additional protein. Proteins eluted from the column with glutathione and those cleaved with Factor Xa were analyzed by SDS and native polyacrylamide gel electrophoresis.

Sample preparation

All GST proteins were buffer exchanged into PBS using HiTrap Desalt columns (Amersham). Pooled samples of GST-DHFR were concentrated in Centriprep and Centricon YM-30 concentrators (Millipore) whereas pooled GST samples required Centriprep and Centricon YM-10 concentrators (Millipore). Protein concentrations were determined by Bradford assay (Biorad). For R2, SAXS measurements were performed at 2.5, 5, 10, and 20 mg/ml; GST measurements were performed at 3, 6, 7.8, and 24 mg/ml; GST-DHFR measurements were performed at 3.7, 5.4, 8.1, and 14.4 mg/ml; nonclaret disjunctional (ncd) measurements were as described (Svergun et al., 2001b).

Scattering experiments, data processing, and analysis

The experimental x-ray scattering data from protein solutions were collected following standard procedures using the X33 camera (Boulin et al., 1986, 1988; Koch and Bordas, 1983) of the European Molecular Biology Laboratory on the storage ring DORIS III of the Deutsches Elektronen Synchrotron with multiwire proportional chambers with delay line readout (Gabriel and Dauvergne, 1982). The data processing (normalization, buffer subtraction, etc.) involved statistical error propagation using the program SAPOKO (D. I. Svergun and M. H. J. Koch, unpublished data). The scattering patterns from R2, GST, and GST-DHFR were recorded at sample-detector distances of 3.2 m and 1.4 m and a wavelength λ = 0.15 nm. The patterns recorded at the two sample-detector distances were merged to yield the final composite curves covering the range of momentum transfer 0.1 nm⁻¹ < s < 5.2 nm⁻¹. Additional details of the experimental procedures and the ncd data collection are described elsewhere (Svergun et al., 2001b). The value of Dmax was determined from the scattering patterns using the orthogonal expansion program ORTOGNOM (Svergun, 1993). The x-ray scattering patterns for simulated examples and those from the incomplete atomic models of proteins were computed from the structures taken from the PDB using the program CRYSOL (Svergun et al., 1995). The models without a one-to-one residue correspondence were superimposed using the program SUPCOMB (Kozin and Svergun, 2001), and those with such correspondence were superimposed with the algorithm of Kabsch (1978).
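
To relate the quoted momentum transfer range to real-space distances, the sketch below uses the conventional definitions s = 4π sin θ / λ (with 2θ the scattering angle) and d = 2π/s. These definitions are not restated in this section and are assumed here, as is a wavelength of 0.15 nm; the function names are hypothetical.

```python
import math


def bragg_spacing_nm(s_inv_nm: float) -> float:
    """Real-space distance d = 2*pi/s (nm) for a momentum transfer s (nm^-1)."""
    return 2.0 * math.pi / s_inv_nm


def scattering_angle_deg(s_inv_nm: float, wavelength_nm: float = 0.15) -> float:
    """Scattering angle 2*theta in degrees, assuming s = 4*pi*sin(theta)/lambda."""
    sin_theta = s_inv_nm * wavelength_nm / (4.0 * math.pi)
    return 2.0 * math.degrees(math.asin(sin_theta))


if __name__ == "__main__":
    for s in (0.1, 5.2):  # limits of the merged curves, nm^-1
        print(s, round(bragg_spacing_nm(s), 2), round(scattering_angle_deg(s), 2))
```

Under these assumptions the upper limit s = 5.2 nm⁻¹ corresponds to d ≈ 1.21 nm, which sets the nominal resolution of the data, and the lower limit s = 0.1 nm⁻¹ corresponds to d ≈ 63 nm.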


    RESULTS AND DISCUSSION

Computer programs and testing

The programs CREDO, CHADD, GLOOPY, and CHARGE all run on IBM PC-compatible machines under Windows 9x/NT/2000/XP and Linux as well as on major Unix platforms. To reduce the time required for computations, the model scattering intensity and the penalties are not recomputed after each modification of the structure but rather updated as previously described (Svergun et al., 2001a). All the programs are able to take into account particle symmetry by generating symmetry mates for the residues in the asymmetric unit (point groups P2 to P6 and P222 to P62 are supported). The programs were tested on simulated examples to adjust the parameters of the SA procedures. The values T0 = 10⁻³, NT = 5000√ND, and η = 0.9 were found to ensure convergence. The default values of the penalty weights for different algorithms are summarized in Table 1.
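
The role of the three SA parameters can be illustrated with a generic annealing loop. The sketch below is not the code of CREDO, CHADD, GLOOPY, or CHARGE: it is a minimal Metropolis scheme with geometric cooling, and it assumes that T0 is the starting temperature, that NT is the number of trial modifications attempted at each temperature, that ND is the number of dummy residues, and that η is the factor by which the temperature is multiplied after each stage. The `score` and `propose` callables and the `n_temps` limit are hypothetical placeholders.

```python
import math
import random


def moves_per_temperature(n_dummy: int, factor: int = 5000) -> int:
    """N_T = factor * sqrt(N_D), truncated to an integer."""
    return int(factor * math.sqrt(n_dummy))


def anneal(score, propose, state, n_dummy, t0=1e-3, eta=0.9,
           n_temps=50, moves=None, rng=None):
    """Generic simulated annealing: Metropolis acceptance, cooling T <- eta*T."""
    rng = rng or random.Random(0)
    n_t = moves if moves is not None else moves_per_temperature(n_dummy)
    current, f_cur = state, score(state)
    best, f_best = current, f_cur
    t = t0
    for _ in range(n_temps):
        for _ in range(n_t):
            candidate = propose(current, rng)
            f_new = score(candidate)
            delta = f_new - f_cur
            # Always accept improvements; accept deteriorations with
            # probability exp(-delta / T).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, f_cur = candidate, f_new
                if f_cur < f_best:
                    best, f_best = current, f_cur
        t *= eta  # geometric cooling
    return best, f_best


if __name__ == "__main__":
    # Toy one-dimensional target standing in for the real scoring function.
    toy_score = lambda x: (x - 3.0) ** 2
    toy_move = lambda x, rng: x + rng.uniform(-0.5, 0.5)
    print(moves_per_temperature(100))
    print(anneal(toy_score, toy_move, 0.0, n_dummy=100, moves=200))
```

With ND = 100, for example, the rule gives 50,000 trial moves per temperature, which shows why updating rather than recomputing the intensity after each move matters for run time.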


                              
TABLE 1   Types of the models, default penalty weights, and typical applications

Method validation using simulated examples

To validate reconstruction procedures, a simulated fusion protein was constructed using the crystallographic coordinates of hen egg-white lysozyme (129 residues, PDB file 6lyz) (Diamond, 1974) as the N-terminal domain, with bovine pancreas trypsin inhibitor (BPTI; 58 residues, PDB entry 4pti) (Marquart et al., 1983) fused to the C-terminus of the former protein (Fig. 4, a and b). The theoretical scattering curve of the fusion protein was computed using CRYSOL (Fig. 5, curve 1) and was then used to reconstruct the structure of the BPTI domain assuming that the lysozyme structure is known. The program CREDO was used to fit DR models to the simulated data yielding a reasonable representation of the overall shape of the BPTI. However, in some cases the BPTI domain was not oriented next to the C-terminus of lysozyme as in the original simulated fusion protein (see typical example in Fig. 4 a). This is not surprising given that information about the interface between the proteins is missing in the CREDO reconstruction. It is interesting to note that even though the location of the interface is incorrect, the overall low-resolution structure of the restored models after appropriate rotation and translation agrees well with that of the simulated fusion protein (Fig. 4 a). To improve relative domain orientation, we used the program CHADD, which explicitly uses information about the location of the interface. In the example presented here, the C-terminus of lysozyme was identified as the fusion point. The shapes of the resulting added domains obtained in independent runs of CHADD were consistent with the crystal structure of BPTI, although their positions varied by 0.3-0.5 nm. Fig. 4 b presents an averaged result of 10 independent runs, which predicts the correct position and shape of the BPTI domain fairly well.



FIGURE 4   Reconstruction of the missing domain in a fictitious fusion protein. A molecule of BPTI (green) is attached to the C-terminus of hen egg-white lysozyme (blue). The two molecules are displayed as Cα traces and the reconstructed models as semitransparent spheres. (a) Typical reconstruction by CREDO. The orientation with the lysozyme molecule overlapped is shown in red; the orientation yielding the best overlap with the entire complex is shown in green. (b) Average of five independent reconstructions by CHADD (probable shape and position of the BPTI domain is displayed in yellow). Comparison of the atomic structure of lysozyme with the models reconstructed by the program GLOOPY: green, correct fold of the missing loop; red, typical restored fold; blue, the rest of the structure. (c) Missing C-terminal tail; (d) missing N-terminal tail; (e) missing loop in the middle of the sequence. On all panels, the bottom view is rotated by 90° counterclockwise around the x axis.



FIGURE 5   Scattering patterns from the model lysozyme structures. (1) Complex with BPTI; (2) missing 10 residues at the C-terminal; (3) missing 15 residues at the N-terminal; (4) missing 15 residues in the middle. · · ·, scattering from full-length structures; --- --- ---, scattering from the models without the missing fragments; ------, scattering from the restored models.

To validate the loop reconstruction procedures, several lysozyme models were made containing deletions in the following regions: 1) residues 120-129 located at the C-terminus, 2) residues 1-15 containing an α-helix located at the N-terminus, and 3) residues 40-55 containing a β-sheet located on the surface of the structure. First, the theoretical scattering pattern of the intact protein was calculated using CRYSOL. Using this scattering pattern and the coordinates of each deletion model, missing loop regions were reconstructed using the program GLOOPY. In all cases, theoretical scattering curves of the reconstructed proteins, obtained after addition of the missing loop regions, gave good fits to the simulated scattering pattern of the intact protein (Fig. 5, curves 2-4). When compared with the Cα coordinates of the crystal structure of lysozyme, typical restored models (Fig. 4, c-e) have an overall RMSD of 0.17, 0.24, and 0.25 nm for deletions 1, 2, and 3, respectively. For comparison, generation of the missing fragments as random-walk self-avoiding chains yields average RMSD values of 0.37, 0.53, and 0.51 nm, respectively. Use of the program CHARGE for deletion 2 forces residues 5-15 to form an α-helix, thus further reducing the RMSD (to ~0.15-0.2 nm; results not shown).
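
For models with a one-to-one residue correspondence, an RMSD of this kind can be obtained after optimal rigid-body superposition. The following NumPy sketch implements the widely used SVD form of the Kabsch (1978) algorithm; it is an independent illustration, not the code used to produce the values above, and the function name is hypothetical. Coordinates are assumed to be Cα positions in nm, supplied as two arrays with matching residue order.

```python
import numpy as np


def kabsch_rmsd(p, q) -> float:
    """RMSD between two (N, 3) coordinate sets after superposing p onto q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Remove the translation by centring both sets on their centroids.
    p_c = p - p.mean(axis=0)
    q_c = q - q.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    u, _, vt = np.linalg.svd(p_c.T @ q_c)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = (rot @ p_c.T).T - q_c
    return float(np.sqrt((diff ** 2).sum() / len(p)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = rng.normal(size=(20, 3))
    angle = np.radians(40.0)
    rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
    moved = model @ rz.T + np.array([1.0, -2.0, 0.5])
    print(kabsch_rmsd(model, moved))  # effectively zero for a pure rigid-body move
```

The reflection guard matters for chain models: without it, a mirror-image fold could be scored as identical to the correct one.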

Conformational mobility in small loops/domains

The failure to observe structural elements in electron density maps arising from protein crystal structures is often due to conformational mobility or heterogeneity. The application of reconstruction methods offers the possibility of constructing a model for the missing loops or domains both in terms of their structure and their position in three-dimensional space. Two examples are presented below that illustrate these concepts.

In the first example, a truncated form of the Drosophila motor protein ncd was studied using SAXS (Svergun et al., 2001b). The native ncd protein is 700 residues in length. A construct named MC6 was made that expresses the C-terminal 368 residues (M333-K700) of ncd. This construct appears to be monomeric in solution as it lacks an N-terminal coiled-coil region (residues 196-347) that would otherwise mediate dimerization. Using crystallographic coordinates of an ncd variant (PDB entry 1cz7) (Kozielski et al., 1999), a partial three-dimensional model of MC6 was produced (Svergun et al., 2001b). This model lacked the 33 C-terminal residues absent in the crystal structure. The scattering curve computed from the MC6 model fails to fit the scattering pattern of the protein in solution (Fig. 6, curve 1; χ = 1.98). Addition of the missing loop using the programs GLOOPY and CHARGE in these studies significantly improved the fit (χ = 0.89). The loop conformations yielded by the programs in different independent reconstructions are similar to each other, suggesting a fan-like manifold of orientations (Fig. 7 a). In an earlier study using trial secondary structure motifs, this region was modeled as an antiparallel two-stranded β-sheet (Svergun et al., 2001b). The conformation of this tentative model fits within the plane of the fan and is also of a similar length to the conformations provided by GLOOPY and CHARGE. These results suggest that the loop is flexible in solution, moving predominantly in the plane of the fan.
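
The discrepancy χ quoted here and below is defined in the methods part of the paper, outside this section. As a hedged illustration, the sketch below uses a form commonly employed in solution scattering: the calculated curve is scaled to the experimental one by weighted least squares, and χ is the root of the mean squared error-weighted residual over N − 1. Whether this matches the authors' exact definition is an assumption, and the function names are hypothetical.

```python
import math


def scale_factor(i_exp, i_calc, sigma) -> float:
    """Weighted least-squares scale c minimising sum(((I_exp - c*I_calc)/sigma)^2)."""
    num = sum(e * c / s ** 2 for e, c, s in zip(i_exp, i_calc, sigma))
    den = sum(c * c / s ** 2 for c, s in zip(i_calc, sigma))
    return num / den


def chi(i_exp, i_calc, sigma) -> float:
    """Discrepancy between experimental and scaled calculated intensities."""
    k = scale_factor(i_exp, i_calc, sigma)
    total = sum(((e - k * c) / s) ** 2 for e, c, s in zip(i_exp, i_calc, sigma))
    return math.sqrt(total / (len(i_exp) - 1))


if __name__ == "__main__":
    # A calculated curve that differs from the data only by a scale gives chi = 0.
    print(chi([2.0, 4.0, 6.0], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
```

With correctly estimated errors, values near 1 indicate agreement within the noise, which is the sense in which the drop from 1.98 to 0.89 is read as a significant improvement.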



FIGURE 6   X-ray scattering patterns from MC6 construct (1), the protein R2 (2), GST (3), and its fusion with DHFR (4) (· · · with error bars); scattering of the crystallographic models where the missing fragments are absent (--- --- ---); and scattering from the reconstructed models (------). The scattering patterns are appropriately displaced in the logarithmic scale, and the inner part of the R2 pattern is shown in the inset for better visualization.



FIGURE 7   Reconstruction of missing loops in proteins. Crystallographic models are displayed in blue, with the reconstructed fragments as red Cα traces and yellow semitransparent spheres. (a) Missing loop in the MC6 construct obtained by GLOOPY and CHARGE (several independent reconstructions are displayed); (b) Missing C-terminal loop in dimeric R2 obtained by CHARGE (the α-helical portion of the variable domain and the portion for which no secondary structure was assumed are displayed in red and green, respectively); (c) A linker at the C-terminus in dimeric GST reconstructed by GLOOPY (right panel) and the structure of the GST monomer fused with a conserved neutralizing epitope GP41 (15 residues displayed in green, PDB entry 1gne); (d) A DHFR domain in a fusion GST+DHFR protein reconstructed by CREDO (spheres) and CHADD (Cα traces). Solution by CREDO is superimposed with the atomic model of DHFR on the right panel. The left panel presents the crystallographic structure of the GST fusion with α-Na,K-ATPase (36 residues displayed in green, PDB entry 1bg5). On all panels, the bottom view is rotated by 90° counterclockwise around the x axis.

The second example illustrates the use of information about secondary structure for the reconstruction of a missing loop. Specifically, the crystallographic model of a homodimeric protein R2 of ribonucleotide reductase from E. coli (PDB entry 1xik; molecular mass = 79 kDa) (Logan et al., 1996) was solved to 1.7-Å resolution containing 341 residues per monomer. The C-terminal 35 residues are missing in the crystal structure, and the scattering curve computed from the crystallographic model displays small but significant systematic deviations from the experimental data (χ = 1.30; Fig. 6, curve 2 and inset; S. Kuprin, Karolinska Institute, Stockholm, Sweden, personal communication, 1998). According to secondary structure prediction programs (Cuff and Barton, 1999, 2000; Cuff et al., 1998), a major portion of the missing fragment (residues 345-373) is predicted to form an α-helix. Fig. 7 b shows the position of a typical reconstruction of the fragment using the program CHARGE, which gives a significant improvement in the fit to the experimental data (χ = 1.07). The result suggests that the α-helix from each monomer subunit extends away from the core structure of the protein to produce a biantennary structure in the dimer. This structure is likely to occupy a number of conformations, which is consistent with the lack of interpretable electron density in the original crystal structure (Logan et al., 1996).

GST-fusion protein domains

The pGEX series of vectors (Amersham) (Smith and Johnson, 1988) are designed to enable inducible, high-level intracellular expression of genes as fusions with the Schistosoma japonicum GST, a 26-kDa protein forming homodimers in solution. Crystal structures are available for GSTs from a number of sources (Ji et al., 1992; Parker et al., 1990), including recombinant S. japonicum GST purified from pGEX-3X (Amersham) (McTigue et al., 1995). This recombinant S. japonicum GST contains an extra 13 residues at the C-terminus compared with the native S. japonicum GST, but this linker peptide is absent in the PDB entry (1gta) (McTigue et al., 1995). GST was expressed and purified for SAXS data collection and analysis from a similar plasmid in the pGEX series, pGEX-5X-1 (Amersham). Compared with the native S. japonicum GST, this GST has an extra 22 residues at the C-terminus. The x-ray solution scattering pattern from the latter protein is presented in Fig. 6. The scattering curve computed from a homodimer built from the crystallographic structure of GST lacking the C-terminal residues yields a poor fit to the experimental data (Fig. 6, curve 3; χ = 1.30). GLOOPY was used to model the missing linker, assuming P2 symmetry for the entire structure. Several independent runs produced similar extended conformations of the modeled linker (a typical result is presented in Fig. 7 c). The theoretical scattering of the GST crystal structure combined with this modeled linker gave a significant improvement in the fit to the experimental data with χ = 0.81.
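
Imposing P2 symmetry means that only the linker of one monomer is varied, and its symmetry mate is generated by a 180° rotation about the twofold axis. The sketch below shows this for a general n-fold axis. Placing the axis along z through the origin is an assumption made for illustration; the convention used internally by the programs is not stated here, and the function name is hypothetical.

```python
import math


def symmetry_mates(coords, n_fold: int = 2):
    """Return n_fold copies of `coords`, rotated about z by 2*pi*k/n_fold.

    `coords` is a list of (x, y, z) tuples for the asymmetric unit; the first
    copy (k = 0) is the input itself.
    """
    mates = []
    for k in range(n_fold):
        angle = 2.0 * math.pi * k / n_fold
        c, s = math.cos(angle), math.sin(angle)
        mates.append([(c * x - s * y, s * x + c * y, z) for x, y, z in coords])
    return mates


if __name__ == "__main__":
    linker = [(1.0, 2.0, 3.0), (1.5, 2.5, 3.2)]  # made-up coordinates, nm
    for copy in symmetry_mates(linker, n_fold=2):
        print(copy)
```

For P2 the second copy is the first with x and y negated (up to rounding), so the number of free parameters is halved relative to an unconstrained dimer.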

SAXS analysis was also performed on a GST fusion protein. The folA gene, encoding dihydrofolate reductase from E. coli K-12, was cloned into the same vector, pGEX-5X-1, from which GST was expressed. This enabled production of a fusion protein, GST-DHFR, consisting of the 218 residues of the S. japonicum GST followed by a 10-residue linker containing the Factor Xa cleavage site and 158 residues of E. coli K-12 DHFR. Fig. 6 (curve 4) shows the experimental scattering pattern of GST-DHFR. Models of the dimeric fusion protein were built using the programs CREDO and CHADD by fixing the structure of dimeric GST and then adding the linker and the DHFR domain as a variable part and assuming P2 symmetry for the entire complex. Both programs gave good fits to the experimental data with χ = 1.02 (Fig. 6, curve 4). The shape and position of the missing domain for each model reconstructed by the two methods were consistent with each other and also with the crystallographic model of DHFR (PDB entry 1ra9) (Sawaya and Kraut, 1997), as illustrated in Fig. 7 d. It would appear from comparison of the linker regions of GST and GST-DHFR that the configuration of the linker is different in each case (cf. Fig. 7 c). Attempts to cleave the linker region of GST-DHFR using Factor Xa were unsuccessful, even when the wt % was increased above 1% and the incubation time was >16 h. Even in the presence of 0.05% SDS, the percentage of protein cleaved was minimal. Resistance to proteolytic cleavage may well be due to a more compact conformation of the linker region in GST-DHFR and/or steric hindrance causing the Factor Xa cleavage site to be inaccessible. Although there are no crystal structures in the PDB of proteins >40 residues fused to GST, comparison of the structures of two GST fusions (Fig. 8, a and b), PDB entries 1gne (Lim et al., 1994) and 1bg5 (Zhang et al., 1998), with the model of GST-DHFR shows that there appears to be a similar orientation of GST and its fusion partner in all three cases.



FIGURE 8   Schematic illustration of the model types and additional information used by different programs.


    CONCLUSION

To summarize, four algorithms have been written to provide an appropriate tool for each of the various situations in which a structure lacks a loop or domain. The choice of method depends on the information available regarding the known part of the model, the missing fragment, and the interface. If a low-resolution model of the known part is available (e.g., from electron microscopy or from SAXS by ab initio methods (Svergun, 1999; Svergun et al., 2001a)), the location of the interface is usually unknown and the missing fragment can be added using the program CREDO. In this case, the result is a low-resolution model of the domain structure of the complex. For high-resolution models, the programs CHADD and GLOOPY can build missing loops and domains attached to specific residue(s). Furthermore, GLOOPY tries to construct native-like folds by accounting for excluded volumes of side chains, hydrophobic interactions, knowledge-based potentials, and the Cα bond and dihedral angles. If the secondary structure of the missing portion is known, the program CHARGE allows additional constraints to be applied to the model by incorporating α-helices and/or β-sheets in the variable fragment. As the model of an interconnected Cα chain used by CHARGE is less flexible than a free gas of residues implemented in the other programs, CHARGE is better suited to reconstruct missing loops rather than missing domains. The main features and possible applications of the four algorithms are summarized in Table 1 and Fig. 8.

Even though the programs CHADD, GLOOPY, and CHARGE yield the missing fragments in the form of folded Cα chains, these should be considered as approximate models only. Solution scattering, being a low-resolution method, does not provide an exact fold but rather a probable configuration of the volume occupied by the missing portion. In all of the above algorithms, scattering from the model is computed using Eqs. 1-3, which do not explicitly take averaging over possible different conformations of flexible loops or terminal fragments into account. Such an average would not significantly influence the results given the low resolution of the scattering data but would require much longer computation times. The methods aim to obtain a single equivalent conformation of the missing domain, and it is also useful to analyze the results of several independent SA runs to generate averaged probability maps. This analysis allows refinement of the shape and position of missing domains (see Fig. 4, a and b) and better visualization of regions occupied by the missing loops (Fig. 7 a). When using CHARGE, care must be taken not to restrict the model too much based on secondary structure predictions (which generally are no better than 70% accurate). In the above example for R2, all major techniques predicted an α-helix with high probability, which made it possible to use a long helical fragment for constructing the model in Fig. 7 b.
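
How the averaged probability maps were built is not described in this section. One simple possibility, given purely as an assumption, is to count for each voxel of a regular grid the fraction of independent SA runs that place at least one residue there. The sketch below does this in plain Python; the voxel size of 0.5 nm and the function name are hypothetical, and the models are assumed to share a common frame, which holds when the known part of the structure is kept fixed in every run.

```python
import math
from collections import Counter


def occupancy_map(models, voxel_nm: float = 0.5):
    """Map voxel index -> fraction of models with a residue in that voxel.

    `models` is a list of runs, each a list of (x, y, z) positions in nm.
    """
    counts = Counter()
    for model in models:
        # A set, so each run contributes at most once per voxel.
        voxels = {tuple(math.floor(c / voxel_nm) for c in xyz) for xyz in model}
        counts.update(voxels)
    n_models = len(models)
    return {voxel: hits / n_models for voxel, hits in counts.items()}


if __name__ == "__main__":
    runs = [[(0.1, 0.1, 0.1)], [(0.2, 0.2, 0.2)], [(1.2, 0.1, 0.1)]]
    for voxel, fraction in sorted(occupancy_map(runs).items()):
        print(voxel, round(fraction, 2))
```

Voxels with a high fraction outline the consensus shape and position of the missing part, whereas a broad spread of low-fraction voxels would suggest that the runs disagree or that the fragment is mobile.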

Missing loop residues can be added to known high-resolution structures using homology modeling (Mendelson and Morris, 1997; Perera et al., 2000). In general, the reliability of structures produced using homology modeling is high for short loops but decreases as the length of the fragment to be added is increased. Using solution scattering, the situation is precisely the opposite: the larger the missing fragment, the more significant its contribution to the entire scattering pattern, and the missing residues can be modeled more reliably. In practical terms, one can expect the methods presented here to be useful for missing fragments consisting of ~5-10% of the entire structure (20-40 residues for a 50-kDa protein) and higher. For shorter loops, homology modeling may be sufficient; however, solution scattering data can be used as an additional restraint (in particular, for rigid body refinement of the orientation of the fragment to be added) and can also be used for validation of the final model (see, e.g., Zheng and Doniach, 2002). Moreover, it should be stressed that the methods presented are not limited to amending crystallographic models with disordered loops but are also applicable to the addition of missing domains to low-resolution models and to fusion proteins, especially when no crystals are available.

In the model systems presented here, experimental SAXS data have allowed the reconstruction of both missing loops and domains, providing a structural description of disordered regions. The reconstructions are based on experimental data of ~1.2-nm resolution, but the actual resolution of the models may be higher because of the additional information used. In particular, histogram and angular penalties (Figs. 2 and 3) ensure adequate behavior of the model scattering curves at higher momentum transfers. In the case of the Drosophila motor and R2 ribonucleotide reductase proteins, modeling predicts extended structures from the surface of the globular core. Such structures could indeed show large flexibility in solution, which may explain why the regions could not be modeled from the crystallographic electron density maps. Modeling studies of GST expressed from the pGEX system give a description of the linker region, which was not visible in the original crystal structure (McTigue et al., 1995). The model of GST-DHFR also provides a visualization of how such fusions appear in solution. In particular, the linker region appears to adequately separate both globular domains, suggesting that GST does not, per se, directly influence folding of its partner in protein-protein interactions. In addition, the model shows how the fused protein (i.e., DHFR) can occlude the linker, resulting in resistance to protease digestion in this case. Taken together, these examples demonstrate how such reconstruction methods using SAXS data have the potential to add missing fragments to available high- or low-resolution protein models. Indeed, as three-dimensional structural information from larger multi-protein complexes emerges, the true potential of these techniques may be realized for modeling both domains and interfaces responsible for macromolecular assembly where inherent flexibility and conformational heterogeneity limit high-resolution visualization. Modeling using the protein structure representation as an ensemble of residues could also become useful for the interpretation of low-resolution crystallographic maps (Guo et al., 1999).

The executable codes of the programs CREDO, CHADD, GLOOPY, and CHARGE are available as Wintel beta releases from the EMBL-Hamburg website (http://www.embl-hamburg.de/ExternalInfo/Research/Sa