Biophys J, December 2002, p. 3113-3125, Vol. 83, No. 6


*European Molecular Biology Laboratory, Hamburg Outstation, D-22603 Hamburg, Germany; †Physics Department, Moscow State University, 117234 Moscow, Russia; ‡Department of Biological Sciences, Centre for Molecular Microbiology and Infection, Imperial College of Science, Technology and Medicine, London SW7 2AY, United Kingdom; and §Institute of Crystallography, Russian Academy of Sciences, 117333 Moscow, Russia
ABSTRACT
Inherent flexibility and conformational heterogeneity in
proteins can often result in the absence of loops and even entire domains in structures determined by x-ray crystallographic or NMR
methods. X-ray solution scattering offers the possibility of obtaining
complementary information regarding the structures of these disordered
protein regions. Methods are presented for adding missing loops or
domains by fixing a known structure and building the unknown regions to
fit the experimental scattering data obtained from the entire particle.
Simulated annealing was used to minimize a scoring function containing
the discrepancy between the experimental and calculated patterns and
the relevant penalty terms. In low-resolution models where interface
location between known and unknown parts is not available, a gas of
dummy residues represents the missing domain. In high-resolution models where the interface is known, loops or domains are represented as
interconnected chains (or ensembles of residues with spring forces
between the Cα atoms), attached to known position(s) in
the available structure. Native-like folds of missing fragments can be
obtained by imposing residue-specific constraints. After validation in
simulated examples, the methods have been applied to add missing loops
or domains to several proteins where partial structures were available.
INTRODUCTION
Protein function is related not only to the
three-dimensional arrangement of polypeptide chains but also to their
intrinsic mobility. Techniques such as x-ray crystallography and NMR
can yield high-resolution information regarding the positions of
individual atomic groups within a macromolecule, but flexible or
disordered regions may appear to be absent. Such regions may be of
significant functional importance and can include, for example, a loop
in an enzyme active site, a receptor-binding motif, or an antigenic epitope. In large multi-domain proteins, inherent flexibility between
domains can prevent successful crystallization, and in these cases
crystallographic or NMR data may be limited to studies of individual
domains produced by genetic or proteolytic methods. However, it is
apparent that complementary approaches are required to analyze the
structure of intact multi-domain proteins and assemblies, especially in
view of recent initiatives aimed at large-scale expression and
purification of proteins for subsequent structure determination (e.g.,
Edwards et al., 2000).
One such approach is small-angle x-ray scattering (SAXS) (Feigin and
Svergun, 1987). This technique can yield structural information about
macromolecules in solution, from proteins as small as 6 kDa (e.g.,
Sayers et al., 1999) to large macromolecular complexes such as the
ribosome (Svergun and Nierhaus, 2000). SAXS patterns result from an
average of the scattering from the entire ensemble of randomly oriented
particles in the sample, and this lowers the resolution of the method.
Nevertheless, in contrast to x-ray crystallographic analysis where
flexible regions of a structure may result in poorly interpretable
electron density, solution scattering patterns are sensitive to these
disordered regions, yielding information about their average
conformation. The SAXS method can thus provide (at low resolution)
information complementary to that of crystallography and NMR. Solution
scattering also permits one to construct models of multi-domain
proteins and macromolecular complexes from high-resolution structures
of individual domains or subunits. Rigid body modeling, successfully
used by different groups (Ashton et al., 1997; Krueger et al., 1997;
Svergun et al., 1997, 1998a, 2000), is an effective way to characterize
complex structures. Methods to compute solution scattering patterns accurately from atomic models and to evaluate scattering from
complex particles rapidly are now well established (Svergun, 1994, 1995,
1998b). These methods, coupled with three-dimensional display and
manipulation programs, allow interactive or automated searches of
positional parameters to fit the experimental scattering from the
complex (Konarev et al., 2001; Kozin and Svergun, 2000).
In cases where portions of a macromolecule or complex lack a
three-dimensional structure description, alternative methods are
required (beyond rigid-body refinement) to generate a model. For
example, the known part of the structure (either a high- or low-resolution model) can be fixed, and missing portions, such as
disordered loops or domains, can then be modeled to fit the
experimental scattering data obtained from the intact particle. In the
present paper, a recently proposed dummy-residues model (Svergun et
al., 2001a) is further developed to construct the algorithms for
complementing high- and low-resolution partial models of protein
structures. Simulated annealing is used to minimize a scoring function
containing the discrepancy between the experimental and calculated
patterns and relevant penalty terms. Where applicable, information
about the primary and secondary structure is used to restrain the model
and to provide native-like conformations of the missing structural fragments.
After validation in simulated examples, the potential of this approach
has been explored using three model systems. These methods have first
been applied to develop models for small contiguous loops (~30-35
residues), which are absent in the crystal structures of a
Drosophila motor protein (Kozielski et al., 1999) and the R2
protein of Escherichia coli ribonucleotide reductase (Logan et al., 1996). Second, reconstruction of an entire missing domain has
been attempted using experimentally observed scattering data from a
fusion protein. This fusion consists of Schistosoma
japonicum glutathione S-transferase (GST) and E. coli dihydrofolate reductase (DHFR). Although a few examples exist
of crystal structures of GST fused with relatively small fusion
fragments (Lim et al., 1994; Ware et al., 1999; Zhang et al., 1998),
proteins of interest are often isolated by proteolytic digestion of the
linker region (Nagai and Thogersen, 1984) before structural analysis.
Therefore, little is known about the conformation of the linker region
or the structure of a globular protein fused to GST. Using SAXS and the
reconstruction methods, new information regarding domain and linker
orientations in this popular fusion system is presented. In
combination, analysis of these model systems provides an insight into
the possible scope of these reconstruction techniques, from small loops
to multi-domain assemblies.
MATERIALS AND METHODS
Dummy-residues approach
The scattering intensity $I(s)$ from a dilute monodisperse solution of macromolecules is an isotropic function depending on the modulus of the scattering vector $\mathbf{s} = (s, \Omega)$, where $\Omega$ is the solid angle in reciprocal space, $s = (4\pi/\lambda)\sin\theta$, $\lambda$ is the wavelength, and $2\theta$ is the scattering angle. The x-ray scattering intensity is proportional to the scattering from a single particle averaged over all orientations and can be expressed as:
$$I(s) = \left\langle \left| A_a(\mathbf{s}) - \rho_s A_s(\mathbf{s}) + \delta\rho_b A_b(\mathbf{s}) \right|^2 \right\rangle_{\Omega} \tag{1}$$
where $A_a(\mathbf{s})$ is the scattering amplitude from the particle in vacuo, and $A_s(\mathbf{s})$ and $A_b(\mathbf{s})$ are the scattering amplitudes from the excluded volume and the hydration shell, respectively. The density of the bulk solvent, $\rho_s$, may differ from that of the hydration shell, $\rho_b$, yielding a nonzero contrast for the shell $\delta\rho_b = \rho_b - \rho_s$ (Svergun et al., 1995).
In the method (Svergun et al., 2001a), the protein structure is represented by an ensemble of dummy residues (DRs) centered at the positions of virtual Cα atoms. A simulated annealing (SA) procedure (Kirkpatrick et al., 1983) is used to find DR positions by fitting the experimental data and simultaneously providing a chain-compatible structure. This is achieved by minimizing a scoring function $E(\mathbf{r}) = \chi^2 + \sum_i \alpha_i P_i(\mathbf{r})$, where $\chi^2$ is the discrepancy between the experimental and calculated scattering patterns and the penalties $P_i(\mathbf{r})$ restrain the solution to ensure a chain-compatible arrangement of the DRs. The weights $\alpha_i$ are selected in such a way that the total penalty $\sum_i \alpha_i P_i(\mathbf{r})$ yields a significant contribution (~10-50%) to $E(\mathbf{r})$ at the end of the minimization. It has been demonstrated that the DR representation adequately represents solution scattering patterns up to a resolution of 0.5 nm and that the method allows an ab initio restoration of domain structures of proteins (Svergun et al., 2001a). In the present paper, the DR approach is extended to build missing domains or loops around a known part of the protein structure. Depending on the information available, four models are considered, differing by the representation of the missing portion of the structure and by the set of constraints.
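The structure of the scoring function, a discrepancy term plus weighted penalties, can be illustrated with a minimal sketch. The function name and the numerical values below are hypothetical and serve only to show how the penalty share of E (the ~10-50% mentioned above) might be monitored; this is not the code of the programs described here.

```python
def scoring_function(chi2, penalties, weights):
    """Return E = chi2 + sum_i(weight_i * penalty_i) and the penalty share of E."""
    if len(penalties) != len(weights):
        raise ValueError("one weight per penalty is required")
    total_penalty = sum(w * p for w, p in zip(weights, penalties))
    energy = chi2 + total_penalty
    # Fraction of E contributed by the penalties (0 if E is not positive).
    share = total_penalty / energy if energy > 0 else 0.0
    return energy, share


# Hypothetical values: chi2 = 1.2 and two penalties with weights 0.5 and 2.0.
energy, share = scoring_function(1.2, [0.4, 0.1], [0.5, 2.0])
```

With these illustrative numbers the penalties contribute a quarter of E, which would fall inside the range quoted in the text.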
Computation of the scattering intensity
The scattering intensity $I_{mod}(s)$ from a protein model consisting of N DRs positioned at $\mathbf{r}_i$ is calculated as described (Svergun et al., 2001a) using Debye's formula (Debye, 1915):
$$I_{mod}(s) = \sum_{i}\sum_{j} g_i(s)\, g_j(s)\, \frac{\sin(s r_{ij})}{s r_{ij}} \tag{2}$$
where $g_i(s)$ is the form factor of the ith point and $r_{ij} = |\mathbf{r}_i - \mathbf{r}_j|$ is the distance between the ith and jth points. To generate the hydration shell of thickness $\Delta r$ = 0.3 nm, the most distant residue is found along each direction of a quasi-uniform angular grid of M ≈ N vectors, and a solvent atom with a form factor $g_i(s)$ that depends on $r$, $\Delta r$, and the shell contrast $\delta\rho_b$ is placed 0.5 nm outside the protein. Following previously published methods (Svergun et al., 1995), [...] $g(s)$ (Fig. 1 a). The DR form factor $g_i(s) = g(s)$ is taken when using models that do not account for the primary structure of the protein. To account for the internal residue structure, a correction factor $c(s)$ is introduced. More than 100 proteins with known structures were taken from the Protein Data Bank (PDB) (Bernstein et al., 1977), [...] $c(s) = I_{full}(s)/I_{mod}(s)$ is evaluated over the ensemble (Fig. 1 b). As demonstrated in Svergun et al. (2001a), [...] $c(s) I_{mod}(s)$ yields an adequate representation of the scattering pattern of a protein up to a resolution of 0.5 nm.
For the DR models accounting for the primary structure, form factors of the individual residues (Fig. 1 a) are computed by averaging the form factors of different conformations in the PDB files. The correction factor calculated as described above for the case of DRs is presented in Fig. 1 b. Simulations performed on proteins with known structures demonstrated, not unexpectedly, that the use of individual residues yields an even better accuracy than the dummy residues.
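Debye's formula sums the interference term sin(s r)/(s r) over all pairs of scattering points. The sketch below is a direct O(N²) illustration of that double sum in pure Python; the function name and argument layout are assumptions made for this example, and the form factors are supplied by the caller as plain numbers evaluated at the chosen s.

```python
import math


def debye_intensity(s, positions, form_factors):
    """Debye sum I(s) = sum_i sum_j g_i g_j sin(s r_ij) / (s r_ij).

    positions: list of (x, y, z) tuples in nm.
    form_factors: list of g_i values already evaluated at this s.
    """
    total = 0.0
    n = len(positions)
    for i in range(n):
        total += form_factors[i] ** 2  # i == j term: sin(x)/x -> 1
        for j in range(i + 1, n):
            x = s * math.dist(positions[i], positions[j])
            sinc = math.sin(x) / x if x > 1e-12 else 1.0
            # Each unordered pair appears twice in the double sum.
            total += 2.0 * form_factors[i] * form_factors[j] * sinc
    return total
```

At s = 0 the sum reduces to the square of the total form factor, which is a convenient sanity check for any implementation.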
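The discrepancy $\chi^2$ between the calculated curve $I_{calc}(s)$ and the experimental data $I_{exp}(s)$ measured at n points $s_j$, j = 1, ... n, is computed as:
$$\chi^2 = \frac{1}{n-1} \sum_{j=1}^{n} \left[ \frac{I_{exp}(s_j) - \mu I_{calc}(s_j)}{\sigma(s_j)} \right]^2 \tag{3}$$
where $\sigma(s_j)$ are the experimental errors and µ is an overall scaling coefficient.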
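A minimal sketch of such a discrepancy calculation follows. Two points are assumptions rather than statements from the text: the scaling coefficient is chosen here by weighted least squares, and the sum is normalized by n - 1.

```python
def chi_squared(i_exp, i_calc, sigma):
    """Return (chi2, mu) for experimental data, a calculated curve, and errors.

    mu is taken as the weighted least-squares scale factor (an assumption),
    and chi2 is normalized by n - 1 (also an assumption).
    """
    n = len(i_exp)
    if n < 2 or len(i_calc) != n or len(sigma) != n:
        raise ValueError("need at least two points and equal-length inputs")
    num = sum(e * c / sg ** 2 for e, c, sg in zip(i_exp, i_calc, sigma))
    den = sum(c * c / sg ** 2 for c, sg in zip(i_calc, sigma))
    mu = num / den
    chi2 = sum(((e - mu * c) / sg) ** 2
               for e, c, sg in zip(i_exp, i_calc, sigma)) / (n - 1)
    return chi2, mu
```

A calculated curve that differs from the data only by a constant factor gives a discrepancy of zero, since the factor is absorbed by the scaling coefficient.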
Simulated annealing protocol
For all the models described here, SA (Kirkpatrick et al., 1983) is used for global minimization of the scoring function. The idea of the method is to perform random modifications of the system (i.e., of the current residue arrangement), always accepting moves to configurations that decrease the scoring function $E(\mathbf{r})$ but occasionally also accepting moves to configurations that increase $E(\mathbf{r})$. The probability of accepting the latter moves decreases in the course of the minimization (the system is cooled). At the beginning, the temperature is high and the changes are almost random, whereas at the end a configuration with (nearly) minimum $E(\mathbf{r})$ is reached. The algorithm is implemented in its faster simulated quenching (Ingber, 1993; Press et al., 1992) version as follows.
1) The known part of the structure is loaded, moved to the origin, and kept fixed during minimization. The rest of the structure is then generated depending on the model used. A value of the goal function $E(\mathbf{r})$ is computed and a high starting temperature $T_0$ is selected.
2) A random modification (move from $\mathbf{r}$ to $\mathbf{r}'$) of the system is performed (specific ways of generation and modification of the system are considered below).
3) Positions of the solvent atoms accounting for the border solvent layer are updated if necessary and the difference $\Delta E = E(\mathbf{r}') - E(\mathbf{r})$ is computed. If $\Delta E < 0$, the move is accepted; if $\Delta E \geq 0$, the move is accepted with a probability $\exp(-\Delta E/T)$.
4) Steps 2 and 3 are repeated a sufficient number of times $N_T$ to equilibrate the system, and the temperature is then lowered ($T' = \alpha T$, $\alpha < 1$). The system is cooled until no improvement in $E(\mathbf{r})$ is observed.
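The annealing steps above can be sketched generically as follows. This is an illustration on a toy one-dimensional problem, not the implementation used in the programs described here; the function names, the stopping rule (a fixed number of temperature levels without improvement), and all parameter values are assumptions.

```python
import math
import random


def anneal(energy, propose, start, t0=1.0, cooling=0.9,
           moves_per_temperature=200, patience=5, max_levels=500, seed=0):
    """Minimize energy(state) by simulated annealing with geometric cooling."""
    rng = random.Random(seed)
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    temperature, stalled = t0, 0
    for _ in range(max_levels):
        improved = False
        for _ in range(moves_per_temperature):
            candidate = propose(current, rng)
            e_candidate = energy(candidate)
            delta = e_candidate - e_current
            # Downhill moves are always accepted; uphill moves are accepted
            # with probability exp(-delta / T).
            if delta < 0 or rng.random() < math.exp(-delta / temperature):
                current, e_current = candidate, e_candidate
                if e_current < e_best:
                    best, e_best, improved = current, e_current, True
        stalled = 0 if improved else stalled + 1
        if stalled >= patience:
            break  # no improvement for several temperature levels
        temperature *= cooling
    return best, e_best


# Toy problem (hypothetical): minimize (x - 2)^2 starting far from the minimum.
def toy_energy(x):
    return (x - 2.0) ** 2


def toy_move(x, rng):
    return x + rng.gauss(0.0, 0.1)


x_best, e_best = anneal(toy_energy, toy_move, start=-5.0)
```

In a real reconstruction the state would be the set of residue coordinates and the proposal a residue displacement or rotation, with the scoring function updated incrementally rather than recomputed.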
Types of dummy-residue models
Free dummy-residues model
This model is an extension of the original DR model (Svergun et al., 2001a).
[Eq. 4 is not recoverable from the source.]
$$P_{dst}(\mathbf{r}) = \sum_k \left[ W(R_k) \left( N_{mod}(R_k) - N(R_k) \right) \right]^2 \tag{5}$$
where $N(R_k)$ is a histogram of the average number of Cα atoms in a 0.1-nm-thick spherical shell surrounding a given Cα atom as a function of the shell radius for 0 < $R_k$ < 1 nm observed for real proteins, $N_{mod}(R_k)$ is such a histogram for the model, and the weights $W(R_k)$ are inversely proportional to the variations of $N(R_k)$ (Fig. 2 a). The summation in Eq. 5 is performed over the DRs in the variable portion of the model.
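The neighbor histogram described above can be sketched in a few lines. The function names are hypothetical, and the penalty is written as a weighted squared deviation between the model and reference histograms, which is an assumed form for illustration; the reference histogram and weights would have to come from an analysis of real protein structures.

```python
import math


def neighbor_histogram(coords, shell=0.1, r_max=1.0):
    """Average number of neighbors per residue in consecutive spherical shells.

    coords: list of (x, y, z) C-alpha positions in nm.
    Returns a list with one value per shell of thickness `shell` up to r_max.
    """
    n_bins = round(r_max / shell)
    counts = [0] * n_bins
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            if i == j:
                continue
            k = int(math.dist(a, b) / shell)
            if k < n_bins:
                counts[k] += 1
    return [c / len(coords) for c in counts]


def histogram_penalty(model_hist, reference_hist, weights):
    """Weighted squared deviation between model and reference histograms."""
    return sum((w * (m - r)) ** 2
               for w, m, r in zip(weights, model_hist, reference_hist))


# Three residues on a straight line, 0.38 nm apart (illustrative only).
hist = neighbor_histogram([(0.0, 0.0, 0.0), (0.38, 0.0, 0.0), (0.76, 0.0, 0.0)])
```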
[Eqs. 6 and 7 are not recoverable from the source.]
Dummy-residues model with spring forces between neighbors
This approach builds chains of DRs attached to given point(s) or residue(s) in the known part of the structure. In contrast to the previous model, it is explicitly required that the ith DR be separated by 0.38 nm from the (i + 1)th one. The scoring function is:
[Eqs. 8 and 9 are not recoverable from the source.]
[...] chain attached to the given point(s) in the known structure. This algorithm (program CHADD) is useful for adding missing loops or terminal portions to high-resolution models but can also be used for missing domain restoration.
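The requirement that consecutive residues be 0.38 nm apart can be illustrated with a simple harmonic restraint. The quadratic form and the function name below are assumptions made for this example only; the text does not preserve the actual expression of the spring penalty.

```python
import math


def spring_penalty(chain, bond=0.38):
    """Sum of squared deviations of neighbor distances from `bond` (in nm).

    chain: list of (x, y, z) positions of consecutive residues.
    The quadratic form is a hypothetical stand-in for a spring potential.
    """
    return sum((math.dist(a, b) - bond) ** 2
               for a, b in zip(chain, chain[1:]))
```

An ideally spaced chain gives a penalty of zero, and the penalty grows as neighbors drift closer together or further apart.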
Individual-residues model with spring forces between neighbors
This model is similar to the previous one but accounts for the primary structure of the protein. Not only is the scattering intensity computed using the individual form factors, but also residue-specific information is formulated as additional penalties to further restrain the solution and to generate native-like folds of the missing loop/domain. The scoring function has the form:
[Eqs. 10-12 are not recoverable from the source.]
[...] (j − 1)th, jth, and (j + 1)th residues. The penalty of Eq. 12 forces the hydrophobic residues to be buried in the interior of the protein. Here, $B_j$ is the number of all neighbors of the jth residue except for the (j − 2)th, (j − 1)th, (j + 1)th, and (j + 2)th residues (the neighboring distance equals 1 nm).
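The neighbor count used in the burial penalty follows directly from its definition: residues within 1 nm are counted, skipping the two sequence neighbors on either side. A minimal sketch, with a hypothetical function name:

```python
import math


def burial_counts(coords, cutoff=1.0, skip=2):
    """Number of spatial neighbors of each residue within `cutoff` nm.

    Residues closer than or equal to `skip` positions along the chain
    (including the residue itself) are not counted.
    """
    counts = []
    for j, a in enumerate(coords):
        counts.append(sum(1 for i, b in enumerate(coords)
                          if abs(i - j) > skip and math.dist(a, b) <= cutoff))
    return counts


# Five residues on a straight line, 0.3 nm apart (illustrative geometry only).
line = [(0.3 * k, 0.0, 0.0) for k in range(5)]
b_values = burial_counts(line)
```

For this extended chain every count is small, as expected for exposed residues; a compact fold would give larger counts for residues in the interior.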
The penalty $P_{eng}$ uses knowledge-based potentials to minimize the empirical free energy of the model. The interaction potentials between residues in proteins can be computed from the analysis of the PDB structures (Miyazawa and Jernigan, 1999).
[Eq. 13 and the adjoining text are not recoverable from the source.]
[...] backbone only, and the excluded volume effects between the backbone atoms due to the penalties $P_{dst}$ and $P_{spr}$ do not account for the side chains. To compensate for this, pseudo-Cβ atoms representing the side chains are introduced following the lolly-loop model (Aszodi et al., 1995). The Cα-Cβ vector depends on the positions of the (i − 1)th, ith, and (i + 1)th Cα atoms. The Cα-Cβ distance and the van der Waals radius $r_i^{\beta}$ of the pseudo-Cβ atom depend on the type of the ith residue. The additional excluded volume effect is taken into account by minimizing the averaged cross-volume of all spheres representing the Cα (van der Waals radius $r^{\alpha}$ = 0.19 nm) and pseudo-Cβ atoms:
[Eq. 14 is not recoverable from the source.]
[...] atom belonging to the ith residue with the [...] atom belonging to the jth residue, respectively.
The two other penalties impose restrictions on the distribution of bond and dihedral angles of the model chain. It is well known (Irbaeck et al., 1997) that the Cα-Cα-Cα bond angles in a protein backbone have a specific distribution. Fig. 2 b presents a histogram of the distribution of Cα-Cα-Cα angles $F(\theta_k)$ averaged over more than 100 protein structures deposited in the PDB. Similar to the neighbors penalty, $P_{dst}$ (Eq. 5), the bond angle penalty is computed as:
[Eq. 15 is not recoverable from the source.]
where $F_{mod}(\theta_k)$ is the histogram of the current model (bin step equals 5°).
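A histogram of virtual bond angles with a 5° bin step can be computed from consecutive triples of Cα positions. The sketch below is illustrative; the function names are hypothetical and the histogram is left unnormalized.

```python
import math


def bond_angle(a, b, c):
    """Angle a-b-c in degrees, with b at the vertex."""
    u = [a[k] - b[k] for k in range(3)]
    v = [c[k] - b[k] for k in range(3)]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def angle_histogram(chain, bin_step=5.0):
    """Counts of bond angles along the chain in bins of `bin_step` degrees."""
    n_bins = int(180.0 / bin_step)
    hist = [0] * n_bins
    for a, b, c in zip(chain, chain[1:], chain[2:]):
        k = min(int(bond_angle(a, b, c) / bin_step), n_bins - 1)
        hist[k] += 1
    return hist
```

Comparing such a model histogram with a reference distribution from known structures is what the bond angle penalty is described as doing.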
Fig. 3 displays a histogram of the distribution of Cα-Cα-Cα-Cα dihedral angles versus Cα-Cα-Cα angles (quasi-Ramachandran plot) computed by averaging the distributions for the above PDB models. Following Kleywegt (1997), [...] Cα-Cα-Cα-Cα dihedral angles versus Cα-Cα-Cα bond angles in the model is attributed to a cell in the quasi-Ramachandran plot, and the sum:
[Eq. 16 and the remainder of this sentence are not recoverable from the source.]
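Assigning a residue to a cell of such a plot requires the virtual dihedral angle defined by four consecutive Cα atoms. The sketch below computes it with the usual atan2 formulation (cis = 0°, sign following the standard right-handed convention) and maps a dihedral/bond-angle pair to a cell index. The 10° cell size and the function names are assumptions for illustration; the text does not state the grid used.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle p0-p1-p2-p3 in degrees, in the range (-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def plot_cell(dihedral_deg, bond_angle_deg, cell_deg=10.0):
    """Cell indices (dihedral bin, bond-angle bin); cell size is hypothetical."""
    return (int((dihedral_deg + 180.0) // cell_deg),
            int(bond_angle_deg // cell_deg))
```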
Folding of a model chain composed of individual residues
The most straightforward protein model consisting of Cα atoms is an interconnected polypeptide chain. This model does not require a connectivity constraint, and the secondary structure elements, if known, can be easily introduced. The chain model is less flexible than the gas of residues, and this increases the chances of being trapped in an incorrect conformation during minimization. As indicated in Svergun et al. (2001a), [...] chain led to a manifold of native-like models with different fold topologies. The chain model is, however, very useful as a means of restoring the conformation of shorter fragments such as missing loops.
The missing loop(s) are attached to the appropriate residue(s) in the
known part of the structure, initially as random-walk chain(s) with a
step of 0.38 nm between joints. If a specific portion of the loop is known to form an α-helix or β-sheet (e.g., from secondary structure prediction), an idealized secondary structure template of the appropriate length is inserted. The scoring function is the same as in Eq. 10 but without the spring potentials $P_{spr}$. Two types of moves, local and global, are used to modify the variable part of the model while maintaining the distances between adjacent residues and preserving the secondary structure. In both cases, a residue is selected at random among those not belonging to the secondary structure elements. A local move involves a random rotation of the residue around the axis drawn through its two neighbors. For a global move, made after every $N_D$ local moves, a second residue not belonging to the secondary structure elements is selected in the variable domain. The part of the chain between the two selected residues is rotated by an arbitrary angle around the axis drawn through these two residues.
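Both move types reduce to rotating one or more points about an axis defined by two fixed residues, which leaves the distances to those two residues unchanged. A minimal sketch using Rodrigues' rotation formula is given below; the function name is hypothetical and the example is not taken from the programs described here.

```python
import math


def rotate_about_axis(point, axis_start, axis_end, angle_rad):
    """Rotate `point` by `angle_rad` about the line axis_start -> axis_end."""
    k = [axis_end[i] - axis_start[i] for i in range(3)]
    length = math.sqrt(sum(c * c for c in k))
    k = [c / length for c in k]  # unit vector along the rotation axis
    v = [point[i] - axis_start[i] for i in range(3)]
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    k_cross_v = (k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0])
    k_dot_v = sum(k[i] * v[i] for i in range(3))
    # Rodrigues' formula: v cos(a) + (k x v) sin(a) + k (k . v)(1 - cos(a))
    rotated = [v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1.0 - c)
               for i in range(3)]
    return tuple(axis_start[i] + rotated[i] for i in range(3))
```

For a local move the axis runs through the two chain neighbors of the selected residue; for a global move the same function would be applied to every residue between the two selected ones.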
The algorithm, implemented in the program CHARGE, is aimed at restoring
the conformation of the missing loop(s) and is most useful if
information about their secondary structure is available.
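The random-walk starting chain mentioned above can be generated by taking steps of fixed length in uniformly random directions from the attachment point. The sketch below is illustrative only: the function name and the seeded random generator are assumptions, and no self-avoidance or secondary structure template is applied.

```python
import math
import random


def random_walk_chain(anchor, n_residues, step=0.38, seed=0):
    """Return n_residues positions forming a random walk from `anchor`.

    Each step has length `step` (nm) and a direction drawn uniformly on the
    unit sphere. The anchor itself is not included in the result.
    """
    rng = random.Random(seed)
    chain = [tuple(anchor)]
    for _ in range(n_residues):
        z = rng.uniform(-1.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        rho = math.sqrt(1.0 - z * z)
        x0, y0, z0 = chain[-1]
        chain.append((x0 + step * rho * math.cos(phi),
                      y0 + step * rho * math.sin(phi),
                      z0 + step * z))
    return chain[1:]


loop = random_walk_chain((0.0, 0.0, 0.0), 33)
```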
Materials
Oligonucleotides were synthesized by Sigma-Genosys (Pampisford,
UK). Media reagents were from Merck (Lutterworth, UK). Isopropyl
β-D-thiogalactoside was from Genesys (London, UK).
Taq polymerase chain reaction (PCR) Ready-To-Go beads,
precast native and SDS polyacrylamide gels, protein molecular weight
standards, Coomassie Brilliant Blue, glutathione Sepharose 4B, and
Factor Xa were from Amersham Pharmacia Biotech (St. Alban's, UK). DNA
molecular weight standards were from Gibco BRL (Paisley, UK). DNA
Qiaquick gel extraction and Miniprep kits were from Qiagen (Crawley,
UK). Restriction enzymes and T4 DNA ligase were from New England
Biolabs (Hitchin, UK). Protein Microcon, Centricon, and Centriprep
devices were from Millipore (Watford, UK). Bradford assay reagent and
disposable plastic columns were from Biorad (Hemel Hempstead, UK). All
other chemicals were from Sigma-Aldrich (Poole, UK).
Construction of plasmid
Plasmid pGEX-DHFR is a pGEX-5X-1 derivative (Amersham). The plasmid encodes Schistosoma japonicum GST, a 10-residue C-terminal linker peptide (containing a protease cleavage site) and the Escherichia coli folA-encoded DHFR. The folA gene was PCR amplified from a genomic DNA preparation of E. coli K-12 cells using primer 1 (5'-GAGTGGATCCCTATCAGTCTGATTGCGGCG-3'), which contains a BamHI restriction site upstream of the second codon (ATC), and primer 2 (5'-CTATCTCGAGTTACCGCCGCTCCAGAAT-3'), which incorporates a unique XhoI restriction site downstream of the TAA stop codon. One cycle of 96°C for 5 min followed by 35 cycles of 96°C for 1 min, 50°C for 1 min, and 72°C for 1.5 min, linked to a final cycle of 72°C for 10 min, generated a 500-bp PCR fragment encoding the E. coli K-12 folA gene. This fragment was gel purified, digested with BamHI and XhoI, and ligated into the BamHI-XhoI restriction sites of the pGEX-5X-1 vector to produce plasmid pGEX-DHFR. Initial clones were obtained by heat-shock transformation into E. coli strain BL21-CODONPLUS(DE3)-RIL. The presence of the folA gene was confirmed by restriction digestion of the transformed construct and by DNA sequencing with an ABI/Perkin-Elmer 377 Automated Sequencer (Perkin-Elmer Applied Biosystems, Norwalk, CT) using the dideoxy method with BigDye Terminator Ready Reaction Kits (Perkin-Elmer).
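The primer design described above can be checked with a few lines of code. The sketch below locates the BamHI (GGATCC) and XhoI (CTCGAG) recognition sequences in the two primers and confirms that the reverse complement of primer 2 carries a TAA stop codon directly upstream of the XhoI site. The helper names are hypothetical; the sequences are those quoted in the text.

```python
PRIMER_1 = "GAGTGGATCCCTATCAGTCTGATTGCGGCG"
PRIMER_2 = "CTATCTCGAGTTACCGCCGCTCCAGAAT"
SITES = {"BamHI": "GGATCC", "XhoI": "CTCGAG"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def site_position(primer, enzyme):
    """0-based index of the enzyme recognition site in the primer, or -1."""
    return primer.find(SITES[enzyme])


bamhi_at = site_position(PRIMER_1, "BamHI")
xhoi_at = site_position(PRIMER_2, "XhoI")
# On the coding strand, primer 2 reads as its reverse complement.
stop_then_xhoi = "TAA" + SITES["XhoI"] in reverse_complement(PRIMER_2)
```

Both sites sit four bases from the 5' end of their primers, leaving a short overhang for the restriction enzymes.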
Protein purification
GST and GST-DHFR proteins were purified as follows. One liter of
2XYT (1.6% (w/v) tryptone, 1.0% (w/v) yeast extract, and 0.5% (w/v) NaCl in distilled water) containing 100 µg/ml ampicillin was inoculated with 10 ml of an overnight culture of E. coli
BL21-CODONPLUS(DE3)-RIL transformed with pGEX-DHFR. Cells
were grown at 37°C with shaking until a cell density corresponding to
an OD600 of 0.6 was reached. Isopropyl
β-D-thiogalactoside was then added to a final
concentration of 1 mM, and growth was allowed to continue for another
4 h. Centrifugation of the culture at 8000 × g
for 20 min yielded a cell pellet that was resuspended in PBS (0.01 M
KH2PO4/K2HPO4
buffer, 0.0027 M KCl, 0.137 M NaCl, pH 7.4; Sigma-Aldrich). Cells were
lysed by sonication with three 30-s bursts at full power, and insoluble material was removed by centrifugation at 12,000 × g
for 45 min.
Six milliliters of 50% (w/v) glutathione Sepharose 4B (Amersham) in PBS was added to the cell lysate supernatant (typically, 15 ml), which was then incubated at 4°C for 1.5 h with rotation. The material was transferred into a plastic column (Biorad) and washed seven times with 10 ml of PBS. GST, produced from pGEX-5X-1, was eluted by resuspending the glutathione Sepharose in 5 ml of 10 mM glutathione in 50 mM Tris-HCl, pH 8.0, and collecting the flow through from the column after incubation for 10 min at room temperature. This was repeated twice more to retrieve all the GST protein. GST-DHFR, produced from pGEX-DHFR, was eluted intact from the glutathione Sepharose as above for GST. Alternatively, cleavage of the linker between DHFR and GST was attempted by resuspending the glutathione Sepharose in 6 ml of PBS and incubating with 200 µl of 100 µg/ml Factor Xa (Amersham) at 4°C for 24 h with rotation. Protein cleaved from the bound GST was initially collected as flow-through from the column. Subsequent washing of the column with 6 ml of PBS retrieved any additional protein. Proteins eluted from the column with glutathione and those cleaved with Factor Xa were analyzed by SDS and native polyacrylamide gel electrophoresis.
Sample preparation
All GST proteins were buffer exchanged into PBS using HiTrap
Desalt columns (Amersham). Pooled samples of GST-DHFR were concentrated in Centriprep and Centricon YM-30 concentrators (Millipore) whereas pooled GST samples required Centriprep and Centricon YM-10
concentrators (Millipore). Protein concentrations were determined by
Bradford assay (Biorad). For R2, SAXS measurements were performed at
2.5, 5, 10, and 20 mg/ml; GST measurements were performed at 3, 6, 7.8, and 24 mg/ml; GST-DHFR measurements were performed at 3.7, 5.4, 8.1, and 14.4 mg/ml; nonclaret disjunctional (ncd) measurements were as
described (Svergun et al., 2001b).
Scattering experiments, data processing, and analysis
The experimental x-ray scattering data from protein solutions were collected following standard procedures using the X33 camera (Boulin et al., 1986, 1988; Koch and Bordas, 1983) of the European Molecular Biology Laboratory on the storage ring DORIS III of the Deutsches Elektronen Synchrotron with multiwire proportional chambers with delay line readout (Gabriel and Dauvergne, 1982). The data processing (normalization, buffer subtraction, etc.) involved statistical error propagation using the program SAPOKO (D. I. Svergun and M. H. J. Koch, unpublished data). The scattering patterns from R2, GST, and GST-DHFR were recorded at sample-detector distances of 3.2 m and 1.4 m at a wavelength λ = 0.15 nm. The scattering patterns recorded at the two sample-detector distances were merged to yield the final composite curves covering the range of momentum transfer 0.1 nm⁻¹ < s < 5.2 nm⁻¹. Additional details of the experimental procedures and the ncd data collection are described elsewhere (Svergun et al., 2001b). The value of $D_{max}$ was determined from the scattering patterns using the orthogonal expansion program ORTOGNOM (Svergun, 1993). The x-ray scattering patterns for simulated examples and those from the incomplete atomic models of proteins were computed from the structures taken from the PDB using the program CRYSOL (Svergun et al., 1995). The models without a one-to-one residue correspondence were superimposed using the program SUPCOMB (Kozin and Svergun, 2001), and those with such correspondence were superimposed with the algorithm of Kabsch (1978).
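The momentum transfer range quoted above can be related to scattering angles through the definition of s as 4π sin θ divided by the wavelength, with 2θ the scattering angle. The following sketch converts between the two; the function names are hypothetical, and the wavelength is assumed to be expressed in nanometers.

```python
import math


def momentum_transfer(two_theta_deg, wavelength_nm):
    """s = 4 pi sin(theta) / lambda, in nm^-1, for a scattering angle 2 theta."""
    theta = math.radians(two_theta_deg) / 2.0
    return 4.0 * math.pi * math.sin(theta) / wavelength_nm


def scattering_angle(s, wavelength_nm):
    """Scattering angle 2 theta in degrees for a momentum transfer s (nm^-1)."""
    return 2.0 * math.degrees(math.asin(s * wavelength_nm / (4.0 * math.pi)))


# Scattering angles at the two ends of a 0.1-5.2 nm^-1 range,
# assuming a wavelength of 0.15 nm.
angle_min = scattering_angle(0.1, 0.15)
angle_max = scattering_angle(5.2, 0.15)
```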
RESULTS AND DISCUSSION
Computer programs and testing
The programs CREDO, CHADD, GLOOPY, and CHARGE all run on IBM
PC-compatible machines under Windows 9x/NT/2000/XP and Linux as well as on major Unix platforms. To reduce the time required for
computations, the model scattering intensity and the penalties are not
recomputed after each modification of the structure but rather updated
as previously described (Svergun et al., 2001a). All the programs are able to take into account particle symmetry by generating symmetry mates for the residues in the asymmetric unit (point groups P2 to P6 and P222 to P62 are supported). The programs were tested on simulated examples to adjust the parameters of the SA procedures. The values $T_0 = 10^{-3}$, $N_T$ = 5000 $N_D$, and α = 0.9 were found to ensure convergence. The default values of the penalty weights for different algorithms are summarized in Table 1.
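Generating symmetry mates for a cyclic point group amounts to rotating each residue of the asymmetric unit about the symmetry axis by multiples of 360°/n. The sketch below does this for an n-fold axis; placing the axis along z and the function name are assumptions made for illustration.

```python
import math


def symmetry_mates(point, n_fold):
    """Positions related to `point` by an n-fold rotation axis along z.

    The first entry is the original point; the axis direction is an
    assumption of this sketch.
    """
    x, y, z = point
    mates = []
    for k in range(n_fold):
        a = 2.0 * math.pi * k / n_fold
        mates.append((x * math.cos(a) - y * math.sin(a),
                      x * math.sin(a) + y * math.cos(a),
                      z))
    return mates


dimer = symmetry_mates((1.0, 2.0, 3.0), 2)
```

For P2 symmetry each residue thus has a single mate obtained by a 180° rotation, so only half of a homodimer needs to be treated as variable.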
Method validation using simulated examples
To validate reconstruction procedures, a simulated fusion protein
was constructed using the crystallographic coordinates of hen egg-white
lysozyme (129 residues, PDB file 6lyz) (Diamond, 1974) as the
N-terminal domain, with bovine pancreas trypsin inhibitor (BPTI; 58 residues, PDB entry 4pti) (Marquart et al., 1983) fused to the
C-terminus of the former protein (Fig. 4, a and b). The theoretical scattering curve of the
fusion protein was computed using CRYSOL (Fig. 5, curve 1) and was
then used to reconstruct the structure of the BPTI domain assuming that
the lysozyme structure is known. The program CREDO was used to fit DR
models to the simulated data yielding a reasonable representation of
the overall shape of the BPTI. However, in some cases the BPTI domain
was not oriented next to the C-terminus of lysozyme as in the original
simulated fusion protein (see typical example in Fig. 4 a).
This is not surprising given that information about the interface
between the proteins is missing in the CREDO reconstruction. It is
interesting to note that even though the location of the interface is
incorrect, the overall low-resolution structure of the restored models
after appropriate rotation and translation agrees well with that of the
simulated fusion protein (Fig. 4 a). To improve relative
domain orientation, we used the program CHADD, which explicitly uses information about the location of the interface. In the example presented here, the C-terminus of lysozyme was identified as the fusion
point. The shapes of the resulting added domains obtained in
independent runs of CHADD were consistent with the crystal structure of
BPTI, although their positions varied by 0.3-0.5 nm. Fig. 4
b presents an averaged result of 10 independent runs, which
predicts the correct position and shape of the BPTI domain fairly well.
To validate the loop reconstruction procedures, several lysozyme models
were made containing deletions in the following regions: 1) residues
120-129 located at the C-terminus, 2) residues 1-15 containing an
α-helix located at the N-terminus, and 3) residues 40-55 containing
a β-sheet located on the surface of the structure. First, the
theoretical scattering pattern of the intact protein was calculated
using CRYSOL. Using this scattering pattern and the coordinates of each
deletion model, missing loop regions were reconstructed using the
program GLOOPY. In all cases, theoretical scattering curves of the
reconstructed proteins, obtained after addition of the missing loop
regions, gave good fits to the simulated scattering pattern of the
intact protein (Fig. 5, curves 2-4). When compared with the Cα coordinates of the crystal structure of lysozyme, typical restored models (Fig. 4, c-e) have an overall RMSD equal to 0.17, 0.24, and 0.25 nm for deletions 1, 2, and 3, respectively. For comparison, generation of the missing fragments as random-walk self-avoiding chains yields average RMSD values of 0.37, 0.53, and 0.51 nm, respectively. Use of the program CHARGE for deletion 2 forces residues 5-15 to form an α-helix, thus further reducing the RMSD (to ~0.15-0.2 nm; results not shown).
Conformational mobility in small loops/domains
The failure to observe structural elements in electron density maps arising from protein crystal structures is often due to conformational mobility or heterogeneity. The application of reconstruction methods offers the possibility of constructing a model for the missing loops or domains both in terms of their structure and their position in three-dimensional space. Two examples are presented below that illustrate these concepts.
In the first example, a truncated form of the Drosophila
motor protein ncd was studied using SAXS (Svergun et al., 2001b). The
native ncd protein is 700 residues in length. A construct named MC6 was
made that expresses the C-terminal 368 residues (M333-K700) of ncd.
This construct appears to be monomeric in solution as it lacks an
N-terminal coiled-coil region (residues 196-347) that would otherwise
mediate dimerization. Using crystallographic coordinates of an ncd variant (PDB entry 1cz7) (Kozielski et al., 1999), a partial three-dimensional model of MC6 was produced (Svergun et al., 2001b). This model lacked the 33 C-terminal residues absent in the crystal structure. The scattering curve computed from the MC6 model fails to fit the scattering pattern of the protein in solution (Fig. 6, curve 1; χ = 1.98). Addition of the missing loop using the programs GLOOPY and CHARGE significantly improved the fit (χ = 0.89). The loop conformations yielded by the programs in different independent reconstructions are similar to each other, suggesting a fan-like manifold of orientations (Fig. 7 a). In an earlier study using trial secondary structure motifs, this region was modeled as an antiparallel two-stranded β-sheet (Svergun et al., 2001b). The conformation of this tentative model fits within the plane of the fan and is of a similar length to the conformations provided by GLOOPY and CHARGE. These results suggest that the loop is flexible in solution, moving predominantly in the plane of the fan.
The second example illustrates the use of information about secondary
structure for the reconstruction of a missing loop. Specifically, the
crystallographic model of a homodimeric protein R2 of ribonucleotide
reductase from E. coli (PDB entry 1xik; molecular mass = 79 kDa) (Logan et al., 1996
) was solved to 1.7-Å resolution
containing 341 residues per monomer. The C-terminal 35 residues are
missing in the crystal structure, and the scattering curve computed
from the crystallographic model displays small but significant
systematic deviations from the experimental data (χ = 1.30; Fig.
6, curve 2 and inset; S. Kuprin,
Karolinska Institute, Stockholm, Sweden, personal communication,
1998). According to secondary structure prediction programs
(Cuff and Barton, 1999, 2000; Cuff et al., 1998), a major portion of
the missing fragment (residues 345-373) is predicted to form an
α-helix. Fig. 7 b shows the position of a typical
reconstruction of the fragment using the program CHARGE, which gives a
significant improvement in the fit to the experimental data
(χ = 1.07). The result suggests that the α-helix from each monomer subunit
extends away from the core structure of the protein to produce a
biantennary structure in the dimer. This structure is likely to occupy
a number of conformations, which is consistent with the lack of
interpretable electron density in the original crystal structure (Logan
et al., 1996).
GST-fusion protein domains
The pGEX series of vectors (Amersham) (Smith and Johnson, 1988)
are designed to enable inducible, high-level intracellular expression
of genes as fusions with the Schistosoma japonicum GST, a
26-kDa protein forming homodimers in solution. Crystal structures are
available for GSTs from a number of sources (Ji et al., 1992; Parker et
al., 1990), including recombinant S. japonicum GST purified
from pGEX-3X (Amersham) (McTigue et al., 1995). This recombinant
S. japonicum GST contains an extra 13 residues at the
C-terminus compared with the native S. japonicum GST, but this linker peptide is absent in the PDB entry (1gta) (McTigue et al.,
1995). GST was expressed and purified for SAXS data collection and
analysis from a similar plasmid in the pGEX series, pGEX-5X-1 (Amersham).
Compared with the native S. japonicum GST, this GST has an
extra 22 residues at the C-terminus. The x-ray solution scattering
pattern from the latter protein is presented in Fig. 6. The scattering
curve computed from a homodimer built from the crystallographic
structure of GST lacking the C-terminal residues yields a poor fit to
the experimental data (Fig. 6, curve 3;
χ = 1.30). GLOOPY was used to model the missing linker, assuming P2
symmetry for the entire structure. Several independent runs produced
similar extended conformations of the modeled linker (a typical result
is presented in Fig. 7 c). The theoretical scattering of the
GST crystal structure combined with this modeled linker gave a
significant improvement in the fit to the experimental data, with
χ = 0.81.
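The P2 symmetry assumption means that only one copy of the linker has to be modeled; the second copy follows from a twofold rotation. A minimal Python sketch of this step is given below. It assumes that the twofold axis coincides with the z axis through the origin, which is a choice made here for illustration; the function names are hypothetical and do not correspond to GLOOPY routines.

```python
def twofold_mate(coords):
    """Rotate points by 180 degrees about the z axis: (x, y, z) -> (-x, -y, z)."""
    return [(-x, -y, z) for x, y, z in coords]

def build_symmetric_model(fragment):
    """Return the modeled fragment together with its symmetry-related copy.

    fragment: list of (x, y, z) residue positions for one monomer,
    expressed in a frame where the dimer axis is the z axis (assumption).
    """
    fragment = list(fragment)
    return fragment + twofold_mate(fragment)
```

Applying the operation twice returns the original coordinates, as expected for a twofold axis, so the search only has to explore the conformations of a single chain.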
SAXS analysis was also performed on a GST fusion protein. The
folA gene, encoding dihydrofolate reductase (DHFR) from E. coli K-12, was cloned into the same vector, pGEX-5X-1, from which
GST was expressed. This enabled production of a fusion protein,
GST-DHFR, consisting of the 218 residues of the S. japonicum
GST followed by a 10-residue linker containing the Factor Xa cleavage
site and 158 residues of E. coli K-12 DHFR. Fig. 6
(curve 4) shows the experimental scattering
pattern of GST-DHFR. Models of the dimeric fusion protein were built
using the programs CREDO and CHADD by fixing the structure of dimeric
GST and then adding the linker and the DHFR domain as a variable part
and assuming P2 symmetry for the entire complex. Both programs gave
good fits to the experimental data, with χ = 1.02 (Fig. 6,
curve 4). The shape and position of the missing
domain for each model reconstructed by the two methods were consistent
with each other and also with the crystallographic model of DHFR (PDB
entry 1ra9) (Sawaya and Kraut, 1997), as illustrated in Fig. 7 d.
It would appear from comparison of the linker regions of
GST and GST-DHFR that the configuration of the linker is different in
each case (cf. Fig. 7 c). Attempts to cleave the linker
region of GST-DHFR using Factor Xa were unsuccessful, even when the wt
% was increased above 1% and the incubation time was >16 h. Even in
the presence of 0.05% SDS, the percentage of protein cleaved was
minimal. Resistance to proteolytic cleavage may well be due to a more
compact conformation of the linker region in GST-DHFR and/or steric
hindrance causing the Factor Xa cleavage site to be inaccessible.
Although there are no crystal structures in the PDB of proteins >40
residues fused to GST, comparison of the structures of two GST fusions (Fig. 8, a and b),
PDB entries 1gne (Lim et al., 1994) and 1bg5 (Zhang et al., 1998), with
the model of GST-DHFR shows that there appears to be a similar
orientation of GST and its fusion partner in all three cases.
| |
CONCLUSION |
|---|
|
|
|---|
To summarize, four algorithms have been written to provide an
appropriate tool for each of the various situations in which a
structure lacks a loop or domain. The choice of method depends on the
information available regarding the known part of the model, the
missing fragment, and the interface. If a low-resolution model of the
known part is available (e.g., from electron microscopy or from SAXS by
ab initio methods (Svergun, 1999; Svergun et al., 2001a)), the location
of the interface is usually unknown and the missing fragment can be
added using the program CREDO. In this case, the result is a
low-resolution model of the domain structure of the complex. For
high-resolution models, the programs CHADD and GLOOPY can build missing
loops and domains attached to specific residue(s). Furthermore, GLOOPY
tries to construct native-like folds by accounting for excluded volumes
of side chains, hydrophobic interactions, knowledge-based potentials,
and the Cα bond and dihedral angles. If the
secondary structure of the missing portion is known, the program CHARGE
allows additional constraints to be applied to the model by
incorporating α-helices and/or β-sheets in the variable fragment.
As the model of an interconnected Cα chain used
by CHARGE is less flexible than a free gas of residues implemented in
the other programs, CHARGE is better suited to reconstruct missing
loops rather than missing domains. The main features and possible
applications of the four algorithms are summarized in Table 1 and Fig.
8.
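The selection logic described in this paragraph can be paraphrased as a small decision function. The Python sketch below is only a restatement of the text under simplifying assumptions: the function name and its boolean arguments are hypothetical, and excluding CHARGE for whole domains turns the stated preference ("better suited to reconstruct missing loops") into a hard rule for brevity.

```python
def choose_programs(interface_known,
                    secondary_structure_known=False,
                    fragment_is_domain=False):
    """Suggest candidate programs for adding a missing fragment.

    interface_known: True for a high-resolution model in which the
        attachment residue(s) of the missing fragment are known.
    secondary_structure_known: True if helices/sheets of the fragment
        are known or confidently predicted.
    fragment_is_domain: True for a whole domain, False for a loop.
    """
    if not interface_known:
        # Low-resolution known part, interface location not available.
        return ["CREDO"]
    candidates = ["CHADD", "GLOOPY"]
    if secondary_structure_known and not fragment_is_domain:
        # Simplification: CHARGE is offered for loops only.
        candidates.append("CHARGE")
    return candidates
```

For the R2 example above (known attachment point, predicted helix, missing loop), this function would list CHARGE among the candidates.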
Even though the programs CHADD, GLOOPY, and CHARGE yield the missing
fragments in the form of folded Cα chains,
these should be considered as approximate models only. Solution
scattering, being a low-resolution method, does not provide an exact
fold but rather a probable configuration of the volume occupied by the
missing portion. In all of the above algorithms, scattering from the
model is computed using Eqs. 1-3, which do not explicitly take
averaging over possible different conformations of flexible loops or
terminal fragments into account. Such an average would not
significantly influence the results given the low resolution of the
scattering data but would require much longer computation times. The
methods aim to obtain a single equivalent conformation of the
missing domain, and it is also useful to analyze the results of several
independent SA runs to generate averaged probability maps. This
analysis allows refinement of the shape and position of missing domains
(see Fig. 4, a and b) and better visualization of
regions occupied by the missing loops (Fig. 7 a). When using CHARGE, care must be taken not to restrict the model too much based on
secondary structure predictions (which generally are no better than
70% accurate). In the above example for R2, all major techniques
predicted an α-helix with high probability, which made it possible to
use a long helical fragment for constructing the model in Fig. 7 b.
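One simple way to combine several independent SA runs into an averaged probability map is to count, for each cell of a coarse grid, the fraction of runs that place at least one residue there. The Python sketch below illustrates this idea only; the text does not specify how the maps were computed, and the function name, the cubic grid, and the 5-Å default cell size are assumptions.

```python
from collections import Counter

def occupancy_map(runs, cell=5.0):
    """Fraction of runs occupying each grid cell.

    runs: list of models, each a list of (x, y, z) residue positions in Å,
        all expressed in the same frame as the fixed known structure.
    cell: edge length of the cubic grid cell in Å (assumed value).
    Returns a dict mapping integer cell indices (i, j, k) to a fraction
    between 0 and 1.
    """
    counts = Counter()
    for model in runs:
        # A set ensures each run contributes at most once per cell.
        occupied = {tuple(int(c // cell) for c in xyz) for xyz in model}
        counts.update(occupied)
    n_runs = len(runs)
    return {index: k / n_runs for index, k in counts.items()}
```

Cells with a fraction close to 1 mark regions that are populated in nearly every reconstruction, whereas low values indicate positions that vary from run to run.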
Missing loop residues can be added to known high-resolution structures
using homology modeling (Mendelson and Morris, 1997; Perera et al.,
2000). In general, the reliability of structures produced using
homology modeling is high for short loops but decreases as the length
of the fragment to be added is increased. Using solution scattering,
the situation is precisely the opposite: the larger the missing
fragment, the more significant its contribution to the entire
scattering pattern, and the missing residues can be modeled more
reliably. In practical terms, one can expect the methods presented here
to be useful for missing fragments consisting of ~5-10% of the
entire structure (20-40 residues for a 50-kDa protein) and higher. For
shorter loops, homology modeling may be sufficient; however, solution
scattering data can be used as an additional restraint (in particular,
for rigid body refinement of the orientation of the fragment to be
added) and can also be used for validation of the final model (see,
e.g., Zheng and Doniach, 2002). Moreover, it should be stressed that
the methods presented are not limited to amending crystallographic
models with disordered loops but are also applicable to the addition of
missing domains to low-resolution models and to fusion proteins,
especially when no crystals are available.
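The rule of thumb given above (~5-10% of the structure, i.e., 20-40 residues for a 50-kDa protein) can be reproduced with a rough estimate. The Python sketch below assumes an average residue mass of 110 Da, a commonly used approximation that is not stated in the text; the function name is hypothetical.

```python
AVERAGE_RESIDUE_MASS_DA = 110.0  # assumption: typical mean mass per residue

def fragment_size_range(mass_kda, low=0.05, high=0.10):
    """Approximate residue range for which the methods are expected to help.

    mass_kda: molecular mass of the entire structure in kDa.
    low, high: fractions of the structure (5-10% as quoted in the text).
    """
    n_residues = mass_kda * 1000.0 / AVERAGE_RESIDUE_MASS_DA
    return round(low * n_residues), round(high * n_residues)
```

For a 50-kDa protein this estimate gives about 23 to 45 residues, in line with the approximate range quoted in the text.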
In the model systems presented here, experimental SAXS data have allowed
the reconstruction of both missing loops and domains, providing a
structural description of disordered regions. The reconstructions are
based on the experimental data of ~1.2-nm resolution, but the actual
resolution of the models may be higher because of the additional
information used. In particular, histogram and angular penalties (Figs.
2 and 3) ensure adequate behavior of the model scattering curves at
higher momentum transfers. In the case of the Drosophila
motor and R2 ribonucleotide reductase proteins, modeling predicts
extended structures protruding from the surface of the globular core. Such
structures could indeed show large flexibility in solution, which may
explain why the regions could not be modeled from the crystallographic
electron density maps. Modeling studies of GST expressed from the pGEX
system give a description of the linker region, which was not visible
in the original crystal structure (McTigue et al., 1995). The model of GST-DHFR also provides a visualization of how such fusions appear in
solution. In particular, the linker region appears to adequately separate the two globular domains, suggesting that GST does not, per se,
directly influence folding of its partner in protein-protein interactions. In addition, the model shows how the fused protein (i.e.,
DHFR) can occlude the linker, resulting in resistance to protease
digestion in this case. Taken together, these examples demonstrate how
such reconstruction methods using SAXS data have the potential to add
missing fragments to available high- or low-resolution protein models.
Indeed, as three-dimensional structural information from larger
multi-protein complexes emerges, the true potential of these techniques
may be realized for modeling both domains and interfaces responsible
for macromolecular assembly where inherent flexibility and
conformational heterogeneity limit high-resolution visualization.
Modeling using the protein structure representation as an ensemble of
residues could also become useful for the interpretation of
low-resolution crystallographic maps (Guo et al., 1999).
The executable codes of the programs CREDO, CHADD, GLOOPY, and CHARGE
are available as Wintel
-releases from the EMBL-Hamburg website
(http://www.embl-hamburg.de/ExternalInfo/Research/Sa