| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Biophys J, August 2002, p. 619-632, Vol. 83, No. 2
and
*Biocomputing Unit, Centro Nacional de Biotecnologia
(CSIC), Campus UAM, Cantoblanco, 28049 Madrid, Spain; and
San Diego Supercomputer Center, University of
California at San Diego, La Jolla, California 92093-0505 USA
| |
ABSTRACT |
|---|
|
|
|---|
In the present work we develop an efficient way of representing the geometry and topology of volumetric datasets of biological structures from medium to low resolution, aiming at storing and querying them in a database framework. We make use of a new vector quantization algorithm to select the points within the macromolecule that best approximate the probability density function of the original volume data. Connectivity among points is obtained with the use of the alpha shapes theory. This novel data representation has a number of interesting characteristics, such as 1) it allows us to automatically segment and quantify a number of important structural features from low-resolution maps, such as cavities and channels, opening the possibility of querying large collections of maps on the basis of these quantitative structural features; 2) it provides a compact representation in terms of size; 3) it contains a subset of three-dimensional points that optimally quantify the densities of medium resolution data; and 4) a general model of the geometry and topology of the macromolecule (as opposite to a spatially unrelated bunch of voxels) is easily obtained by the use of the alpha shapes theory.
| |
INTRODUCTION |
|---|
|
|
|---|
The study of three-dimensional (3D) biomolecular structures is central to the understanding of molecular biology. In recent years we have witnessed a steady growth in the number of biomolecular structures resolved by scientists and published for the use of the general scientific community. It is easy to speculate that with rapidly growing interest and funding initiatives in structural genomics (Structural Genomics, special issue of Nat. Struct. Biol., November, 2000), we will soon witness a dramatic increase in the amount of available structural data. While managing large amounts of 3D structural information is a challenge in its own right, the problem is further compounded by the ever-increasing complexity of the structures themselves. It is imperative to develop more sophisticated techniques to analyze, represent, store, and search for complex, 3D structural information.
Among the wide variety of efforts undertaken by different research
groups to manage and maintain databases of 3D structures, the Protein
Data Bank/Macromolecular Structural Database (PDB/MSD) (Sussman
et al., 1998
; Keller et al., 1998
; Berman
et al., 2000
) holds a distinctive place. PDB/MSD was designed
to store and manage 3D structures of proteins solved at atomic
resolution by x-ray crystallography or nuclear magnetic resonance
(NMR). The utility of creating such a database is evident from the
variety of scientific research it has inspired. Artymiuk
(1992)
, Holm and Sander (1993)
, Shindialov and Bourne (1998)
, Westhead et al.
(1999)
and many others have used the system to investigate
structural similarities, biochemical properties, and sequence-derived
information. Also, Edelsbrunner et al. (1998a)
,
Norel et al. (1999)
, Morris et al. (2000)
and others have studied ligand-protein interactions including docking
analysis, focusing on structural information related to the shape, and
geometry of the macromolecules.
From the biological perspective we are interested in the detailed
characterization of macromolecular surfaces and topologies. However,
rather than atomic resolution data as found in the PDB, we focus on
medium resolution data produced by other techniques, in particular
cryomicroscopy. Indeed, recently, a new approach combining
cryomicroscopy with image processing tools has been introduced among
the range of quantitative techniques for the solution of 3D structures
of macromolecular complexes (Baumeister and Steven,
2000
). The strength of this experimental approach is that it
does not require specimens to be arranged into crystal lattices (as is
the case for x-ray diffraction at atomic resolution) or to be below a
given molecular weight (as is the case for NMR). A shortcoming is that
it provides structural information mostly in the resolution range
between 0.7 nm and 2.5 nm, reaching atomic resolution only in some
exceptional circumstances, as for the recently solved structure of
AQP-1 (Murata et al., 2000
). It has been recognized
(Grimes et al., 1999
; Kalko et al., 2000
;
Bohm et al., 2000
; Baumeister and Steven,
2000
) that the lower-resolution structures obtained from these
techniques nicely complement atomic resolution data. Efforts are
underway to fit atomic resolution data from x-ray and NMR into larger
structures solved by cryomicroscopy, much like fitting pieces into a
large puzzle (Volkmann and Hanein, 1999
; Wriggers
and Birmanns, 2001
). A new initiative, called the IIMS project
(integrating information on macromolecular structures), coordinated by
the European Bioinformatics Institute (Keller et al.,
1998
), aims at integrating these complementary pieces from different resolution ranges of information into interrelated databases.
Nevertheless, there is an important technical challenge at the database level in integrating atomic resolution and lower resolution information: medium resolution data have an intrinsically different representation compared to atomic resolution data. Atomic resolution data represent the precise coordinates of atoms forming the structure. In contrast, datasets at medium resolution are presented as density maps defined over a 3D discrete grid in which each point (voxel) has an associated density value. In addition, because medium resolution data resolve larger structures, the typical size of a medium resolution dataset (the number of voxels) may be quite large, and need more complex data management to ensure efficient storage, access, and manipulation.
The goal of the current paper is to develop an efficient way of representing low- and medium-resolution volumetric data in a database. This novel data representation has a number of interesting characteristics: 1) it provides a compact representation in terms of size; 2) it contains a subset of 3D points that optimally approximates the probability density function of the voxel values in a maximum likelihood sense; 3) it makes use of the well-established theory of alpha shapes to represent the geometry and topology of the macromolecule (as opposed to a spatially uncorrelated set of voxels); and 4) it allows automatically segmenting and measuring a number of interesting structural characteristics such as passing channels, cavities, and voids, which can be used for posing interesting queries on substructures of the whole macromolecule.
The proposed representation models the atomic cryomicroscopy data in
terms of a new quantitative technique called the kernel c-means
proposed in Pascual-Montano et al., 2001
. Combining this clustering technique with the theory of alpha shapes
(Edelsbrunner and Mucke, 1994
), we effectively
approximate the overall density distribution of the actual molecule
with a relatively small number of pseudo atoms from which an alpha
complex is built. Once the alpha complex is defined, we can derive a
number of auxiliary metrics that characterize geometry, topology, and
connectivity properties of the entire molecule, which in turn can be
used to search for other molecules having similar geometric properties.
The rest of the paper is organized as follows. In the next section we describe a mathematical technique to represent a 3D volume using a reduced set of points such that their overall probability density function approximates the density profile of the original volume. We refer to this reduced set of points as the "pseudo-atom set" or simply "pseudo-atoms." Subsequent sections provide a brief introduction to the theory of alpha shapes and introduce an algorithm that starts with 3D density maps from electron microscopy, extracts the complete data structure of pseudo-atoms, and builds the alpha complex. A number of interesting structural characteristics that can be derived from our new pseudo-atom-based representation are also provided. Experimental results with both synthetic and real 3D medium resolution density maps are then shown. We will show how the topology and shape of the original biomolecule are indeed preserved in the proposed representation. The experimental results also demonstrate that the representation facilitates automatic and efficient detection and measurement of structural features such as cavities and channels. Finally, we conclude with a summary of our primary results and present a discussion of the new research avenues opened by this work.
| |
VECTOR QUANTIZATION OF 3D LOW-RESOLUTION DENSITY MAPS |
|---|
|
|
|---|
In this section we present a technique to sample the original
density map with a faithful approximation, one that uses a reduced number of points but still retains the overall shape of the original density map. This class of techniques, often called vector
quantization, has been widely used in the literature (Gersho and
Gray, 1992
) for dimensionality reduction. In the domain of
cryomicroscopy, (Wriggers et al., 1998
,
1999
) have used a vector
quantization technique that preserves the geometric relationship
between the original and quantized version of the data for the task of
docking atomic resolution structures into medium resolution maps. More generally, any vector quantization technique selects a specific property of the input data that is optimally preserved in the approximate version.
In this paper we propose a novel neural network method based on a cost
function explicitly designed to compute a set of representative vectors
whose probability density closely resembles the probability density of
the input data. Probability density function (pdf) is a fundamental
concept in statistics. It gives a natural description of the
distribution of a continuous random variable in a given interval. Our
choice to optimally preserve the pdf guarantees that the shape (i.e.,
the spatial distribution) of the alpha complex closely approximates
that of the original volume. The following section will briefly
describe the method. A more detailed description can be found in
Pascual-Montano et al., 2001
.
Kernel probability density estimator clustering technique (kernel c-means)
A natural way for quantizing a given data space is to partition the space into groups (or clusters) in such a way that all data points belonging to any group can be replaced by a representative data point (code vector) for that group. The task of the quantization method is to estimate the groups from the input data. One of the most widely used methods is based on a distance criterion where the representative data items are selected in such a way that the distance from each datum to its closest representative item is minimal (intracluster distance) and the distances among groups or representatives of each group are maximal (intercluster distance). In other words, the aim is to find compact and well-separated clusters in the data. While it is generally true that the set of representative data items (called code vectors) produced by vector quantization techniques tend to approximate the probability density function of the input data, our technique is designed to ensure that the difference between the probability density functions of the actual and the approximate versions of the data is explicitly minimized.
Density estimation is the construction of an estimate of the pdf from
the observed data. Kernel estimators have been widely studied for
density estimation, and abundant literature on this topic is already
available (Parzen, 1962
; Bezdek, 1981
;
Silverman, 1986
), so they will only be briefly
introduced here.
Let Xi
p·1,
i = 1 ... n denote the data items
(represented as n real-valued column vectors of dimension
p); let X
p·1 denote a
variable. The kernel probability density estimator can be formally
described as:
|
(1) |
> 0 the kernel
width parameter that controls the smoothness of the estimated density.
Intuitively, the kernel estimator can be seen as a sum of "bumps"
placed at the observations (data). The kernel function K determines the shape of the bumps while the parameter
determines the width. A commonly used kernel function is the Gaussian kernel:
|
(2) |
The new vector quantization technique
Given n data items of dimension p, Xi
p·1, with
i=1...n, the problem is to find c
surrogate "representative" data items (code vectors)
Vj
p·1, with
j=1...c, such that the estimated density:
|
(3) |
3·1 represent the voxels of the EM volume under
analysis, properly selected to represent their density, and the
resulting code vectors Vj
3·1 will be the "pseudo-atoms" or representative
data items that are going to be used in subsequent steps of the
proposed methodology. Note that the dimension p of the
vectors is 3 in this application (voxels' coordinates).
In general, let D (X;
) be the probability
density for the random variable X, where
is some unknown
parameter vector. Let Xi
p·1, i=1...n, denote the data
items, then:
|
(4) |
is obtained by maximizing Eq. 4. Note that in this
case the parameter vector is composed by the code vectors Vj
p·1 and the kernel
width
, i.e.,
= {{Vj},
}.
From Eq. 3 and 4, the log-likelihood is:
|
(5) |
that maximize Eq. 5. This new method attempts to achieve a vector
quantization in such a way that the generated code vectors tend to have
the same statistical distribution of the original dataset. In other words, the location of the generated code vectors are estimated such that their estimated probability density is as similar as possible
to that of the original data, achieving both goals: vector quantization
and probability density estimation.
The following algorithm solves the previously stated problem:
| 1. | Given the input data Xi 3·1 (input volume voxels), i = 1...n; and given a number of "pseudo-atoms" c;
|
| 2. | Initialize uji c·n, for i = 1...n and j = 1...c, satisfying the following constraints: ![]() |
| 3. | For j = 1...c, compute the new values for the pseudo-atoms Vi by using the following equation: ![]() |
| 4. | Compute ![]() ![]() |
| 5. | For i = 1...n and j = 1...c, compute uji by using the equation: ![]() |
| 6. | Go to step 3 until convergence. |
|
| |
ALPHA SHAPES: CHARACTERIZING TOPOLOGY, SHAPE, AND CAVITIES |
|---|
|
|
|---|
Until now, we have dealt only with density value distributions and ways to sample the input volume into a reduced point set that approximates the original voxel density pdf. However, geometric properties (shape) are not defined for a set of points, although they are essential to understand a broad range of key characteristics of biological macromolecules. Therefore, a further step to obtain the connectivity between these points (i.e., define their shape) is needed.
A number of issues arise along the process of coding topology and shape-related properties from EM medium resolution datasets that make this task especially complicated. Here we will concentrate on two of them.
First, most approaches that treat geometric information require the
definition of some form of molecular surface. There are a number of
well-known molecular surface models for structures at atomic levels of
resolution (Connolly, 1983a
,b
; Connolly et al.,
1996
), primarily because the very nature of
high-resolution data lends itself to the definition of theoretically
precise surface models. This is not true for medium resolution data,
where some segmentation algorithm has to be applied to extract the
actual contour of the 3D object, thus introducing an extra level of
variability and imprecision.
The second issue concerns the obvious fact that features across different resolution ranges may not be preserved. Important geometric characteristics such as clefts and channels may change their shape and size (or actually disappear) when data at atomic resolution and at medium resolution are compared. Therefore, resolution is a key parameter to be carefully considered when sets of volumetric data are compared. A more detailed study of this general phenomenon, particularized to a selected set of macromolecular features, is underway.
We start by first segmenting the contour that delineates the actual 3D
object from its background by the use of Deriche filtering (Deriche, 1987
). Deriche's filter is a variation of
Canny's filter (Canny, 1986
) that lessens its heavy
computational load by using a fast recursive implementation. As
Canny's filter does, it extracts the contour of the image from
gradient information along the three axes (x, y, z). This
voxel-based description of the contour of the original object will be
taken as the reference for shape.
At this stage of the analysis we have, on one hand, both the original
volume of density voxel values and its 3D contour, and on the other
hand, a collection of points coming from the density pdf sampling of
the original volumetric data. Then, it is necessary to construct a 3D
shape using the set of points derived from the pdf sampling. We use
alpha shapes theory (Edelsbrunner and Mucke, 1994
), a
generalization of the convex hull of a point set, to address this
issue. Using this theory, it is possible to associate a family of
shapes to a finite point set in an n-dimensional Euclidean space. Each shape is a well-defined polytope (an
n-dimensional solid with flat faces) derived from the
Delaunay triangulation of the point set, with a parameter
controlling the desired level of detail.
Alpha shapes theory emerged as a powerful geometric concept that has
been successfully applied to the structural biology field
always for
datasets at atomic resolution
by a number of authors
(Edelsbrunner et al., 1995
, 1998b
; Peters et al., 1996
). The
formal definition of shape and topology provided by alpha shapes makes
possible, among other tasks, to efficiently analyze different molecular surface models, and to precisely locate and measure the volume of
cavities, voids, and channels (Edelsbrunner et al.,
1995
). In the following paragraphs we will provide a brief
summary of the alpha shapes theory. For further study, we refer the
reader to (Edelsbrunner and Mucke, 1994
;
Edelsbrunner et al., 1995
; 1998b
, and Akkiraju et al., 1996
.
Given a weighted point P = (p,
wp) where p
n, the power distance from a point
x
n to P is defined
as

These concepts are not new and have been extensively used in structural
biology to derive measures for several surface models (Connolly,
1983b
; Connolly et al., 1996
). They all build a
topological structure based on the regular triangulation of atoms as
weighted points. Every atom is considered as a ball B(p,
rvw)
3 characterized by its
position (center) and van der Waals radius (weight). Indeed, the
weighted Delaunay triangulation defined from the set of atoms of a
given molecule provides us with its underlying topological structure
(connectivity among atoms).
The alpha shapes theory extends all these ideas by introducing a new
growing parameter
. The reader should not be confused with the
parameter used in the vector quantization algorithm. Let's suppose
that all the atoms (balls) of this molecule start to grow
simultaneously by increasing their radii by the
parameter. Then,
every atom will be redefined as a ball
B
=(p, r
) with
radius

increases (see Fig. 2) the balls
gradually grow so that they will eventually overlap. At the moment when
the boundaries of two balls overlap, a new 1D simplex is added (an edge
in this case) to the simplicial complex for that particular value of
. Whenever three balls overlap, a triangle (2D simplex) is created, while a tetrahedron (3D simplex) is added for the case of four intersecting balls. The simplicial complex associated with an
value
is a subcomplex of the Delaunay complex and it is called the
alpha complex. The alpha shape is the part of
space occupied by simplices in the alpha complex. When
= 0 (zero-shape) the actual topological structure of the molecule is
obtained; however, when
tends to
the alpha complex is the
convex hull of the initial set. Simplices in an alpha complex for the
alpha value
1 also belong to that one for the alpha
value
2 with
1 <
2. That is, the alpha complex for a smaller value of alpha is always a
subcomplex of the one for a larger value of alpha, and both are
subcomplexes of the Delaunay complex (convex hull).
|
Relating the alpha complex to the weighted Voronoi decomposition of the molecule does the link with the actual molecule. The alpha complex encodes combinatorial, topological, and metric information about the molecule.
In this work alpha shapes theory will be applied to medium resolution density maps for which information about the atomic coordinates is not available. The basis of our usage of alpha shapes resides in the prior vector quantization procedure, which has provided us with the pdf-based sampling (pseudo-atoms set) from the original 3D image. Our method will proceed by using these pseudo-atoms in a way similar to the way real atomic positions have been used in the context of atomic resolution data. The initial value of the alpha parameter will be taken as the maximum of the quantization errors associated to each pseudo-atom, as described in detail in the next section. In this way, an alpha complex will be obtained as a representation that encodes both the geometric and topological properties of the medium resolution information.
Once the alpha complex has been obtained, the selection of those k-simplices (k = 0...3) that form voids and cavities of any number of mouths is straightforward, allowing the automatic detection and quantification of features such as deep clefts and channels. Other interesting structural features are also possible to detect (e.g., protrusions), although they will be further investigated in future work.
| |
ALPHA SHAPES AND 3D TEM DENSITY MAPS: MODELING GEOMETRY AND TOPOLOGY |
|---|
|
|
|---|
The general schema to derive our proposed volume data
representation is as follows:
| 1. | From the original volume, obtain a point set of "pseudo-atoms" that quantify the volume density. Associated to each pseudo-atom there is a "quantization error." The number of pseudo-atoms is initially arbitrary; |
| 2. | Construct the alpha complex associated to the set of pseudo-atoms, use the largest quantization error as the value for the alpha parameter; |
| 3. | Increase the number of initially computed pseudo-atoms such that the new alpha complex constructed on them captures the geometrical and topological features of the original volume in a well-defined manner and within a precise and user-controlled error limit; |
| 4. | Store the set of final pseudo-atoms and its associated alpha complex. They will constitute the data structure onto which all subsequent density, geometry, and topology analysis will be performed. |
This schema presents a number of algorithmic and mathematical issues. To start with, it is centerpiece the notion of connectivity among sampled points (pseudo-atoms). However, one of the alpha complexes must be selected among those that compose the whole alpha family. The alpha family is the set of all possible alpha complexes of the point set indexed by the alpha parameter. Thus, the selected alpha complex should be chosen such that it best preserves the topological and geometrical characteristics of the original macromolecule volume.
Fortunately, the information produced by the vector quantization procedure is not only the set of pseudo-atoms referred to so far, but also a measure of the vector quantization error associated to each pseudo-atom. The quantization error is the average distance from the pseudo-atom to the input vectors (original voxels) it represents. The maximum of such errors provide us with a good, yet conservative, value for the alpha parameter.
Additionally, to compute certain shape measures from the alpha complex
it is important to provide the capability to obtain a volumetric object
from the alpha complex; in other words, to produce a voxel-based volume
from the list of simplices that composes a particular alpha complex. In
this way, for each alpha complex it is possible to build a discrete
binary image in
3 that corresponds to the space occupied
by its simplices. That is, every simplex is "discretized" into a
set of voxels. A given voxel in the output volume will be assigned 1 as
long as it belongs to any vertex, line, triangle, or tetrahedron of the
alpha complex (and 0 otherwise). Thus, tetrahedra will approximate the
inner mass of the molecule and outer triangles will produce its
contour. We keep track of the real molecular dimensions as we know its spacing; i.e., the number of Angstroms that corresponds to each voxel.
The overall goodness
in terms of topological and geometric fidelity to
the original dataset
of the information contained in the data
structure will depend on two key parameters: 1) the cardinality of the
pseudo-atoms set. As it will be shown, the higher the number is, the
better accuracy is obtained in the voxel-based model rendered from the
alpha complex. 2) the particular choice of the
value for a fixed
number of pseudo-atoms. It must guarantee that the 3D shape and
topological structure obtained from the simplices of the alpha complex
are correct.
To reduce algorithmic complexity, we will fix the
value equal to
the maximum quantization error in the calculation of the pseudo-atoms.
Therefore, only the cardinality of the pseudo-atoms set will be
regarded as a free parameter to be optimized. The optimization is
carried out such that the resulting data representation should
simultaneously fulfill the following two constraints.
Constraints of the data representation
Topology preservation
In our case, the preservation of topology implies that the three Betti numbers
0,
1,
2 are
equal compared to the original volume. One single connected component
(
0) is obtained, and there exist as many cavities,
tunnels (
1), and voids (
2) as in the original dataset. These two conditions are calculated in a direct manner from the alpha complex (Delfinado and Edelsbrunner,
1995Shape preservation
A number of shape measures are calculated for the original dataset and the voxelized model obtained from the alpha complex. The similarity between the original dataset and the model for each of such measures must be greater than a given threshold. We have chosen from the literature up to six 3D shape features. These features are computed for both the original and the alpha shape-derived model. The rationale of this choice is to use different shape features as different ways of "viewing" the geometry of the molecule. Euclidean distance is used to compare them. All shape features but the last one (histogram of normals) are computed from the voxelized version of the alpha complex. The selected shape measures are classified as either boundary-based or region-based features, depending on whether the boundary or the area inside the boundary is coded. For the computation of some of these features it is necessary to define a new reference frame that will be centered at the object's centroid. Thus, we need to compute first the geometric center of the object and its principal axes of inertia. All these features operate on the spatial domain.Region-based measures
Ratio among principal axes of inertia (Galvez and Canton, 1993
0,
1,
2) be the three
eigenvalues sorted by value, so that
0 correspond to the
largest eigenvector and
2 to the smallest. The ratio
among them gives an idea of the flatness, elongation, or roundness of
the object. For instance, if
0 =
1 and
the ratio between
2 and
0 is large, then
the object is mainly flat. If
1
2 and the ratio between
0 and
1 is large, then the object is elongated. If the three
values are approximately equal, the object is mainly rounded. It is
easy to see that this measure is invariant to rotations and
translations (Lohman, 1998Boundary-based features
Image shape spectrum (Nastar, 1997
|
| |
ALGORITHM TO OBTAIN THE PROPOSED DATA REPRESENTATION |
|---|
|
|
|---|
With all the previous considerations, now we are ready to
introduce the actual algorithm used to derive the proposed data representation from a 3D density image V:
| A. | Obtain the binary-valued mask MK from the original density map V. | ||||||||||||||||||||||
| B. | Initialize n = n0 as the initial number of pseudo-atoms. | ||||||||||||||||||||||
|
Two additional remarks must be made. First, the initialization of the number of pseudo-atoms is certainly an important factor in terms of overall efficiency. The kernel c-means procedure is the most time-consuming task, and a good initial estimate is highly recommended. Our particular choice is to start with 10% of the points of the original dataset, with a step increment of 100 pseudo-atoms. This decision does not affect the quality of the final result, but it dramatically improves the speed of the overall procedure. Other heuristics are open to particular choices.
Second, in step B2b some pseudo-atoms were automatically "shifted" from their original positions. The rationale for this approach is that because the pseudo-atoms were selected to convey general density information, those pseudo-atoms that define the contour of the alpha complex will not exactly coincide with the real object's contour. Thus, the voxel-based volume rendered from the alpha complex is somehow "scaled down" with respect to the original dataset. The visual effect is that the contour of the alpha complex appears to be located underneath the actual molecular surface. Step B2b was designed as a heuristic approach that compensates the above-exposed effect whenever a rendering of the alpha complex is requested. It is important to note that our representation model is aimed at preserving geometric and topological features. In contrast, visual fidelity is a criterion of interest not for internal representation, but for visualization. In the following we provide a precise specification of this procedure designed to improve visual fidelity when requested.
First the subset Scontour
S
of pseudo-atoms that defines the contour of the alpha complex is
automatically extracted (by the use of the alpha shapes tools). Then,
for each pseudo-atom contained in Scontour,
we find out which is its nearest neighbor (following the point's
normal direction) located at the actual molecule's contour. In this
way, the new set S'contour is obtained with
the same cardinality of Scontour.
Connectivity relationships (topology) defined by the alpha complex
remain unaltered. Moreover, the shape of the alpha complex as
represented by the six similarity measures described previously is not
affected at the global or local level, as can be appreciated in Fig.
3. Whenever a visualization of the alpha
complex is requested, we render its simplices by using the pseudo-atoms
set S' = (S
Scontour)
S'contour. As a consequence, the set
S'contour needs to be stored in addition to
S.
|
Structural characteristics directly derivable from the alpha complex
The use of the alpha complex provides a powerful means to study a
number of interesting structural features of the object. Using alpha
shapes in conjunction with flow theory (Edelsbrunner et al.,
1998b
) it is possible to directly inspect channels of any
number of branches, pockets, and voids in the structures. In this way
an automatic detection of them is possible, even specifying arguments
related to the total volume or the area of the mouths, and then segment
them out of the structure for further processing.
| |
RESULTS |
|---|
|
|
|---|
Automatic detection and quantification of cavities and channels
The experimental results that we present in this section are aimed at showing that 1) the proposed data representation can be practically obtained with the algorithm presented in the previous section. Also, the representation is robust in the sense that several executions of the algorithm produce almost identical results; 2) this data representation keeps most of the density, topological, and geometric information of the original dataset in a compact manner; and 3) the information is now expressed in a way that allows for an efficient manipulation when inquiring for density, topology, and geometrical characteristics.
Without loss of generality, we will focus our attention on a concrete problem: the automatic detection and measurement of cavities and channels in macromolecules from volumetric datasets representing complex macromolecular structures at medium resolution. This issue is key in our work because it is the only way to accomplish large-scale objective studies. However, most of the tools that exist today for the analysis of medium- or low-resolution structures are manual. Two different datasets have been used in along this work: 1) a low-resolution volume density generated from the atomic resolution data of the bleomycin hydrolase yeast analog Gal-6, and (2) an experimental low-resolution map of the DnaBC complex obtained by cryoelectron microscopy.
The first experiment reported here is aimed at demonstrating the robustness and stability of the methodology. Given a particular dataset, several executions of the algorithm must produce almost identical results. This means that for every execution of the algorithm the positioning of the pseudo-atoms within the volume is stable and the shape, as represented by the alpha complex, is very similar among several executions.
Gal-6 is the yeast homolog of bleomycin hydrolase (Joshua-Tor et
al., 1995
), a cysteine protease that hydrolyzes the anticancer drug bleomycin. The structure of Gal-6 was solved by x-ray
crystallography at 2.2 Å of resolution (Fig.
4). Gal-6 presents a hexameric structure with a prominent central channel. The hexamer has overall dimensions of
125 × 115 × 85 Å. Because of the extensive dimer interaction, the hexamer may be considered as a trimer of dimers. The papain-like active sites are situated within the central channel. The size and
shape of this channel, together with the prominent positive electrostatic potential inside the channel, suggest that it represents the region of Gal-6 involved in DNA binding. Also, Gal-6 may undergo conformational changes upon DNA binding that would better allow accommodation of a DNA helix or hairpin in the channel. The atomic resolution data of bleomycin hydrolase Gal-6 were retrieved from the
PDB and a volumetric map at 20-Å resolution with a spacing of 3.409 Å/pixel was generated. The CCP4 suite of programs (Collaborative Computational Project, Number 4, 1994. The CCP4 Suite: Programs for
Protein Crystallography. Acta Cryst allogr. D. 50:760-763) was used for this task.
|
Following this filtering step, an accurate alpha shape representation of the volume was obtained by executing the algorithm presented above. The similarity threshold used within the algorithm requires that the new data representation had an average shape similarity with respect to the original volume above 95% (as measured by a combination of six shape features). We focus in the behavior of one feature, namely the cross-correlation coefficient (CCC) between the original volume and the voxelization of the alpha complex. It can be readily appreciated from Fig. 5 that the CCC follows a logarithmic-type curve, in which large changes of CCC happen when the number of pseudo-atoms is small, and then the changes become stable. Following the choice of threshold value specified above, the final number of pseudo-atoms required for such an accurate representation is 1500.
|
Stability of the vector quantization technique
Like many other vector quantization methods, algorithms of the type proposed here are known to be sensitive to initial conditions: the local extreme to which the algorithm converges depends on the choice of initial values for V (pseudo-atoms). In general, the average variability will depend on the shape of the 3D input density distribution and the number of vectors. To test the stability of the algorithm under different initialization conditions, we repeated the algorithm 10 times using different random initialization values for the code vectors. For the tests we used the bleomycin hydrolase dataset at 20 Å resolution with a spacing of 3.409 Å. To check the stability effect of an extreme case, we fitted 1500 code vectors and calculated the RMSE (root-mean-square error) for 10 statistically independent calculations. The stability test showed that the overall RMSE fluctuation of the code vectors is 1.79 Å. This result indicates that the position of the code vectors is very stable. We should emphasize that the vector quantization technique used in this study aims at preserving the pdf of the original data, which means that the algorithm will tend to position more code vectors in those zones with high-density values and fewer code vectors in zones with low-density values, so the separation of the code vectors can be very small in high-density areas of the volume. This fact explains why the stability of the code vector's positions alone is not sufficient to demonstrate the stability of the method. For that reason we also tested the stability of the model (estimated probability density) by calculating the log-likelihood of the estimated density for each independent run, obtaining a mean of 233970.8 and a standard deviation of 4.18, which indicates a very stable model.
In addition to the above tests, we also demonstrated that the shape
features coded in the alpha complex also remain stable along different
initializations of the pseudo-atom generation algorithm. To this end,
the alpha-shape A

5 range). Also, the similarity index with respect to
the original object (Table 1, bottom row) remains virtually the same,
with changes appearing only in the sixth decimal digit (in a scale from
0 to 1).
|
|
Similar results are shown in Table 2 for single-valued shape features. Indeed, the values of the features "size of the bounding box," "cross-correlation coefficient," and "eigenvalues of the inertia matrix" are very similar, with changes at most on the sixth decimal digit. Additionally, these values are consistently very close (within the second decimal digit at most) to those derived from the original low-resolution dataset.
Generation of the alpha complex representation from cryoelectron microscopy data
The second dataset used in this study corresponds to an
experimental low-resolution map of DnaB and its loading partner DnaC obtained by cryoelectron microscopy and image processing techniques. DnaB is the main replicative helicase of Escherichia coli.
In general, replicative helicases are motor proteins that unwind DNA at
replication forks. Strand separation in DNA duplexes is a key step in
many cellular events. Recently, the 3D low-resolution structure of
E. coli DnaB hexamer in complex with its loading partner,
DnaC, was obtained at 26 Å (Bárcena et al.,
2001
) (Fig. 6). The structure
presents a global toroidal shape, with a central inner channel that is
closed
at the resolution of this study
at one of the apical faces of
the volume. The maximum external diameter is 13.8 nm and the
reconstructed height is 12.4 nm.
|
This experimental low-resolution map was taken as input to the alpha shape representation generation algorithm presented in above. A number of alpha complexes were generated with an increasing number of pseudo-atoms, and a series of shape features were calculated on them. A gallery with these alpha complexes (constructed from 200, 600, 1000, and 1700 pseudo-atoms) is shown in Fig. 7. The final selected alpha shape representation corresponds to the one shown in Fig. 7 d, as this is the smallest (in terms of the number of pseudo-atoms) representation for which the average of all shape similarity indices with respect to the original low-resolution map is above 95%.
|
Note that the quality of surface representation used in Figs. 6 and 7 is different. While they visually differ, the reader should not mistakenly infer that there exists any difference in the data representation. In the first case, Fig. 6 is generated with a smooth rendering, while the second is a direct representation of the alpha complexes (triangular facets) that allow us to visualize the underlying alpha shape (without the additional step of smoothing for visualization).
It is interesting to study the evolution of the series of alpha complexes shown in Fig. 7, with supporting data in Tables 3 and 4. Table 3 presents a number of geometrical and computational parameters of the alpha complexes. It can be appreciated how the value of alpha decreases as the number of pseudo-atoms increases, which is a logical process considering that the alpha value was taken as the largest of the quantization errors in the generation of the pseudo-atoms. However, the complexity of the resulting alpha complexes increases (number of tetrahedra and triangles) as well as the total number of bytes needed to store the alpha complex. Regardless of this increase, the resulting representation is more compact than the original dataset because the final alpha complex requires about one-fourth of the space with respect to the original volume (a 64-cube coded at one byte per pixel).
|
|
As shown in Table 4, the general trend of all shape similarity measures presents a pattern of rapid increment followed by a slower phase of tuning. This fact is in accordance with the results previously obtained for Gal-6.
Automatic analysis of cavities, channels, and voids
Now, we focus on some advantages of handling the alpha complex instead of the original dataset. In particular, we provide the user with fully automatic extraction and analysis of cavities, channels, and voids that exist in the surface of the macromolecule.
Previously in this section it was noted that Gal-6 has an inner channel
that could be involved in interaction with nucleic acids. In general, a
desirable automatic analysis capability would yield means to detect and
quantify channels of any number of mouths. In fact, this is a quite
direct application of the alpha shapes theory (Edelsbrunner et
al., 1995
; Liang et al., 1998
). Fig. 8, top row, presents a
gallery of views of Gal-6 in which those tetrahedra that define the
passing channel have been selected. Therefore, the particular shape
features of the channel can now be studied in detail. Analogously, in
Bárcena et al. (2001)
it is described that the
channel of DnaBC was closed at that resolution. Thus, a large cavity
along the symmetry axes of the specimen was formed. Fig. 8,
bottom row, shows a set of views of the alpha complex of
DnaBC where solid tetrahedra fill the inner cavity. Now, it is
straightforward to automatically analyze the local properties of this
cavity (for instance, its total measured volume is 15,329 Å3).
|
In a database context, the alpha shape representation allows us to easily pose complex geometric queries such as "Retrieve those specimens that present a passing channel with a diameter compatible with dsDNA."
The examples shown in Fig. 8 are not intended to be an exhaustive presentation of the potential applications opened by the alpha complex volume representation model proposed in this work. Instead, they are aimed at clearly illustrating some of their possibilities. They will be further explored and applied in future work.
| |
CONCLUSIONS AND FUTURE WORK |
|---|
|
|
|---|
In this work we propose a new way to represent low-resolution density maps that encodes, in a simple and compact manner, key information on their density, topology, and geometric characteristics. The proposed data representation allows for fast and accurate computations of macromolecular structural features while it significantly improves the required space that is needed to store each single dataset. It represents the first step toward modeling, storing, and efficiently querying volumetric images in a database infrastructure.
The approach consists of a two-step algorithm. First, a newly developed vector quantization technique termed kernel c-means is applied to select a point set (pseudo-atoms) within the macromolecule that best approximates the probability density function (pdf) of the density values of the original density map. The second step of the method is devoted to define connectivity relationships among points of the point set. To this end, the alpha shapes theory and public available software were used. It has been demonstrated that the alpha complex obtained using a value for alpha directly associated to the quantization error of the pdf estimation using kernel c-means closely approximates the shape and topology of the original molecule.
As a consequence, the resulting data structure is especially suitable for posing complex queries over large datasets of medium-resolution structures involving density, geometry, and topology characteristics. Queries such as "Retrieve those specimens that present a passing channel with a diameter compatible with dsDNA" can now be posed in an accurate and efficient manner. Other type of queries about shape similarity or about accessibility and measures of cavities are also possible.
However, it would be desirable to further refine the current data representation through iterative rou