MSI Product Previous Next Contents Index Top
Converter



2       Theory


Two- to Three-Dimensional Conversion

The key to providing the ability to produce molecules from a sketch rests upon having a reliable, rapid, two- to three-dimensional conversion methodology. Several approaches to solving this problem have been used in the past. These included direct energy minimization of the initial 2D structure using a force field; assignment of bond lengths, bond angles, and dihedral angles according to rule-based logic, together with special ring perception code for complicated fused rings; and various combinations of these rules, followed by some kind of (usually crude) energy minimization.

A less conventional approach is the direct application of distance geometry techniques to generate a 3D structure for small to medium sized molecules (up to about 100 atoms). Distance geometry has traditionally been used as a tool for generating protein structures in solution in conjunction with distance constraints derived from NOE data. Its application to small molecules is perhaps less widely known, although some published examples exist.

The basic approach taken during the two- to three-dimensional conversion process is to assign distance bounds to all atom pairs of the molecule, together with planarity and chirality (using wedged bond information) constraints, and then to supply the bounds and constraints as input to a distance geometry program for generation of 3D coordinates. In this case, the bounds and constraints are supplied to the DG-II program package (Havel 1991) for generation of 3D coordinates. The particular sequence of steps employed here involves: 1) triangle bounds smoothing; 2) Havel's metrization algorithm for improved sampling; 3) embedding into four dimensions; and 4) four-dimensional minimization followed by three-dimensional minimization, which helps to satisfy complicated chirality constraints, if they exist. The 1991 paper by Havel (and references therein) contains more complete descriptions of the distance geometry algorithms.

After the 3D coordinates have been generated, the molecular structure is appended to the output database file. The structure can be read into the Insight II software with the Molecule/Get command.


Distance Geometry

The DGII program package is based on the EMBED algorithm, which consists of the following steps:

1.   Compute a complete set of bounds via a process known as bound smoothing.

2.   Make a random guess at the values of the interatomic distances, from within the bounds obtained by bound smoothing. This process is known as metrization.

3.   Find atomic coordinates such that the distances calculated from these coordinates are a "best-fit" to this guess. Because this best-fit is computed by a projection method, this step is called embedding.

4.   Minimize the deviations of the coordinates from the distance bounds as well as the chirality constraints (see below) by any of the many available methods. This step is called optimization.

To avoid local minima in the optimization step, it is sometimes desirable to compute 4D coordinates for use in Steps 3 and 4 (see Computation of Distances in Four Dimensions). Since all the mathematics on which these procedures are based carry over without change to higher dimensions, it is sufficient to restrict ourselves to three dimensions in what follows.

Describing Molecular Conformation

The fundamental premise on which distance geometry is based is that the set of accessible conformations can be aptly described by means of suitable distance and chirality constraints.1 The first are simply lower and upper bounds on the interatomic distances; the second include the handedness of the asymmetric centers in the molecule, together with the planarity of any sp2 hybridized centers. This description, called a distance geometry description, is essentially the computer analogue of a CPK model. A large body of mathematical results exists that can be used to better understand these descriptions, much of which can be found in the recent book by Crippen and Havel (1988).

Covalent Distance Constraints

The local covalent structure of a molecule is easily defined by distance constraints. Unless you are dealing with a highly strained ring system, it is sufficient to use only exact distance constraints in which the lower and upper bounds are equal. For example, the distances among covalently bonded pairs of atoms are determined with high precision by the order of the bond and the types of the atoms it connects. Similarly, the bond angles can usually be determined from the covalent structure. For fixed bond lengths there is a one-to-one relationship between the bond angle and the geminal distance, so that these distances can also be determined. This relation between the geminal distance d13 and the bond angle is given explicitly by the law of cosines:

Eq. 2¯1

where l13 = | d12 - d23 | and u13 = d12 + d 23 are called the lower and upper triangle inequality limits, respectively.

Vicinal Distance Constraints

Similarly, when the bonded and geminal distances are fixed, there is a one-to-one relation between the absolute value of the torsion angle and the vicinal distance, given by:

Eq. 1            

where l14 and u14 are the cis and trans limits on the 1,4 distance, here known as the tetrangle inequality limits. In particular, by setting all the vicinal distances across a double bond to the appropriate cis or trans limits, we can force the double bond to have any desired stereochemistry. We can even use inexact distance constraints to allow a specified amount of wobble in the torsion angle about the double bond.

Chirality Constraints

The chirality 1234 of an ordered quadruple of points numbered 1, 2, 3, 4 is given in terms of their Cartesian coordinates by the sign of the following determinant:

Eq. 2            

It is well known that the absolute value of this determinant is six times the volume of the tetrahedron whose vertices are the given points, and hence the value of the determinant itself is known as the oriented volume. Note also that the sign of oriented volume changes whenever the x, y, or z coordinates are multiplied by -1, i.e., whenever the tetrahedron is reflected in a plane. This shows that the chirality of the quadruple of atoms bonded to an asymmetric carbon atom is capable of distinguishing between a molecule and its mirror image. In the special case of a planar quadruple of atoms, the volume and hence the chirality are zero.

More generally, if we know the chirality of every quadruple of atoms in a molecule that is constant throughout all the molecule's conformations, then (together with the above-mentioned distance constraints) we know everything there is to know about its stereochemistry. Thus, if you are computing an unknown structure and you constrain the oriented volumes in it to have the correct signs, you can be sure that it also has the correct stereochemistry. These sign conditions are called chirality constraints. Although the chirality of a quadruple changes sign if any pair of atoms in it are swapped, for most purposes we can just number the atoms in any order and use this same order for the atoms in each quadruple.

Steric Distance Constraints

Since two atoms cannot be in nearly the same place at the same time, in order to obtain reasonable conformations it is also necessary to impose lower bound constraints on the distances between all pairs of atoms separated by more than three bonds. For the sake of simplicity, these lower bounds are generally set to the sum of suitable hard sphere radii, i.e., lij = ri + rj.

Triangle Bound Smoothing

This step of the EMBED algorithm predicts the range of values that the interatomic distances can assume in any conformation of the molecule that satisfies the given distance bounds. Hence these ranges, known as the Euclidean limits, are necessarily at least as tight as the given bounds. The calculation of the Euclidean limits is a difficult and unsolved problem, but it is possible to calculate loose approximations. The term "loose" means that these approximate limits are wider than the actual Euclidean limits, so that no conformations are excluded by them that were not excluded by the given bounds. In addition to its role in the EMBED algorithm, bound smoothing can sometimes detect inconsistencies in the bounds. Converter uses a program in the DGII package for bounds smoothing that employs triangle inequality.

Triangle inequality bound smoothing is based upon the well known triangle inequality among the distances:

Eq. 3            

for all triples of atoms i, j, k. It follows that if dik uik and djk ujk, then:

Eq. 4            

so if uij > uik + ujk, then uij > dij, and hence uij can be replaced by the upper limit uik + ujk on dij without eliminating any conformations that satisfy the constraints dik uik and djk ujk.

Similarly, since the distances also obey the inverse triangle inequality:

Eq. 5            

it follows that if lij < lik - ujk, then we can replace lij by its lower limit lik - ujk. Note these two substitutions are valid even if no lower and/or upper bound is available on the distance dij, since then we effectively have lij = 0 and/or uij = . Moreover, if such a lower limit exceeds the upper bound (or an upper limit) on the same distance, then we have essentially proved by contradiction that there exists no conformation that satisfies the bounds, i.e., that the bounds are geometrically inconsistent (as defined above). We call this a triangle inequality violation.

It has been shown that the exhaustive application of these two substitutions results in a unique set of complete bounds on all the distances, called the triangle inequality limits. These can be efficiently computed by converting the problem to a shortest-paths computation in a directed digraph (network) and then applying Floyd's shortest-paths algorithm (Dress and Havel 1988). This algorithm is O(N)3, meaning that the amount of computer time it requires can increase as, at most, the cube of the number of atoms N. In the event that a triangle inequality violation is found, the shortest paths make it possible to isolate a generally small set of bounds wherein the erroneous constraint must be found.

Metrization

With a sufficiently complete and precise set of distance constraints, reasonable coordinates can be obtained simply by choosing the distances independently from between their lower and upper limits with a uniform distribution. To obtain the widest possible sampling of conformation space, however, a procedure known as metrization is advisable (Havel 1990). The basic idea behind it is simple enough. Since Euclidean distances obey the triangle inequality, a better initial approximation should be obtained by taking this into account. Moreover, since the triangle inequality induces correlations among the distances, the structures that best fit these distances should exhibit large-scale, correlated differences among their parts and hence should be more diverse.

To see how metrization itself works, it is necessary to know several things. First (as shown by Dress and Havel 1988), the triangle inequality limits are equal to the extremes that the distances can assume in any metric space2 consistent with the bounds. Second, since the set of all metric spaces is convex, all values of the distances between their triangle inequality limits actually occur in some metric space consistent with the limits. It follows that if we arbitrarily set any distance to any value between its limits, we may then set the corresponding upper and lower bounds to that value and recalculate the new triangle inequality limits. Repeating this procedure until all the distances have been set thus gives us a metric space whose distances lie within their triangle inequality limits (and hence the given bounds as well).

One problem with this procedure is that if the triangle inequality limits among N atoms must be recalculated from scratch each time one of the ca. N(N - 1) \xda 2 distances is set, the amount of time required overall increases as O(N5). Fortunately, if the distances are chosen in a certain order, it is possible to improve this to an O(N3) algorithm. This order is dictated by the use of a data structure known as a shortest-paths tree, which enables the limits from one atom to all the others to be efficiently recomputed after each change. Thus, the distances from one atom to all the other atoms must be set first, followed by the distances from the next atom to all the remaining atoms, and so on. With this order, the algorithm is known as prospective metrization. It is, however, also possible to compute the shortest-paths tree from each new atom to all the preceding atoms, fix these distances, and proceed to the next atom, and so on; in this case the algorithm is called retrospective metrization.

Because the previously chosen distances affect subsequent choices, to obtain good sampling it is necessary to use a different random numbering of the atoms for each new distance matrix chosen. In fact, in DGII the order of the atoms is permuted yet again each time a new shortest-paths tree is computed, just to make absolutely sure no bias arises from the use of the same order in every tree. This is done automatically by the program.

Under some circumstances it may be necessary to control the overall density of the atoms. This can be accomplished by skewing the distances towards higher or lower values each time one is chosen. An exponential probability density Pr(dij = x) exp (,x) is used and the exponent is chosen so that the expected value of the distance is given by:

Eq. 6            

lij and uij are the current limits on dij. Thus, as the dilation factor , the expected value dij max (lij, uij) = uij, while as 0, the expected value approaches the geometric mean of the lower and upper limits, . If = 1, then dij = (lij + uij) \xda 2,

and we obtain a simple uniform density. (Proofs of these statements are left to the reader.) By averaging multiple independent selections, the distribution can also be biased towards the mean.

Embedding

The basic idea behind embedding is that if you can compute coordinates whose distances are a close fit to the trial distances, then, since the trial distances obey the given distance constraints, the distances calculated from these coordinates ought to at least come close. Of course, if you had to rely on optimization to compute these coordinates, you might as well just fit the distance constraints directly, but it turns out that matrix best-fits such as these can often be computed exactly by eigenvalue methods (Crippen and Havel 1978).

First, suppose you have been given a matrix of the squared distances among N + 1 points numbered 0, ..., N in 3D space:

Eq. 7            

Consider the eigenvalue decomposition of the N X N metric matrix:

Eq. 8            

where W = [wi, ..., wN] are the eigenvectors and = diag (1, ..., N) is a diagonal matrix of the eigenvalues in decreasing order. Since M can be written as the product of the rank three matrix P = [p1 - p0, ..., pN - p0] with its transpose, it itself is positive semidefinite of rank (at most) three, so that 1, 2, 3 0 and 4, ..., N = 0. If you now let 1 \xda 2 be the 3 X 3 diagonal matrix of square roots diag (11 \xda 2, 21 \xda 2, 31 \xda 2) and Q = [q1, ..., qN] = 1 \xda 2 [w1, w2, w3], then the 3D Cartesian coordinates given by qi for i = 1, ..., N and q0 = 0 have the same metric matrix as the points pi, i.e.:

Eq. 9            

As a consequence:

for i = 1, ..., N, while for 0 i j 0, you have:

Eq. 10            

Thus, the distances obtained from both sets of coordinates are also the same, so the coordinates differ only by translation and rotation. The orthogonality of the eigenvectors, however, implies that the new coordinates qi are principal axis coordinates.

Of course, the principal axis coordinates can be found more efficiently by diagonalizing the inertial tensor. The advantage of the above method is that it can be applied even if we do not have coordinates for the points, but only the distances among them. This gives you a means of converting those distances into coordinates directly, which (unlike triangulation and related techniques) is very insensitive to errors in the distances. Although the coordinates obtained from arbitrary distances are usually (N - 1)-dimensional, it can be shown (Crippen and Havel 1988) that, if only the three largest eigenvalues are used, the 3D coordinates obtained in this way constitute an optimum fit to the distances in the sense that the rms difference between the metric matrix calculated from the random distances 1 \xda 2 [d20i + d20j - d2ij] and the metric matrix calculated from the resultant coordinates [qi · qj] is minimum.

Since the distances to the zero-th atom affect every element of the metric matrix, the zero-th atom contributes more to this fit than the other atoms, and so, once again, further steps are necessary to obtain good sampling. This could be done by randomly renumbering the atoms as was done in metrization, but a more elegant and efficient method of eliminating this preference for any one atom is to invent a new atom at the centroid to serve as the zero-th atom. This is possible because the distances to the centroid can be calculated directly from the distances among the atoms by means of the formula:

Eq. 11            

As can be seen, all the atoms except the i-th appear symmetrically in this formula, and hence no one atom is preferred as a whole. In addition, the use of the centroid as the origin further improves the stability of the algorithm.

While the most efficient way of calculating the largest eigenvalue of a matrix to machine precision is inverse iteration, such precision is not worthwhile in the present context. For this reason a variant on the classical power method is employed, which uses Tchebychev polynomials instead of simple matrix products to accelerate the convergence. Starting from a random vector v, the product of these matrix polynomials with v can be computed recursively as follows:

Eq. 12            

As k , the polynomial product Tk (M) v converges to an eigenvector w1, whose eigenvalue 1 is maximum in magnitude. The next-largest eigenvalue is then found by applying the same procedure to the deflated matrix:

Eq. 13            

and so on. The overall time required by this procedure increases as only O(N2), on average.

If the available distance bounds are very loose, the trial distances are often far from Euclidean, in which case one or more of the three eigenvalues of maximum magnitude may be negative. When this occurs, that eigenvalue is skipped, the metric matrix deflated with respect to it, and the next eigenvalue is computed, etc., until three nonnegative eigenvalues have been found.

Optimization of the Coordinates

To reduce the violations of the constraints by the embedded coordinates to an acceptable level, further optimization is necessary. This involves minimizing a function that measures the total violation of the constraints by the coordinates, which is called an error function. These minimizations can be done using any gradient method, e.g., conjugate gradients. Because the coordinates obtained by embedding are much better than random, multiple stages involving different error functions are not necessary to obtain good convergence.

The Error Function

The error function used by DGII is given by:

Eq. 14            

where:

Eq. 15            

Eq. 16            

Eq. 17            

Here, Aij, Bij, and Cijkm enforce the hard-sphere lower bounds, the remaining lower and upper bounds, and the chirality constraints, respectively. The parameters , , and are the weights of the various terms relative to the upper distance bounds, while (typically 1.0) prevents the upper bound errors from getting too large when small upper bounds are present. The function vijkm (P) is the oriented volume, as defined above, and lijkm, uijkm are the lower and upper limits on its value derived from the distance, chirality, and torsion angle constraints.

This error function is relatively smooth and its gradient can be evaluated very fast, both of which are important advantages in global optimization. A full matrix version of the above function is used to evaluate the error with respect to the full set of distance limits obtained by bound smoothing. Variations on this function exist for 3D and 4D, as is described below.

Computation of Distances in Four Dimensions

A specialized method of avoiding local minima is available with DGII: the computation of distances in four dimensions.

The 4D coordinates are embedded and the 4D version of the error function is applied to them. The 4D error function includes an additional dimensionality error term to drive the fourth coordinates ti towards zero, and is given by:

Eq. 18            

where is a dimensionality weight. The initial values of the fourth coordinates should be scaled so that drastic fluctuations in the dimensionality error are not observed. Though it may seem strange, this artifice enables many problems with the local chirality of the molecule to be fixed.

Minimization and Analysis

The coordinates obtained from this procedure are generally close to a minimum (hopefully a near global one) and should also be quite close to being 3D. To make them perfectly 3D while also making the residual violations of the constraints as small as possible, the coordinates are usually projected down into 3D (if necessary) and the error function minimized to a precise minimum via a conjugate-gradient procedure. The number of parameters to play with here is quite large, but the defaults generally work well enough for most problems so that a detailed discussion is not needed. The recommended number of iterations is 1000, with early termination only if the rms gradient norm falls below ~ 0.001.



MSI Product Previous Next Contents Index Top
1 The term restraints is used almost interchangeably with constraints; strictly speaking, constraints are geometric conditions, while restraints are the pseudo-potentials used during dynamics and minimization to enforce consistency with the constraints.

2 A metric space is essentially just a distance matrix whose elements obey the triangle inequality.

Last updated March 14, 2000 at 04:24PM Pacific Standard Time.
Copyright © 2000, Molecular Simulations Inc. All rights reserved.