| Xsight |


To determine a new protein crystal structure (i.e., a structure that cannot be related a previously solved protein), it is necessary to calculate an electron density map in which to build the atomic model. Calculation of the electron density map requires both the set of structure factor amplitudes, which are directly related to the experimentally measured intensity data, and a set of phase angles, which are not directly measurable. The determination of these phases is called the "phase problem" in crystallography.
To derive the phase angles needed to compute an electron density map, two main experimental methods are used in protein crystallography. The two methods are isomorphous replacement (IR or, more commonly, SIR for single isomorphous replacement and MIR for multiple isomorphous replacement) and multiwavelength anomalous scattering (MAD). Both methods require the collection of additional sets of structure factor data and make use of differences in structure factor amplitudes between the data sets to infer the protein phases.
For IR phase determination, the extra datasets are collected from derivative crystals that are isomorphous with the native crystal but which contain strongly scattering atoms specifically bound at a few sites on the protein.
For MAD phase determination, the crystal must contain a few anomalous scatterers (i.e., atoms for which the scattering amplitudes depends strongly on the wavelength of the incident X-radiation), and the structure factor data are recorded from the crystal at multiple wavelengths. These two methods are discussed further below.
For phase determination by the isomorphous replacement method, since the scattering of X-rays by an atom is proportional to its atomic number, heavy metals (for example, gold and lead) are often used to make derivative crystals that give datasets that show significant intensity differences from the native dataset. Ideally, the derivative crystal is identical to the native crystal except for the bound heavy atoms, and the observed scattering differences are entirely due to them. In most real-life cases the bound atoms cause some structural changes in the protein molecule or the crystal lattice, which limits the accuracy of the phase information that may be extracted.
A more fundamental reason why phase information from a single derivative is rarely sufficient to solve the phase problem is that SIR data only determines each phase angle to within a choice of two possible values and additional information is needed to break this ambiguity. Depending on the quality of the data from the derivative crystal, two to four different derivative data sets are often needed to calculate an interpretable electron density map.
An alternative method for obtaining phase information that is being increasingly applied in protein crystallography is to use the physical phenomena of anomalous scattering. This approach makes use of the fact that the magnitude of X-ray scattering from atoms depends on the wavelength of the incident radiation. For the common protein atoms (H, C, N, O) the effect is insignificant, but many heavier elements show measurable variations in X-ray scattering in the readily accessible wavelength range of 0.3-3.0Å.
Since synchrotron X-radiation is easily tunable, it is possible to change the diffracted intensities for a crystal containing a suitable anomalous scatter by changing the wavelength of the incident radiation. Ideally, one may use the same crystal for data collection at four or more different wavelengths, avoiding any possibility of non-isomorphism between crystals. The use of flash-freezing techniques for preserving crystals over the time needed to collect several datasets and the ability to use engineered proteins containing Se-methionine rather than ordinary S-methionine (Se shows strong anomalous scattering effects) have helped make the MAD phasing method increasingly applicable for protein crystallography.
Although the analytical machinery that is used is somewhat different for MIR and MAD phase determination, the general process flow is the same for both methods. The steps involved in phase determination are:
1. Combine and scale together the datasets so that the differences between native and derivative structure factors (for MIR) or data measured at different wavelengths (for MAD) may be obtained.
2. Determine the sites of the bound heavy atoms (for MIR) or anomalous scatterers (for MAD).
3. Refine the sites for the bound heavy atoms (for MIR) or anomalous scatterers (for MAD).
4. Compute the protein phases using the refined values for scattering from the heavy atom or anomalous scatterer sites found in Step 3 and the sets of structure factor data.
In practice, there is often some iteration between these steps. For example, once the heavy-atom sites for the first derivative have been found, calculation of a preliminary phase set may be used to assist in locating the heavy-atom sites in additional derivatives.
To obtain the differences between native and derivative datasets that are needed for IR phasing, it is necessary to place all data on the same relative scale. This task is complicated by differences in crystal quality and decay rate, absorption, and experimental geometry--all of which may lead to unwanted resolution- and orientation-dependencies in the magnitudes of the diffracted data. In particular, the elements that are used as heavy atoms and anomalous scatterers may lead to much stronger X-ray absorption than for the native crystal and also tend to make the crystal more susceptible to radiation damage. Since MIR and MAD phase determinations depend on accurate extraction of relatively small differences between datasets, it is important to parameterize and remove these scale variations as well as is possible.
For MIR phase determination, Xsight uses the Xmerge program from the XtalView system to scale the derivative to the native dataset. The program applies a resolution-dependent "temperature factor" to the derivative data in addition to the overall scale factor (a simple multiplier). Options are available to apply either an isotropic temperature factor, in which case the additional correction is purely resolution dependent, or an anisotropic temperature factor, in which case the resolution-dependence of the scaling is also parameterized along the h, k, and l directions in reciprocal space.
For MAD phase determination, the MADSYS program contains options for calculating and refining scale factors between datasets within resolution shells and according to different batches of data. The program uses parameterized local scaling to remove systematic differences between groups of reflections. The program also can account for the expected differences in the diffraction that result from differential scattering from the anomalous scatterers at different wavelengths.
A critical problem in the early stages of MIR phase determination is locating the heavy-atom sites. The only information available is the set of native protein structure factor amplitudes, |Fp|, and the sets of scaled heavy-atom derivative structure factor amplitudes, |Fph|. In the absence of any prior phase information (i.e., the situation at the start of the project) it is not possible to compute an electron density map to show the heavy-atom positions.
The most commonly used approach to locating heavy-atom sites for the first derivative is to compute a difference Patterson function. In Xsight this computation is carried out using the Xfft program from the XtalView package. The difference Patterson map is calculated by taking the Fourier transform of the set of Fourier coefficients (|Fph| - |Fp|)2 . The difference Patterson map may be understood to be a map of vectors between the heavy-atom sites. The interpretation of the difference Patterson map in terms of a unique set of atomic sites is complicated for space groups that contain many symmetry operations and for derivative crystals containing multiple heavy-atoms sites, since the number of vectors becomes large. Furthermore, the difference Patterson map is really only an approximation to the true vector map and contains intrinsic noise that increases with the number of heavy-atom sites (Terwilliger et al. 1987).
For MAD phasing it is possible to derive `|Fa|' coefficients which are the structure factor amplitudes from the anomalous scatterers alone (Hendrickson 1991). A map made from the Fourier transform of |Fa|2 is interpreted similar to the difference Patterson map, in that it shows the vectors between the anomalous scatterers in the crystal.
Although difference Patterson maps may often be interpreted by manual inspection, automated Patterson-interpretation programs are often used to provide solutions to complex problems involving multiple sites. Such programs typically operate by selecting peaks from the difference Patterson map and then attempting to find a set of heavy-atom sites that generate vectors that explain the peak set. The program available in Xsight may be understood as applying a reciprocal-space formulation of the difference Patterson search problem (Badger and Athay 1998). This method is somewhat similar to the search algorithm in the Xhercules program within the XtalView package but uses a different functional form for the correlation function and extends the search more directly to multisite cases. The search procedure encoded in this program is:
1. Search the crystal asymmetric unit for potential sites by placing a model atom at each point in the asymmetric unit, expanding this single-site model by the crystal symmetry operations and computing the square of the calculated diffraction |Fh|2 . For each position, compute the classical correlation coefficient C1 of this calculated value with the observed difference Patterson coefficients (|Fph|-|Fp|)2 . Record the positions of all sites that exceed a minimum threshold value in C1.
2. For a single-site heavy atom search, the best solution is the position that gave the maximum value of C1. For a double- or multiple-site search, take the site that gave the maximum value of C1 as site 1 and then try each of the other recorded values as site 2, evaluating the correlation coefficient C2 for each of these two-site trials. In conjunction with this two-site search, store all the potential two-site solutions for which C2 > C1. For a two-site search, the best solution is the pair of atoms that gave the maximum value of C2. For a multi-site search, the two atoms that give the maximum value of C2 are recorded as the first two sites.
3. To search for a third atom, the atoms in the list of potential two- site solutions are each used as putative third sites, and the atom that gives the maximum value of the correlation coefficient C3 is retained as site three. The multi-site heavy atom search is then continued in an analogous fashion for additional atoms until the best value Cn+1 is less than the value of Cn or a maximum of eight sites is reached.
No automated method is available that gives a completely correct solution to the difference Patterson function in all cases, and it is always wise to do a visual check that the predicted vectors from the heavy-atom solution are consistent with peaks in the difference Patterson function. Xsight contains graphical tools for calculating and displaying the interatomic vectors on the difference Patterson function.
Regardless of the method used to obtain the heavy-atom solution, an intrinsic ambiguity remains with respect to the enantiomorph of the heavy-atom constellation. For MIR phasing, the consequence of the wrong enantiomorph is a correct looking electron density map, but with amino acids, helices, and other structural features having the wrong hand. For MAD phasing, one enantiomorph gives the correct map and the other enantiomorph gives a completely uninterpretable map. To change the enantiomorph of the heavy-atom solution it is only necessary to transform the fractional heavy-atom coordinates x to -x, y to -y, and z to -z.
Once the major heavy-atom binding sites for the first heavy-atom derivative have been obtained, it is possible to obtain heavy-atom sites for additional derivatives more directly. To do this, the cross-difference Fourier map is computed. This map is obtained from the Fourier coefficients (|Fph| - |Fp|) exp(i
p), where the derivative structure factor amplitudes |Fph| now refer to the new derivative and the phases
p are the SIR phases computed from the first derivative (the phase computation is discussed below). Although the accuracy of these phases is generally quite low, this map is often sufficiently good to show peaks at the positions of the heavy atoms in the new derivative crystal. As additional sets of derivative data are included in the phase determination, the estimates for the phases become more accurate and the location of sites for each succeeding derivative is easier.
Cross-difference Fourier maps are also useful for re-assessing the current heavy atom solutions to find any minor sites that were missed in the initial search and to eliminate erroneous sites. Graphical tools are available in the Xsight module for peak-picking heavy atom sites from cross-difference Fourier maps and for editing the heavy-atom sites.
When the heavy-atom or anomalous-scatterer sites have been located, it is necessary to improve the positional, occupancy, and thermal parameters to obtain optimal protein phases. For IR phasing, the most widely used schemes employ least-squares or maximum-likelihood optimization of the heavy-atom parameters so as to minimize a weighted residual of the form [(|Fph| - |Fp|) - |Fh|]2 . In practice, some additional parameters, such as the scale factors between |Fph| and |Fp| and certain non-isomorphism parameters (which are additional scale factors that are intended to compensate for the lack of isomorphism between the native and derivative crystals), may also be included in this refinement.
Xsight uses the Xheavy routine from XtalView to carry out heavy-atom refinement for MIR phasing. This routine employs a direct correlation search over the heavy-atom parameters so as to maximize the correlation between (|Fph| - |Fp|) and |Fh|. This method is robust and has a large radius of convergence. In the current version of XtalView, the refined parameters include the heavy-atom positions, their occupancies, and overall non-isomorphism parameters. Optionally, the anisotropic temperature factors for the heavy atoms may also be refined.
For MAD phasing in Xsight, the MADSYS program carries out a least-squares refinement of the anomalous scatterer parameters (scalefactor, positions, occupancy, and temperature factor) against estimates of the structure factor for the anomalous scatterer derived from the MAD data analysis.
Once the best possible scale factors between datasets have been obtained and the heavy atom sites are fully refined, the set of protein phases may be computed.
For phasing by isomorphous replacement, the phase angles from a single derivative are given by:
In this equation the source of the phase ambiguity for SIR phase determination may be identified as the cos-1 term, which has two possible values for a given argument. Most computer programs (including the Xheavy program from XtalView, which is used for IR phase calculations in Xsight) actually calculate a phase probability distribution for each reflection. The phase probability distribution (Hendrickson & Lattman 1970) provides a convenient way for combining phase information from multiple derivatives and, possibly, other sources.
For MAD phasing using the MADSYS program, an analytical solution to the phase problem has been implemented (Hendrickson 1991). For a structure factor measured at a given wavelength
F(+h), the equation:
F(+h) = a(
)|oFa|2 + b(
)|oFt||oFa|cos(o
t - o
a) +
c(
)|oFt||oFa|sin(o
t - o
a)
has been derived. A similar equation with all the + signs replaced by - signs relates the Bijvoet mates of these data to the same set of parameters. The coefficients a(
), b(
), and c(
) are simply related to the normal and anomalous scattering factors of the anomalous scatterers. The unknown quantities are |oFt|, |oFa|, and (o
t - o
a), which represent the normal scattering components of the protein structure factor amplitude, the normal scattering component of the anomalous scatterer structure factor amplitude, and the phase difference between the phase angles associated with these two quantities. Provided that at least three datasets are available, these equations may be solved and the phase angles for the structure obtained. As with the MIR phase determination, phase probability distributions, which take into account uncertainties in the measured and derived values, are also calculated by the program.
Given the MIR and MAD phase probability distributions, it is also possible to calculate a figure-of-merit m, which indicates the reliability of each phase angle. The figure of merit is the expectation value of the cosine of the phase error. Thus a figure of merit of zero indicates no phase information, and a figure of merit of one indicates a perfect phase. The figure of merit is used as a weight in calculations of electron density maps with MIR or MAD phases, since this weighting scheme reduces the contribution of poorly phased reflections to the electron density map.

The Molecule replacement section is extracted from the program notes for the REPLACE software package, authored by Dr. Liang Tong.
Given two homologous molecules A and B, the rotation ([
]) that brings them into overlap also brings the interatomic vectors of each molecule into overlap. Crystallographically, the set of interatomic vectors is represented by the Patterson function. Hence, an ordinary rotation function (Rossmann and Blow 1962) can be defined as the overlap of one Patterson function and the rotated version of another Patterson function:
] [
] [
]
B = [C]
A + d
[
] is the deorthogonalization matrix, which converts angstrom coordinates relative to a Cartesian system to fractional coordinates relative to the crystal unit cell. [
], the inverse of [
], is the orthogonalization matrix. The integration volume (
) is usually a sphere centered at the Patterson origin. The radius R of this sphere can be chosen to exclude most of the vectors between atoms in different molecules in the crystal, which are generally longer than the inter-atomic vectors within one molecule.
The rotation [
] is usually represented by a set of polar (
,
,
) or Eulerian (
1,
2,
3) angles.For polar angles,
and
define the orientation of a rotation axis relative to a Cartesian coordinate system, and
defines the angle of rotation around this axis. For Eulerian angles, the three angles define a sequential set of rotations around the x, y, z axes of a Cartesian coordinate system. For both polar and Eulerian angles, the actual meaning of each angle depends on the convention that is used.
The function RF should reach its maximum when the rotation [
] corresponds to the rotation that brings the two molecules to the same orientation. In practice, a grid search is carried out covering the entire unique region of the rotational space and Eq. 3 is evaluated at each grid point. The grid point that gives the highest value for Eq. 3 may be the correct rotation. The rotation function values at other grid points that are evaluated in such a search give an estimate of the background noise in the calculation.
Rotation functions can be classified by the way Eq. 3 is evaluated (for example, the slow and the fast rotation functions) or by whether the two molecules reside in the same crystal (self and cross-rotation functions).
Eq. 3 can actually be evaluated in many ways. If both Patterson functions are expressed as their individual Fourier transforms:
it can be shown (Rossmann and Blow 1962) that:
is the G function, which represents the Fourier transform of a sphere of radius R, and H is the length of the reciprocal space vector h + p[C]. The function assumes the maximum value of 1 when h + p[C] = 0, and it quickly approaches and then oscillates around zero as h + p[C] deviates from zero.
Eq. 6 is sometimes referred to as the "slow" rotation function, as its evaluation is generally time-consuming. Several techniques have been developed to speed up the calculations of the slow rotation function. The rotation function value is proportional to the intensity of the reflections (Eq. 6). We can thus expect that the rotation function values are dominated by strong reflections. Therefore, only the strong reflections of crystal B (p) are used in the calculation. These reflections, also known as the large terms (Tollin and Rossmann 1966), are selected based on the criterion:
Ip
C X <Ip>
where C is the cutoff value and <Ip> is generally calculated in shells of equal reciprocal volume. A cutoff value of 1.5 usually selects about 20% of the reflections as large terms.
Given the large terms of crystal B, the summation over the reflections of crystal A (h) can be limited to those that make h + p[C] small, since otherwise the value of the G function is negligible. For each large term (p), only reflections in crystal A that are within an interpolation box centered on the reciprocal lattice point closest to -p[C] are used in the summation. The size of this interpolation box is generally 3 (reciprocal lattice points) X 3 X 3.
Calculation of the G function value based on Eq. 8 at each rotation search grid point is tedious. Tables of the G function values can be set up beforehand, either in reference to the value of H or in reference to the h, k, l index in the interpolation box, to speed up the calculation.
The values of the rotation function vary relatively slowly as a function of the rotation angle. If a reasonably fine search grid is used, it is unlikely for the neighbors of a grid point to have high rotation function values if the grid point itself has a low rotation function value. Therefore, evaluation of rotation functions using Eq. 6 can be accelerated by ignoring such grid points in the calculations (Tong 1993).
Even with the improvements described, calculation of the slow rotation function is rather time-consuming. The fast rotation function (Crowther 1972) uses a different approach to evaluate Eq. 3. The Patterson function is expanded in terms of the spherical harmonics:
Slmn (r,
,
) = jl (klnr) Yml (
,
)
where jl is the normalized spherical Bessel function of order l, and kln is chosen such that jl(klnR) = 0. Ylm is the normalized spherical harmonics function. The expansion coefficient almn can be calculated from:
where Tlmn is the Fourier transform of Slmn.
With such an expansion for both Patterson functions, and utilizing the special properties of the spherical harmonics, Eq. 3 can be simplified to:
where the rotation is represented by a set of Eulerian angles (
1,
2,
3), and the function fmm´ is dependent only on
2. Eq. 11 is a 2D Fourier transform over
1 and
3. Therefore, for any given
2, the equation can be evaluated by the fast Fourier transform (FFT) technique, thus greatly accelerating the rotation function calculation.
The calculation of the expansion coefficients almn, however, still requires some amount of computing time. The large term approach as developed for the slow rotation function can also be employed here. Test calculations have shown that a much smaller cutoff (C) value should be used (roughly 0.5) to obtain better rotation function results.
When two molecules A and B exist in the same crystal, a self-rotation function can be used to determine the orientation of the local symmetry axis relating the two molecules. Polar angles are usually preferred for representing the rotation [
] in ordinary self-rotation functions. For example, if the molecules are related by a two-fold axis,
can be fixed at 180° and the rotation search can be limited to the
and
angles. The search results can then be presented as a stereographic projection.
If the two molecules are related by an improper rotation (i.e., an angle of rotation, not a multiple of 360), a full 3D rotation search may be necessary. It is usually more advantageous to represent the rotation [
] as a set of Eulerian angles in such a case. The unique region of rotational space for different space group symmetries is given by Rao et al. (1980). Once the rotation [
] is known, a set of polar angles can be extracted from it.
The ordinary rotation function always has a maximum value at the origin (
= 0 for polar angles,
2 = 0,
1 +
3 = 0 for Eulerian angles) corresponding to the overlap of the Patterson function onto itself. This value can be set to be a specific constant (for example, 999) and a scale factor then applied to all the rotation function values to make the maximum 999. This brings the self-rotation function to an "absolute" scale. The rotations corresponding to the crystallographic symmetry elements should all produce rotation function values of 999.
If two molecules are in two separate crystals, the ordinary cross-rotation function can be used to determine the relative orientation of the two molecules. The two crystals may have been prepared experimentally, or an artificial crystal can be prepared by placing a model, usually the structure of a homologous protein, into an arbitrary unit cell. In the latter case, the dimensions of the unit cell are usually chosen large enough so that the intermolecular vectors are longer than the intramolecular vectors to avoid their interference in the rotation function calculations. Cell dimensions that are twice the dimensions of the model should generally suffice.
Eulerian angles are generally preferred for the cross-rotation function. The unique region of rotational space for different space-group symmetries of the A and B crystals is given by Rao et al (1980).
Sometimes it is known that a direction in crystal A is aligned with a direction in crystal B (for example, the alignment of the local two-fold axis of a dimer in two different crystal forms). In such a case, a pre-rotation can be used first to align the two directions. Subsequent rotation searches are then carried out in polar angles, varying only the angle
.
Many macromolecular crystals contain assemblies of the macromolecule obeying a simple (local) point-group symmetry. The ordinary self-rotation function can be used to determine the orientation of each local symmetry axis individually. However, if the symmetry of the point group is known, orientations of all the symmetry elements of the point group can be determined at the same time, using the so-called locked self-rotation function (Tong and Rossmann 1990). It assumes a point-group symmetry for the macromolecular assembly and maintains that symmetry throughout its searches (i.e., the local symmetry is locked to the one specified at the outset).
A standard orientation is defined for the local symmetry point group. For example, a 222 point-group symmetry can be defined with its three two-fold axes along the x, y, and z axes of a Cartesian coordinate system. If [In](n = 1,...,N) represents the set of local symmetry rotation matrices in the standard orientation, the rotation matrices after a rotation [E] has been applied is given by:
n] = [E] [In] [E]-1
An ordinary self-rotation function value Rn can be calculated corresponding to each of the rotation matrices. The locked rotation function value RL is defined as the average of the individual values:
Locked self-rotation function calculations are generally carried out in terms of Eulerian angles. However, if a local symmetry axis is known to be aligned with a direction in the crystal, polar angles can be used.
The locked rotation function simplifies the self-rotation searches for large macromolecular assemblies (for example, viruses). Moreover, it also leads to a reduction by a factor of about
in the background noise level of the rotation function as compared to the ordinary self-rotation function.
The slow rotation function is employed in the locked self-rotation function, since it requires Eq. 3 to be evaluated at specific sets of angles.
The locked self-rotation function can also be used when the macromolecular assembly does not obey a simple point group. In such cases, however, it is usually difficult to know the relative arrangements of the individual local symmetry elements beforehand.
The presence of local symmetry can also be utilized in the cross-rotation searches if only the monomer is used as the model. For every rotation [E] that brings the search model into the same orientation as that of one of the molecules in the assembly, there should be a set of other rotations that brings the search model into overlap with other molecules in the assembly. Assuming that [In](n = 1, ... ,N) are the local symmetry rotation matrices in crystal B, which can be determined from either ordinary or locked self-rotation function calculations, the set of rotations is given by:
] [In] [E] [
]
Therefore, a locked cross-rotation function can be defined as the average of the individual ordinary cross-rotation function values evaluated with the rotation matrices [Cn].
Rotation function calculations usually proceed in two steps.
In the first step, the unique region of the rotational space is searched to find the sets of angles that maximize the rotation function. Self-rotation function calculations are usually limited to sections of constant
; the slow rotation function can be used with grid intervals along
and
of 2° to 3°. For ordinary cross-rotation function calculations, the fast rotation function should be used in this initial search. If the results from the fast rotation function are not clear, try using a different radius and see if that makes a difference. (This is due to a peculiarity in the calculation of the almn coefficients in the fast rotation function.) The slow rotation function can also be used but it takes more CPU time. Since the slow rotation function uses a completely different approach, it should certainly be tried if results from the fast rotation function are unsatisfactory. A relatively coarse search grid (generally 3° or 4° intervals) is needed to reduce the total number of grid points in the search.
Once an estimate for the set of angles is obtained, the rotation search should be repeated using finer grid intervals (0.5°-1°). The slow rotation function is used. The search should be limited to a small region (3°-5°) around the values for the rotation angles as determined by the initial search. This gives optimized values for the rotation angles.
If the local symmetry is two-fold, the translation element along this axis d// can be determined (Rossmann et al. 1964):
This calculation does not need the phase information of the reflections.
Once phase information is available, the location of the local symmetry axis in the unit cell can be determined based on the maximal electron-density overlap of the subunits related by the local symmetry (Blow et al. 1964; Tong 1993).
If sA and sB are the centers of the two molecules:
The electron density overlap of the two molecules:
where the integration volume is a sphere of radius R centered at sA, can be written as:
Given the rotational relationship between molecules A and B and an estimate for the center of molecule A, the center of molecule B can be determined by a Fourier transform (Eq. 17). The two molecules can reside in the same crystal or in two different crystals.
Once initial estimates for [C], sA, sB are available, Eq. 17 can be used to optimize these parameters. A fine search can be carried out around the current values for [C] and sA (or sB, only one needs to be used in the search) to obtain values that maximize the correlation.
If molecules A and B reside in the same crystal, the position of this local symmetry axis can be determined. Assuming s lies on this axis:
where d// is the translation element along this local symmetry axis (see Eq. 15). Eq. 17 then becomes:
Therefore, the center of the local symmetry axis can be determined by a Fourier transform. This function producez a long, sausage-shaped density corresponding to the position of the local symmetry axis. To obtain more accurate parameters for the position of the local symmetry element, use Eq. 17. After maximization of the overlap, the rotational parameters of the local symmetry axis is given by [C]. The position of the local symmetry axis is given by (sA+sB )/2, and d// is given by the projection of sB-sA along the direction of the local symmetry axis.
Protein samples often crystallize into two (or more) different space groups. The orientational relationship between a molecule in the first crystal form and one in the second can be determined by a cross-rotation function between the two crystal forms. Thus, we have the following relationship:
C] [RBC] [
B] yB
where the two crystal forms are B and C.
Now a cross-rotation function can be calculated between a search model and crystal forms B and C simultaneously. The two individual rotation functions are:
B] [RAB] [
A] yA
C] [RAC] [
A] yA
Noting the relationship between the B and the C crystals, we should have:
C] [RBC] [
B] yB
C] [RBC] [
B] [
B] [RAB] [
A] yA
C] [RBC] [RAB] [
A] yA
The cross-locked rotation function (or triple rotation function) can be defined as the product of the rotation function values corresponding to matrix [RAB] between the A and B crystals and matrix [RBC][RAB] between the A and C crystals.
The above derivation can easily be generalized to cases where there is more than one rotational relationship between crystals B and C and where there are more than two crystal forms (e.g., a quadruple rotation function).
This function is defined as the product of the electron density for two copies of the same molecule. It depends on both the rotational and the translational relationships between the two copies (and in this sense the function is more than just a rotation function).
The relationship between the two copies of the molecule can be written as:
If SA and SB are the centers of the molecule in crystals A and B, respectively, then:
The correlation between the two molecule copies should be a function of [C], SA, and SB:
Given the orientation of a symmetry axis and the translational elements along the axis, the location of the symmetry axis in the unit cell can be determined by correlating the electron density of the two molecules that are related by the symmetry axis.
The local symmetry axis can be written as:
where d|| is the translation element along the direction of the symmetry axis, and d
is the translation element in the plane perpendicular to the symmetry axis. If s is a point lying on the symmetry axis, it follows that:
The translation function is defined as a function of s:
Given the orientation of a symmetry axis and an input position belonging to one of the molecules related by the symmetry axis, other positions related to the input position by the local symmetry axis can be calculated.
The local symmetry axis can be written as:
If s is the input position, a translation function can be defined in terms of its symmetry-related position, s
´= [C]s + d:
Translation searches are used to determine the location of molecules (with known orientations) in a crystal unit cell. The search is normally formulated as a minimization or maximization of certain indicators by varying the positional parameters of the molecules in the unit cell. Most indicators involve a comparison between the observed and calculated structure factor amplitudes. For example, the difference between the observed and calculated structure factor amplitudes (or the R factor) can be minimized. Alternatively, the correlation between the observed and calculated structure factor amplitudes can be maximized.
The crystallographic R factor is often used as an indicator in translation searches. It is a measure of the percentage difference between the observed and calculated structure factor amplitudes:
A similar R factor can be defined based on the square of the structure factor amplitudes, for example, and R factor based on intensity:
In the equations above, kF and kI are scale factors, usually calculated in shells of equal reciprocal volume, that bring the observed and the calculated structure factors to the same level:
The R factor values are very sensitive to the position and orientation of the molecules in the unit cell. The correlation coefficients between the observed and calculated structure factors are somewhat less sensitive. Hence they may allow for larger errors in the reflection data as well as in the orientational parameters of the molecules. As for the R factors, the correlation coefficients can be defined based on the amplitude or the intensity of the reflections:
In the equations, N is the number of reflections that are used in the calculation. Unlike the R factors, the correlation coefficients do not depend on the scale factors between the observed and the calculated structure factors.
Note on the region of the unit cell that should be covered for a translation search with the above indicators
Because the molecule to be searched has a determined orientation, it can only reside in one of the asymmetric units in the unit cell. Lacking knowledge as to which asymmetric unit the molecule is occupying, the entire unit cell would need to be searched. However, most space groups have alternative origins, which means that the position of a molecule in the unit cell can only be determined to within certain sets of translations. For example, in space group P212121,, there are eight alternative origins, at (0,0,0), (1/2, 0, 0), (0, 1/2, 0), (0, 0, 1/2), (1/2, 1/2, 0), (1/2, 0, 1/2), (0, 1/2, 1/2), and (1/2, 1/2, 1/2). This implies that the region that should be searched to locate a molecule need only be 1/8 the volume of the unit cell. Once the first molecule has been located, the origin is determined as well. The search for subsequent molecules needs to cover the entire unit cell.
To calculate R factors and correlation coefficients for a translation search, structure factors need to be calculated for molecules at different positions in the unit cell. The orientations of the molecules are determined beforehand with rotation searches (e.g., with the program GLRF). Therefore, only the positional parameters of the molecules need to be varied.
The structure factor equation can be written as a double summation--first over the atoms in the asymmetric unit of the unit cell, and then over all the asymmetric units:
where j goes over all the atoms in the asymmetric unit and n goes over all the asymmetric units. The nth crystallographic symmetry operator is given by:
where Tn is the rotational component and dn the translational component of the symmetry operator.
For simplicity, consider only one molecule in the asymmetric unit. In the translation search, the molecule is placed at different positions in the unit cell:
where x0 is a translation vector that relates the molecule located at x´j with that located at xj. Substituting x´j in Eq. 42 for xj in Eq. 39:
where fh,n is the structure factor calculated based on the nth symmetry-related molecule only:
Therefore, the summation over the atoms in the structure factor calculation, a rather time-consuming process, needs to be performed only once, for the molecule at a reference position. Subsequent structure factor calculation after translation of the molecule from the reference position no longer depends on the number of atoms present in the unit cell (Eq. 42). The reference position is usually chosen with the center of the molecule at (0,0,0). Then the translation vector that is determined from the translation searches defines the center of the molecule in the unit cell.
Eq. 42 can be generalized to allow for the presence of other molecules that are to remain stationary during the translation search:
where Ah is the contribution from the stationary molecules. This formulation is useful if there are more than one molecule in the asymmetric unit. The position of one of the molecules can be determined first and the model is then input as a stationary molecule for the search of the position of the next molecule.
Besides the R factor and the correlation coefficient, another commonly used translation-search indicator is the correlation between the observed and the calculated Patterson maps. The correlation is defined as the product of the observed and the calculated Patterson functions integrated over the unit cell,

e-2
i(h+p)udu
is 1 when h+p = 0 and is 0 otherwise.
The calculated structure factor is a function of the translation vector x0 (Eq. 44). Combining Eq. 44 and Eq. 45:
The first two terms in Eq. 46 contribute a constant to the correlation and hence are omitted from further calculation. It can be seen from Eq. 46 that reflections (h, k, l) and (-h, -k, -l) contribute equally to the correlation. The translation function is therefore a Fourier transform, with hTn and h(Tn - Tm) as indices, and coefficients defined by Eq. 47.
Sometimes it may be desirable to use a subset of the symmetry-related molecules in the translation function calculation. For example, in space group P41212, molecules at (x, y, z) and (-x, -y, 1/2 + z) can be used in the calculation. The translation function Fourier transform will have "reflections" with indices (2h, 2k, 0), hence only the z = 0 section needs to be calculated. More importantly, the resolution of the translation function transform in this case is twice that of the reflection data used in the structure factor calculation.
The intramolecular vectors can be removed from the observed Patterson map by subtracting the structure factors calculated from individual symmetry-related molecules (Eq. 43) from the observed structure factors:
where k is a weighting factor on the intramolecular vectors. It has been suggested that only the phases, rather than both the phases and the amplitudes, of the calculated structure factors be used in the translation function calculation (Eq. 47).
If an atomic model needs to be placed in an electron density map that has been obtained through other methods (for example, the MIR method), the electron density correlation translation function can be used. It is defined as the product of the observed and calculated electron density functions integrated over the unit cell. The derivation of the function is similar to that of the Patterson correlation translation function:
Therefore the electron density correlation translation function is also a Fourier transform.
The following relationships among Fh and fh,1 are used by the program:

Density modification techniques are used in macromolecular crystallography to improve an initial set of phase angles (usually obtained from multiple isomorphous replacement experiments) in order to calculate more interpretable maps for structure determination.
The usual density modification scheme is to:
1. Modify the initial map by imposing some physical restraint on the (real space) density values.
2. Perform a Fourier transform on the modified map to calculate structure factors and new phase estimates.
3. Calculate a new map by combining the observed structure factor amplitudes with the new phases.
In practical applications, weighting schemes are often used to weight the structure factors according to the expected phase accuracy, and the modified phases obtained are combined statistically with the initial phase set.
Density modification methods are usually applied in a series of iterative cycles. The procedure usually converges in a few cycles to a map that is reasonably consistent with both the experimentally measured structure factor data and the real-space restraint. This convergence is, however, not necessarily a sign that the resulting phase angles have been improved (Fenderson et al. 1990).
The density restraints most often used in macromolecular crystallography are:
Although the nature of the Fourier transform makes it fairly obvious that modifying a density point in real space will change all the structure factor amplitudes and phases in reciprocal space, the mechanism of density modification schemes may be understood in a more concrete fashion by using a simple example.
Suppose the density modification scheme is a solvent flattening operation that leaves density in the protein volume unchanged and flattens all features in the solvent volume to zero density. This operation can be described by multiplying the density map by a mask containing values of 0 at pixels in the solvent volume and values of 1 at pixels in the protein volume. Mathematically, this operation may be written:
'x =
xmx
where mx represents the mask function that operates on the electron densities and the prime superscript is used to denote a modified density.
In reciprocal space, the multiplicative real-space operation corresponds to a convolution of phased structure factors with the Fourier transform of the mask function:
where Mh-k is the Fourier transform of the mask function and the Fk are phased structure factors at reciprocal lattice points k. The convolution operation creates a new structure factor vector F'h by operating on a group of structure factors. The phase angle part of this vector corresponds to the phase that would be obtained after taking the Fourier transform of a solvent-flattened map.
For solvent flattening, the mask function mx is the molecule envelope, which has a large extent in real space (the diameter of a small protein is ~ 20Å). Therefore, the function Mh is narrow in reciprocal space, with significant magnitude across only a few reciprocal lattice points. For this reason, solvent flattening provides strong phase relationships only between reflections that are close to each other in reciprocal space. This explains why density modification methods work best for problems involving complete data and randomly distributed phase errors. Missing structure factor data leaves holes at some values of Fk and consequently causes errors in the calculation of F'h. Furthermore, if all the structure factors within a region of reciprocal space are poorly phased, the convolutional relation does not adequately couple these data to the more distant reflections that are accurately phased.
The primary component of a solvent flattening algorithm is the mask that is used to define which map grid points belong to the protein and which map grid points belong to the solvent. The aim of the method is to remove density fluctuations from the solvent region of the crystal, since these density features are much more likely to be due to phase errors than to any genuine non-uniformity in the solvent density.
Automatic mask generation algorithms have been developed for defining which pixels belong to the protein and solvent volumes in the crystal (Wang 1985; Reynolds et al. 1985). These methods are based on evaluating the statistical properties of the density (that is, the average density value or the density variance) in small regions around each grid point.
The algorithm accessible within the Xsight/Modify_Density pulldown uses the fact that grid points contained within a volume where the average density variation is high are likely to be protein. In this algorithm a grid of local density variances is calculated and then smoothed to create contiguous regions with high and low density variance. The expected solvent fraction in the crystal can be calculated using a relation between the protein molecular weight and the crystal cell dimensions (Matthews 1968).
where Vm is the crystal volume in cubic angstroms per unit protein molecular weight. Typical values of V(solv) for protein crystals are 0.34-0.65. From the expected solvent fraction a threshold is set in the smoothed variance map to divide the volume into the appropriate solvent and protein volumes. Grid points in these two volumes are flagged in binary fashion in the protein mask. Since automatic algorithms for defining the protein/solvent envelope tend to leave some defects in the envelope (typically small "solvent" holes within a volume considered protein or small volumes of "protein" in the centers of solvent channels) manual editing tools have been provided within the Xsight/Modify_Density pulldown for removing these errors.
The success of a solvent flattening run depends on the accuracy of the protein/solvent envelope and the size of the solvent volume. The phasing power is expected to be proportional to the fraction of the crystal filled by solvent.
A type of density modification that has become popular in recent years is histogram matching. In this method, electron density values in the protein region are rescaled so that a histogram of their values is matched to an ideal histogram of the protein density obtained from a known structure at the same resolution (Zhang and Main 1990). For all globular proteins, histograms of the density distribution are expected to be very similar. Experimentation with the method indicates that this type of density modification may give a comparable level of phase improvement to solvent flattening (Zhang 1993).
The histogram matching algorithm in Xsight uses a set of standard protein histograms computed from the sulfoxide dismutase structure. A set of histograms for the solvent region are also available. The effect of the solvent histograms is to suppress density features in the solvent area in a less drastic way than with conventional solvent flattening.
Averaging electron densities over protein sub-units related by non-crystallographic symmetry is an extremely powerful method for phase refinement and extension in cases with many sub-units. An analysis of the problem (Arnold and Rossmann 1986) suggests that the phasing power varies as the square root of the number of protein sub-units.
Theoretical and algorithmic considerations for the use of non-crystallographic symmetry averaging have been established by Bricogne (1974, 1976). On modern computers the implementation is simplified since there is usually sufficient memory to store the whole electron density map in the computer's virtual memory.
Important concerns for the application of these methods are the accuracy with which non-crystallographic symmetry matrices are known and the accuracy of the interpolation operations used to get densities for averaging and crystal cell reconstruction. Most non-crystallographic symmetry averaging software (including the routines available in Xsight) allow the user to carry out a local refinement of the non-crystallographic symmetry relations before running cycles of density averaging. In the routines within Xsight, the symmetry operations are refined by carrying out a local density correlation search. A grid-spacing that is 4.5 times the high resolution limit of the structure factor data is used for the interpolations that are used in the density averaging process.

In macromolecular structure refinement one adjusts the atomic model parameters (atomic positions, mobility factors, and occupancies) to improve the agreement of the set of calculated structure factor amplitudes, Fcalc, with the set of observed structure factor amplitudes, Fobs. The function that has usually been minimized is the residual:
where Wh represents a weighting function and kF is a scale factor relating observed to calculated structure factor amplitudes.
A more sophisticated statistical treatment of the crystallographic refinement problem based on the concepts of maximum likelihood has recently led to the development of new target functions which are significantly more effective for the refinement of partial models and the refinement of models that contain significant errors (Pannu and Read, 1996). The amplitude based maximum likelihood target is similar to Eq. 55 but with kFcalc replaced by the expectation value of the observed structure factor, <Fobs>, and a maximum likelihood weighting function, Wh, that employs a set of cross-validated test data. The intensity based maximum likelihood target is similar but with intensities replacing the structure factor amplitudes. These maximum likelihood targets, and a maximum likelihood target which employs phase probabilities, were implemented within X-PLOR 98.0 and subsequent releases of the CNX and X-PLOR programs. Since the intensity based maximum likelihood target appears the most powerful and is calculated with computational cost little different from the least-squares residual target, the intensity based target is now the CNX program default. Furthermore, refinements using the maximum likelihood methodology in the CNX program will normally give more converged final models than refinements with ProLSQ95 since ProLSQ95 uses a least-squares residual target.
Because the resolution of the diffraction data is usually too low to accurately determine individual atomic positions, this function must be combined with a geometric restraint term or energy function to ensure a reasonable stereochemistry in the refined model. Thus, macromolecular crystallographers almost invariably optimize the structural parameters against a combined energy function that contains the structure factor "energy" term, E(X-ray), and a stereochemical energy term, E(chem).
The scale factor, K, that relates the crystallographic energy to the chemical energy term is a key parameter in structure refinement. Obviously one can produce a structure with good agreement with the X-ray data but poor stereochemistry or a structure with idealized stereochemistry but poor agreement to the data. Ideally, we wish to obtain a structure that is both stereochemically reasonable and that agrees well with the diffraction data.
In the ProLSQ95 refinement program the various energy terms are associated with expected standard deviations from ideality. The value used for the standard deviation of the X-ray energy term is often set to approximately
In the CNX program the default weighting of X-ray and geometric energies is such that the gradients of the X-ray energy term is one-third of the chemical energy term obtained from an unrestrained molecular dynamics simulation. These two methods are essentially rules-of-thumb which have been found to give reasonable results over a wide range of problems. A more objective method (albeit at the cost of some additional computation) is to use cross-validation methods to find the optimal weight for scaling the X-ray and chemical energy contributions together. In the crystallographic context, the "free R value" (Brünger 1992) has become widely used as a cross-validation method for optimizing the parameters used in crystallographic refinements. In the free R value test a subset of reflection data (typically 5-10%) are deliberately left out of the refinement. Since these data were not used in the optimization, a comparison of calculated structure factors with these data is an unbiased test of the fitting process. In particular, the cross-validation method guards against overfitting the diffraction data. The CNX and ProLSQ95 programs both contain the facility for carrying out free R value tests.
Simulated annealing is a method used in non-linear optimization problems that is particularly attractive for systems with many parameters and an energy surface containing multiple minima (Kirkpatrick et al. 1983). The basic methodology is to heat the system that is being studied so that it may overcome barriers in the energy landscape. If the system is slowly cooled, it is possible to settle into a lower energy state than could be reached by conventional minimization, which tends to become trapped in local energy minima.
Implementations of simulated annealing in the context of X-ray crystallographic refinement make use of molecular dynamics algorithms for carrying out the conformational search (Brünger et al. 1987). The most widely used computer program for X-ray refinement is the CNX program.
The simulated annealing search technique requires careful scheduling of heating/cooling cycles and phases of conventional minimizations to achieve good results (Brünger et al. 1990). The simulated annealing interface for the CNX program in Xsight has been designed to give the user considerable flexibility in scheduling the refinement run.
All the refinement programs contained within Xsight include the capability of applying the Babinet bulk solvent scattering correction to the structure factors calculated from the protein model. This scattering correction contains a maximum of two free parameters, is computationally efficient, and does not usually need to be changed as the refinement progresses (Badger 1997). By including a bulk solvent scattering correction in the refinement calculations it is possible to include all of the very low resolution data without distorting the scale factor between observed and calculated structure factors at high resolution. Refinement calculations that include the very low-resolution data with the bulk solvent scattering correction are usually more convergent than refinements from which these data are omitted.
This bulk solvent scattering correction is based on the idea that the solvent distribution in the crystal is the complement of the protein distribution. In this case, we can invoke the Babinet principle and model the solvent scattering as a correction to the calculated structure factor such that:
where the scale factor Ksol takes a theoretical value equal to the average solvent density divided by the average protein density (typically a value in the range 0.78-1.0) and the smoothing factor Bsol (typically a value in the range 100-400 Å) is applied to remove the protein's structural features from the solvent scattering. For refinements using the RotLSQ and ProLSQ95 programs, the values for Ksol and Bsol should be explicitly entered through the Xsight interface. (Setting Ksol to zero is equivalent to running the refinement without bulk solvent scattering correction.) The scripts written by the Xsight interface to the CNX program include a macro that automatically determines an optimal value of Ksol by carrying out an R-factor search over a physically reasonable range of values.
The version of the least-squares minimization program ProLSQ95 that is implemented in Xsight contains the facility for calculating structure factors and derivatives by direct summation methods or by using FFT methods. The FFT methods are normally an order of magnitude faster at the cost of an almost negligible loss in accuracy. Crystallographic computations using the FFT method have been described by Ten Eyck (1977). The algorithms described by Agarwal (1978) are used to calculate most of the quantities needed for the least-squares refinement. The version of ProLSQ95 with Xsight contains a space-group-specific FFT routine for the very common space group P 21 21 21; calculations for other space groups are carried out in P1 with the appropriate symmetry expansion.
Traditionally, the ProLSQ95 program requires you to restart the program for every cycle of refinement. You also have to determine the damping factor to be applied to the shifts for each cycle. We have modified the submission procedure such that it is now possible to specify the total number of steps of refinement desired. Furthermore, the damping factor will be automatically calculated by the program.
ProLSQ95 uses the Marquardt-Levenberg method to arrive at the shift vector from the gradient of the objective function. The shift vector, however, sometimes points to an even higher objective functional value. To remedy that, the traditional approach in ProLSQ95 is to specify a damping factor to be applied to the shift vector to get a reduced vector. If this fails to reduce the functional value, an even smaller damping factor is applied, until the functional value does decrease at the position of the reduced shift vector. This trial-and-error process of determining the damping factor has been automated in the present algorithm. The first task of the algorithm is to determine the search direction--the direction along which the damping factor should be applied and along which the system should move.
The first test is to determine the angle between the shift vector and the negative of gradient vector. If the angle is larger than 90°, it means that the shift vector is actually pointing uphill of the objective function. In this case, the search direction is then the negative gradient. If the angle is less than ninety degrees, the search vector will be the shift vector.
The second task is determining the damping factor. This is done through polynomial interpolation. The value (P1) at the full search vector is calculated. If it is lower than the starting value (P0), then the position at the full search vector becomes the new starting position for another cycle of ProLSQ95. If, however, P1 is greater than P0, then a binomial is constructed from P1, P0 and the search vector, and a damping factor, d1 is determined from the minimum of the binomial. Next, we evaluate the objective function at the position given by d1* search vector, P2. If P2 is smaller than P0, then this position is adopted, otherwise, we use P2, P1, P0, and the search vector for a cubic fit whose minimum would be the next trial position. If the functional value is smaller than P0, this position is adopted, if it is still higher, this search direction is abandoned in favor of a reduced negative gradient. The whole polynomial fitting procedure is then repeated for this new direction. The use of the negative gradient guarantees, in most practical cases, that the new position will have a lower functional value than the starting one.

Loops present unique challenges for model building as they, unlike regular secondary structures, have no well defined set of backbone torsion angles. In addition, they are more likely to be conformationally disordered. One commonly used technique is to employ the C
positions to screen a structure database for optimal geometry matching. This approach, while quite successful, does not address the possibility of new loop conformations. Also, in certain implementations, the ends of the loops found do not necessarily join smoothly with the rest of the molecule.
We present here an algorithm which is designed to generate loops de novo with ends that match well with the rest of the molecule, while providing optimal fit to any combinations (not necessarily complete) of C
and/or C
positions.
The algorithm comprises two parts. The first part is concerned with the generation of random loop conformations for which we use the random tweak method. The second part is the Monte Carlo driver which performs a simulated annealing procedure using a Metropolis scheme to locate optimal loops compatible with the input C
and C
positions. These two parts are addressed below.
The loop generation method is one due to Shenkin et al. (1987). The loop is defined to be any conformation which satisfies predefined target positions of C
, C, and N of the beginning and ending residues. The precise closure condition of the loop is defined as a set of four distances between the two atoms from the first residue and two atoms from the end residue. The loop is considered closed if the four distances of the conformation match those of the input distances.The first step of a loop generation is to generate a backbone conformation with random
's and
's', starting from the first residue of the loop. Of course, in general, the loop will not meet the closure condition.To close the loop, the algorithm then attempts to vary the intervening
's and
's to minimize the difference between the closure distances of the current conformation and those of the target (see Shenkin et al. for more details).This algorithm displays very desirable sampling properties and imposes minimal CPU demands, and is thus ideally suited for implementation into a Monte Carlo algorithm where a vast number of conformations need to be sampled efficiently.
The purpose of a Monte Carlo algorithm is to locate the optimal loop conformation consistent with the input C
and C
positions. The first step is creating an objective function which the algorithm will minimize. The objective function in this application captures the discrepancy between the current and target CA and CB positions, and the steric violations. Thus the optimal fit is achieved for the conformation that minimizes the objective function. More precisely, the objective function P, is defined as
where Pvdw is the nonbonded steric violation contribution while Ppositions measures the fit to the target C
and C
positions.
= 0 otherwise.
The sum is over a pair of atoms in the loops that are separated by more than two bonds. rdwi and rdwj are the vdw radii of the i-th atom and the j-th atom, whose distance is dist(i,j). and
1 (1.0 - exp(-a*dista(i,j))) + wgtb*
2 (1.0 - exp(-
a*distb(p,q)))
where
1 is over the index i, which runs over the number of C
inputs, while the index j runs over the C
positions of the current loop. a is an adjustable parameter. dista(i,j) is the squared distance between the i-th input C
atom position and the j-th C
atom in the loop. Similarly,
2 runs over the indices pointing to the input C
positions and the current C
positions in the loop respectively. distb(i,j) is the squared distance between the p-th input C
atom position and the q-th C
atom in the loop. wgtb is a parameter which forms a relative weighting between the C
and C
term.The main feature of the function Ppositions is that it does not require knowledge of the sequence order regarding the C
and C
positions, nor does it require the set of input C
and/or C
to be complete. It simply captures the agreement between the current C
and C
positions and a set (non-ordered) of input C
and C
positions. This feature gives the present algorithm the flexibility to handle diverse situations involving gaps and uncertainties in alignment.
The objective function discussed above forms the basis for deciding whether a loop conformation is to be accepted or rejected. In the Metropolis scheme, the Boltzmann probability
is compared with a random number between 0 and 1. P is the value of the objective function of the current conformation, Pold is the value corresponding to the last accepted conformation and T is the "temperature" of the annealing. If Prob is larger than the random number, then the current conformation is accepted, and new conformations will be generated starting from the
and
of this accepted conformation. The temperature typically starts out at a reasonable high value that allows most (>80%) of the conformations to be accepted, and then gradually cools to 0, in which case, only conformations that reduce the objective function would be accepted. To facilitate convergence, as T cools, one also decreases the number of
's and
's to be varied as well as the ranges of the variation. In addition, the variable a in Ppositions also scales up as the temperature cools. This also aids in overcoming local barriers.
It is worth pointing out that the Monte Carlo procedure usually gives many distinct acceptable conformations. This is to be expected, especially in cases where there are gaps in the input positions and thus ambiguities can result. What one hopes to gain from the Monte Carlo procedure is not so much the final unique answer, but a range of likely candidates which can then form the basis of further refinement.
Due to dynamic and static disorder, loop regions are usually relatively poorly defined in X-ray structures. The electron densities in these regions tend to be weak or nonexistent. Even in cases where the density is above noise, it is still quite cumbersome to model loops as they do not easily fall into a pattern as does the regular secondary structure. Therefore, it is desirable to have a method that can generate possible loop conformations in an automated fashion, such that in case of significant disorder or heterogeneity, one can still gain some insight into the possible conformations of the loops concerned, and otherwise help to eliminate the tedium of model building a loop. Monte Carlo (MC) techniques are ideally suited to this kind of task. Thus while sampling vast numbers of possibilities, MC will also bias systematically towards ultimately finding the best solution. We have developed a MC procedure that will sample loop conformations and optimize them against structure factors.
The algorithm has two parts. The first part is the generation of loop conformations. The second part is the Monte Carlo driver which uses the Metropolis scheme for performing simulated annealing.
After the loop has been generated, the criterion for rejecting this loop or not is based on the discrepancy of the calculated and the observed structure factors. The objective function which captures this discrepancy is
where Pvdw is the nonbonded steric violation while Pstruct is the penalty for the difference between observed and calculated structure factors.
(dist(i,j) - (rvdwi + rvdwj)) = 4 for dist(i,j) <rvdwi+rvdwj =
0 otherwise.
The sum is over any pair of atoms (at least one of which is from the loop) that are separated by more than 2 bonds. rvdwi and rvdwj are the vdw radii of the i-th and j-th atoms respectively, whose distance is dist(i,j).
1/sigma (|Fobs| - |Fcalc|)2
The sum is over all the input observed reflections Fobs. Sigma is the standard deviation as input into ProLSQ95. Fcalc is the structure factor calculated for the current molecular conformation.
After the objective function has been constructed, we can then use the Metropolis scheme to drive the simulated annealing.The Boltzmann probability
is compared with a random number between 0 and 1. P is the value of the objective function of the current conformation, Pold is the value corresponding to the last accepted conformation and T is the "temperature" of the annealing. If Pold is larger than the random number, then the current conformation is accepted, and new conformations will be generated starting from the
and
of this accepted conformation. The temperature typically starts out at a reasonable high value that allows most (>80%) of the conformations to be accepted, and then gradually cools to 0, in which case, only conformations that reduce the objective function would be accepted. To facilitate convergence, as T cools, one also decreases the number of
's and
's to be varied as well as the ranges of the variation.
It is important to point out that the Monte Carlo procedure usually gives many distinct acceptable conformations. This is to be expected, since the potential surface given by P should have many low lying minima. What one hopes to gain from the Monte Carlo procedure is not so much the final unique answer but more a range of likely candidates which can then form the basis of further refinement.