| Combi-Chem |

Definition of metric spaces
When working with classical descriptors, three options may be used to calculate the distance between molecules in property space.
Definition of metric spaces
[ (Xik - Xjk)2 ]
[ |Xik - Xjk| ]
[ |Xik - Xjk| / ( |Xik| + |Xjk| ) ]
Example:
Na:
1 0 1 1 0 0 1 0 1 1 0 = 6Nb:
1 1 0 1 1 1 0 1 0 1 0 = 7Nab:
1 0 0 1 0 0 0 0 0 1 0 = 3Tab =
3/10Hab =
7Dab =
6/13Cab =
3/sqrt(42)

Molecular diversity descriptors
Introduction
The appropriate model descriptors to apply to a diversity problem are determined by several factors. For a large combinatorial library, practical considerations may dictate choosing descriptors that are rapid to calculate (for example, most 2D properties). This most likely occurs in the design of libraries for initial HTS rounds. It may also be appropriate to calculate the properties of the library R-group substituents independently (these groups typically consist of up to a few thousand entries) rather than calculating the properties of the complete library members after generation (potentially millions of entries). When the number of models is reduced to a practical number (typically up to a few thousands or tens of thousands), 3D properties (which are more computationally demanding) can then be calculated and included in the analysis.
Information-content indices are based on application of information-theory equations to classes defined by graph theory concepts (Bonchev 1983, References). They differentiate the models according to their degree of complexity relative to atomic composition, topology, and bonding patterns. Electronic descriptors include total charge, dipole moment, and the sum of the atomic polarizabilities. Descriptors that describe the hydrophobic nature of the molecules include an estimation of the water/octanol partition coefficient (AlogP) and molar refractivity (Ghose & Crippen 1985; Viswanadhan et al. 1989, References). A set of spatial (3D) descriptors (which depend on the conformations of the model) includes volume, surface area, moment of inertia, radius of gyration, shadow indices (Rohrbaugh & Jurs 1987, References) and descriptors based on mapping atomic partial charges on the solvent-accessible surface area of the model (Stanton & Jurs 1990; Stanton et al. 1991, References).
The optimum set of model descriptors to be used in the selection of the best fragments or side chains for a combinatorial library most likely depends on the specific ligand-receptor system that is the target of the library. Descriptors that are relevant in some systems and help differentiate active from inactive compounds may be irrelevant in other systems.
Graph-theoretic descriptors
The calculation of all of the graph-theoretic descriptors included here is ultimately based on representations of model structures as graphs, where atoms are represented by vertices and covalent chemical bonds by edges.
Please refer to the References list for explanations of graph-theoretic terms and symbols used in the descriptor definitions below.
|
Multiple bonds, if any, are treated as single edges in all descriptor definitions unless specifically mentioned otherwise.
|
Topological descriptors that do not distinguish heteroatoms
The Wiener index (W)
The Wiener index is the sum of the chemical bonds existing between all pairs of heavy atoms in the model. In graph-theoretical terms, this is the sum of lengths of minimal paths between all pairs of vertices representing heavy atoms. This is easily seen to be equal to half the sum of all D-matrix entries (Wiener 1947, Müller et al. 1987, References):
Eq. 1
The Zagreb index (Zagreb)
The Zagreb index is defined as the sum of the squares of vertex valencies (Bonchev 1983, References):
Let M be the number of edges in the graph. For any integer k, define p(k) to be the number of ways of choosing k nonadjacent edges from the graph. Note that p(k) is zero for k > [M/2] since there is no set of k nonadjacent edges in a graph of M edges if k > [M/2].
Eq. 3
with the convention that p(0) = 1.
Eq. 4 Z(G) = Z(H) + Z(K)
The recursion simplifies the graph until one or both of H and K are empty graphs, for which the index is defined as:
Eq. 5 Z(empty graph) = 1
A handy shortcut for graphs consists of disjoint subgraphs (see the example calculation of Z for benzene). If G consists of disjoint subgraphs H and K, then:
Eq. 6 Z(G) = Z(H) Z(K)
Example
Calculate the Hosoya index of benzene. The hydrogen-suppressed graph representing benzene is a hexagon:
Kier and Hall molecular connectivity index ( )
The Kier and Hall molecular connectivity index, originally defined by Randic (1975, References) and subsequently refined by Kier and Hall (1976, References), is a series of numbers designated by order and subgraph type.
The four subgraph types are: path, cluster, path/cluster, and chain. These types emphasize different aspects of atom connectivity within a model--the amount of branching, ring structures present, and flexibility. Here we refer to these subgraph types as P, C, PC, and CH, respectively. They are defined as follows:
Definition:
Given a connected subgraph G:1. If G contains a cycle it is of type CH (chain).
2. If all vertex valencies of G (valencies with respect to G, not the entire graph) are either >2 or =1, G is of type C (cluster).
3. If all vertex valencies (as in 2) are either =2 or =1, G is of type P (path).
4. G is of type PC (path/cluster), which means that valencies >2, =2, and =1 are all present.
a. Subgraphs of order 0 are assigned the class P (path).
|
The molecular connectivity index of order n corresponding to subgraph type s is denoted n
s.
Eq. 7
Where n = number of subgraph vertices.
Eq. 8
Kier & Hall subgraph count
index (SC)
The Kier and Hall subgraph count index is the number of subgraphs of a given type and order (Kier & Hall 1976, References). (See Kier and Hall molecular connectivity index (
) for definitions.)
Kier's shape indices ( n (n = 1, 2, 3))
The Kier's shape indices compare the model graph with "minimal" and "maximal" graphs, where the meaning of "minimal" and "maximal" depends on the order n. This is intended to capture different aspects of model shape.
1 encodes the count of atoms and the presence of cycles
Eq. 9
where P is the number of edges in the graph (edges are paths of length 1, hence the subscript on the 1), Pmax is the number of edges in the maximal graph--namely N(N - 1)/2. Pmin is the number of edges in the minimal graph, namely N - 1.
1 = 2PminPmax/P2
Eq. 10
Order 2:
1=N(N-1)2/P2
2 encodes the branching. P, Pmin, and Pmax now denote the number of paths of length 2 in the corresponding graphs. The maximal graph is taken to be the star graph in which all atoms are adjacent to a common atom. Thus Pmax = (N - 1) (N - 2)/2. The linear graph is again taken as the minimal graph, so Pmin = N - 2.
Eq. 11 k2 = (N - 1) (N - 2)2 / P2
Order 3:
Eq. 9 is adjusted by another factor of 2 -- in the words of the index designer -- "to bring the values into rough equivalence with the other kappa values" (Kier 1990, Hall & Kier 1991, References):
Eq. 12
3 = (N - 1) (N - 3)2 / P2 for N odd
3 = (N - 3) (N - 2)2 / P2 for N even
Topological descriptors that distinguish heteroatoms
Kier & Hall valence- modified
connectivity index
(
v)
The Kier and Hall valence-modified connectivity index is a refinement of the molecular connectivity index (see Kier and Hall molecular connectivity index ( ) for definitions) where a vertex subgraph valence is enhanced to v to take into account electron configuration of the atom represented by the vertex:
Eq. 13
where Zv is the number of valence electrons in the atom, Z is its atomic number, and h is the number of hydrogens bound to it. The formula on the right side is designed to reproduce the unmodified molecular connectivity index for saturated hydrocarbons for which
v = (Zv - h) / (Z - Zv - 1)
v =
. However,
v distinguishes multiple from single bonds. The denominator introduces further distinction between element rows due to the presence of the atomic number Z (Kier & Hall 1976, 1985, References).
Kier's alpha-modified
shape indices ( 
n (n = 1,
2, 3))
The Kier's alpha-modified shape indices are refinements of the shape index (see Kier's shape indices (
n (n = 1, 2, 3))) that take into consideration the contribution that covalent radii and hybridization states make to the shape of the molecule. The indices
an are defined by Eq. 10 through Eq. 12, with the atom count N replaced by the modified atom count N +
, and with the modifier defined as:
Eq. 14
where the summation is over all heavy atoms in the model. Here, ri is the radius of the ith heavy atom and rCsp3 is the radius of the sp3 carbon (taken to be 0.77 Å in this implementation). In this calculation these atoms are considered heavy: C, N, O, F, P, S, Cl, Br, and I (Kier 1990, Hall & Kier 1991, References).
Molecular flexibility index (
)
Eq. 15
where N = number of vertices (Hall & Kier 1991, References).
=
1
2
/N
The Balaban indices (JX and JY)
At this stage the contributions based on heteroatom electronegativities and heteroatom covalent radii are included by modifying the si values above. The modifiers are two-parameter approximations of electronegativities and covalent radii relative to those of carbon. Exact formulas used in the index calculations are:
For relative electronegativity:
Given the values of X and/or Y for each vertex, the numbers si are adjusted as follows:
|
The denominator M - N + 2 is really (number of cycles plus 1) (by the Euler formula) and serves as a normalization against the number of rings present in the molecule.
|
Information-theoretic descriptors
In this approach molecules are viewed as structures that can be partitioned into subsets of elements that are in some sense equivalent. The notion of equivalence depends on the particular descriptor. Consider a partition of a set of N elements into k subsets each consisting of Nk elements:
This degree of uncertainty can be also expressed by the entropy:
The partition P, the probabilities pi, and the mean quantity of information H form the pattern of calculation for all the information-theoretic descriptors.
Information of atomic composition index (IAC-mean, IAC-total)
The atoms in the molecule are partitioned into equivalence classes corresponding to their atomic numbers. The partition then yields the descriptor IAC-mean as the mean quantity of information H as defined above.
Information indices based on the A-matrix
The two information indices in this category are:
Example
Vertex adjacency/equality
The A-matrix consists of zeros and ones, so the partitioning consists of two classes:
Therefore:
k = the valence of the vertex (there is one ordered pair (mj, nj) for each neighboring vertex vj).
And for every j = 1, ... , k:
The index corresponding directly to this partition is the index IC (information content).
The following indices are normalizations of IC:
Path
An alternating sequence of vertices v and edges e beginning and ending with vertices in which each edge is adjacent to the two vertices immediately preceding and following it.
Minimal path
A path of minimal length among all paths connecting a given vertex pair.
i.
|
A total of 10 descriptors are calculated in this set:
1. Area of the molecular shadow in the xy plane (Sxy).
2. Area of the molecular shadow in the yz plane (Syz).
3. Area of the molecular shadow in the xz plane (Sxz).
4. Fraction of area of molecular shadow in the xy plane over area
of enclosing rectangle (Sxy,f).
5. Fraction of area of molecular shadow in the yz plane over area
of enclosing rectangle (Syz,f).
6. Fraction of area of molecular shadow in the xz plane over area
of enclosing rectangle (Sxz,f).
7. Length of molecule in the x dimension (Lx).
8. Length of molecule in the y dimension (Ly).
9. Length of molecule in the z dimension (Lz).
10. Ratio of largest to smallest dimension (
).
------------------------------------------------------------------------
Descriptors based on partial charges mapped on surface area
This set of descriptors (Stanton & Jurs 1990, References) combines shape and electronic information to characterize the molecules. The descriptors are calculated by mapping atomic partial charges on solvent-accessible surface areas of individual atoms. A total of 30 different descriptors is included in the set:
30. Total molecular solvent-accessible surface area (SASA).
2D and 3D fingerprints metrics
Cosine-coefficient diversity
The cosine-coefficient diversity metric is based on Turner, D. B.; Tyrrel, S. M.; Willett, P. J. Chem. Inf. Comput. Sci. 37, 18-22 (1997).
Eq. 23
Simi,j is the cosine-coefficient similarity between molecules i and j, the sum is over all pairs of molecules, and N is the number of molecules.

Data-reduction and -visualization techniques
Principal component analysis
Principal component analysis (PCA) is a data-reduction technique commonly used to relieve redundancies in the description of molecules. The transformation makes linear combinations of the original variables that are orthogonal to each other.
For more information on principal component analysis please consult the relevant references (Dillon & Goldstein 1984; Everitt & Dunn 1992, References).
Descriptor and component scaling
Descriptor scaling
Descriptors should be scaled appropriately so that they are given the same chance to contribute in the statistical analysis. In PCA, for example, in the absence of any scaling, descriptors that have large absolute values with large variance would overwhelm the analysis. By the same token, descriptors with small absolute values (and therefore small absolute variance) would make almost no contribution to the overall variance. Such conditions would bias the analysis towards specific descriptors and produce a similarly biassed set of principal components.
Component scaling
After principal components have been derived, they can also be scaled in several ways (none, mean centering, unit variance, mean centering and unit variance, and range) for the visualization and diversity analysis procedures. One could argue that the absolute amount of variance explained by each component is representative of the dataset and should not be further scaled for visualization or diversity analysis. This would be true if the descriptors in the set used in the analysis had little correlation with each other. Since this seldom occurs, we suggest that proper scaling be applied to the principal components themselves.
Factor analysis
In factor analysis (FA), the original variables are described as linear combinations of presumed common factors (CFs). Since the number of common factors is presumed to be smaller than the number of original variables, the linear problem is thus overdetermined. To make the problem solvable, an error term is allocated.
For more information on factor analysis, please consult the relevant references (Dillon & Goldstein 1984; Everitt & Dunn 1992; Harman 1976, References).
Multi-dimensional scaling
To visualize the diversity of the set of models being analyzed, we represent them in 3D using classical multidimensional scaling (MDS) techniques (Everitt & Dunn 1992, References). The intermolecular distance between each pair of models in N-dimensional property space is given by:
Eq. 24
where Dij2 is the distance between models i and j, and the sum is over all the model descriptors k = 1, P.
In most of the data sets we have analyzed, this 3D representation of the models is quite good, since the three largest eigenvalues typically cover 80% or more of the variance. In practice, if all the intermolecular distances in the data set are Euclidean, then MDS is equivalent to principal component analysis (PCA) of the information matrix X*X (Everitt & Dunn 1992, References).
Fast MDS
A faster method of performing MDS with selected similarity coefficients is also available. This method is based on the observation that in certain cases (described below) the metric matrix can be factored into a product XX* of a matrix X and its transpose, thus the eigenvalues and the eigenvectors MDS evaluates may be calculated from the matrix X*X, a potentially smaller matrix.
| Dataset | Number of Rows | Number of Properties | Standard MDS | Fast MDS |
|---|---|---|---|---|
| dipep_400_fp | 400 | 166 | 10 sec | 2 sec |
| benzo_5760_fp | 5760 | 166 | --- | 18 sec |
| benzo_5760_fp | 5760 | 960 | --- | 6 min 58 sec |

Diversity metric functions
Distance-based functions
Having calculated the various descriptors and, optionally, their principal components, the next step is to select diverse subsets, that is, subsets of fragments or models that efficiently span the property space of the system. The main approaches that have been taken to date for selecting diverse subsets are D-optimal design (Federov 1972; Martin et al. 1995, References) and various clustering methods (Willet 1995, References). D-optimal design can be computationally intensive for datasets composed of large numbers of models and descriptors and suffers from the weakness that it tends to select outliers while leaving regions near the center or mean values of the data set unsampled, particularly where the number of models far exceeds the dimensionality of the property space. Some clustering methods can be applied to larger datasets, but they often suffer from sensitivity to the presence or absence of neighbors to a given point and from the limitation of having to select one model from each cluster.
Consequently, a set of diversity methods has been specifically designed for the C2·Diversity module, which select diverse subsets of models and which do not suffer from these problems. They include functions of the intermolecular distances in N-dimensional property space (or, alternatively, in the space represented by the principal components) that, when optimized, produce subsets of high diversity. These diversity functions are called the MaxMin, PowerSum, Product, and Max MinSpanTree functions:
Eq. 25
Eq. 26
Eq. 27
Eq. 28
where a is a parameter (square root of the "Gaussian alpha" parameter), and Di is the length of edge i in the minimum spanning tree. The sum is taken over all the edges in the tree.
The MaxMin function maximizes the minimum squared distance from each point to all other points in the selected subset of models. The PowerSum function maximizes the inverse of the sum of the reciprocals of the squares of all intermolecular distances between selected points. The Product function maximizes the product of the squares of the intermolecular distances. The value of the MaxMin function is not explicitly dependent on the number of models in the subset. The PowerSum and Product functions are normalized by the number of distances in the subset, to allow comparisons of subsets having different numbers of models.
This function has excellent behavior with regard to both selecting models that tend to uniformly sample property space and also properly treating redundant (or nearly redundant) models in the sample set, especially in selecting subsets with combinatorial selection.
Cell-based functions
These metrics are used for R-group subsetting, which provides the optimum choice of R-groups based on the diversity of the products. In this process, the combinatorial constraint restricts the choice of products to those that can be acheived through a strict matrix of reagents. Several metrics are available as a target function to evaluate the coverage of resulting sub-libraries:
A Monte Carlo algorithm is used to optimize the diversity functions. Optimization entails:
Stochastic optimization procedures
Specific examples of the application of the techniques for visualizing and optimizing model diversity that are described in this section can be found in Hassan et al. (1996, References).
The stochastic optimization technique is fast relative to other methods (such as D-optimal design) and can be applied to large datasets. The CPU requirements are proportional to the product of the number of model descriptors used (or the number of principal components) and the number of models to be selected set. The number of steps required for convergence of the stochastic optimization increases with the total number of ways of selecting (i.e., the total number of combinations of) n models out of a set of M models: M!/(n! (M - n)!). The memory requirements are determined mostly by the size of the X matrix, where the number of rows equals the total number of models in the set, and the number of columns equals the number of model descriptors.
Reagent penalty terms are selected using the Reagent Penalties entry in the Restraints menu, available from the LIBRARY ANALYSIS card in the COMBI-CHEM I deck (for more information on reagent penalties, see Specifying reagent penalty restraints).
Functional form of the reagents penalty function
Any combination of five terms can be defined in the Reagents Penalty Function:
1. Minimize number of different reagents.
2. Minimize number of different suppliers.
3. Minimize total monetary cost of the reagents.
4. Minimize sum of supplier penalties.
5. Minimize sum of reagent penalties.
Each of the penalty functions above has an associated user specified penalty. Suppose that of three molecules in the library subset, two use reagent X and the third uses reagent Y. Say the cost of reagent X is 1, and the cost of reagent Y is 2. There are then two types of penalties associated with reagent cost. The first is simply the sum of the costs for each molecule:
Eq. 29
where NPen is the number of penalty terms (from 1 to 5) that are being used in a particular run.
Eq. 30
where Nreagents is the number of different reagents selected in the subset and Nmax is the maximum number of different reagents that can be used, which is equal to the minimum of:
Penalty based on number of different suppliers (P2)
Eq. 31
where Nsuppliers is the number of different suppliers selected in the subset and Nmax is the maximum number of different suppliers that can be used, which is equal to the minimum of:
Penalty based on total monetary cost (P3)
Eq. 32
where the Maximum reagent cost is the cost of the most expensive reagent among the available reagents.
| Reagent | Cost | Reagent Penalty | Supplier Penalty |
|---|---|---|---|
| reagent1 | 1.0 | 0 | 0 |
| reagent2 | 2.0 | 2 | 0 |
| reagent3 | 7.0 | 4 | 0 |
| reagent4 | 10.0 | 8 | 5 |
| reagent5 | 20.0 | 3 | 5 |
...and a given subset has 3 reagents:
Consider only presence/absence of reagent checked:
Consider only presence/absence of reagent unchecked:
P3 has a minimum value of 0 (Average cost of reagents in the subset = 0) and a maximum value of 1 (all reagents used have a cost equal to the Maximum reagent cost).
Penalty based on reagent penalties (P4)
Eq. 33
The Maximum reagent penalty is the maximum possible value for a reagent penalty, which we have set to 10.
Consider only presence/absence of reagent checked:
Consider only presence/absence of reagent unchecked:
P4 has a minimum value of 0 (Average reagent penalty of reagents in the subset is 0) and a maximum value of 1 (all reagents used have a reagent penalty equal to the Maximum reagent penalty).
Penalty based on supplier penalties (P5)
Eq. 34
The Maximum supplier penalty is the maximum possible value for a reagent penalty, which we have set to 10.
Consider only presence/absence of reagent checked:
Consider only presence/absence of reagent unchecked:
P5 has a minimum value of 0 (Average supplier penalty of supplier in the subset is 0) and a maximum value of 1 (all reagents used have a reagent penalty equal to the Maximum supplier penalty).

The 3D visualization of the models in property space allows an assessment of the effectiveness of the descriptors in differentiating the models and provides insight into how different sets of descriptors result in the selection of different subsets of diverse models.