MSI Product Previous Next Contents Index Top
Combi-Chem



5       Theory

This section describes the theoretical basis of strategies implemented in the Cerius2 Combinatorial Chemistry Software Suite for visualizing and optimizing the structural diversity of sets of models. It includes:

Definition of metric spaces

Molecular diversity descriptors
Introduction

Topological descriptors that do not distinguish heteroatoms
Topological descriptors that distinguish heteroatoms
Information-theoretic descriptors
Information of atomic composition index (IAC-mean, IAC-total)
Information indices based on the A-matrix
Information indices based on the D-matrix
Information indices based on the E-matrix and the ED-matrix
Multigraph information content indices (IC, BIC, CIC, SIC)

Descriptors based on projections of the molecular surface (shadow indices)
Descriptors based on partial charges mapped on surface area

Graph-theoretic descriptors
Data-reduction and -visualization techniques

Principal component analysis
Descriptor and component scaling
Factor analysis
Multi-dimensional scaling

Diversity metric functions

Distance-based functions
Cell-based functions

Stochastic optimization procedures
Functional form of the reagents penalty function
Conclusions


Definition of metric spaces

When working with classical descriptors, three options may be used to calculate the distance between molecules in property space.

When fingerprints are used as descriptors in clustering applications, various definitions of similarity and distance can be used:

Where: Nab is the number of bits models a and b have in common, Na and Nb are the numbers of bits set in a and b, respectively, and sqrt designates the square root.

Example:

Na: 1 0 1 1 0 0 1 0 1 1 0 = 6

Nb: 1 1 0 1 1 1 0 1 0 1 0 = 7

Nab: 1 0 0 1 0 0 0 0 0 1 0 = 3

Tab = 3/10

Hab = 7

Dab = 6/13

Cab = 3/sqrt(42)

Note


Molecular diversity descriptors

Introduction

The appropriate model descriptors to apply to a diversity problem are determined by several factors. For a large combinatorial library, practical considerations may dictate choosing descriptors that are rapid to calculate (for example, most 2D properties). This most likely occurs in the design of libraries for initial HTS rounds. It may also be appropriate to calculate the properties of the library R-group substituents independently (these groups typically consist of up to a few thousand entries) rather than calculating the properties of the complete library members after generation (potentially millions of entries). When the number of models is reduced to a practical number (typically up to a few thousands or tens of thousands), 3D properties (which are more computationally demanding) can then be calculated and included in the analysis.

The Cerius2 Combinatorial Chemistry Software Suite provides methods for library characterization through its ability to calculate a broad set of 2D and 3D properties, including topological and information-content indices and electronic, thermodynamic, and geometric descriptors.

Topological indices are 2D descriptors based on graph theory concepts (Kier & Hall 1976, 1986; Katritzky & Gordeeva 1993, References). These indices are widely used in QSPR and QSAR studies. They help to differentiate the models mostly according to their size, degree of branching, flexibility, and overall shape.

Information-content indices are based on application of information-theory equations to classes defined by graph theory concepts (Bonchev 1983, References). They differentiate the models according to their degree of complexity relative to atomic composition, topology, and bonding patterns. Electronic descriptors include total charge, dipole moment, and the sum of the atomic polarizabilities. Descriptors that describe the hydrophobic nature of the molecules include an estimation of the water/octanol partition coefficient (AlogP) and molar refractivity (Ghose & Crippen 1985; Viswanadhan et al. 1989, References). A set of spatial (3D) descriptors (which depend on the conformations of the model) includes volume, surface area, moment of inertia, radius of gyration, shadow indices (Rohrbaugh & Jurs 1987, References) and descriptors based on mapping atomic partial charges on the solvent-accessible surface area of the model (Stanton & Jurs 1990; Stanton et al. 1991, References).

The optimum set of model descriptors to be used in the selection of the best fragments or side chains for a combinatorial library most likely depends on the specific ligand-receptor system that is the target of the library. Descriptors that are relevant in some systems and help differentiate active from inactive compounds may be irrelevant in other systems.

Graph-theoretic descriptors

The calculation of all of the graph-theoretic descriptors included here is ultimately based on representations of model structures as graphs, where atoms are represented by vertices and covalent chemical bonds by edges.

These descriptors fall into two categories:

All these descriptors are based on hydrogen-suppressed graphs; that is, there are no vertices corresponding to hydrogens and no edges corresponding to bonds connecting hydrogens to other atoms.

Please refer to the References list for explanations of graph-theoretic terms and symbols used in the descriptor definitions below.

Note

Multiple bonds, if any, are treated as single edges in all descriptor definitions unless specifically mentioned otherwise.

Topological descriptors that do not distinguish heteroatoms

The Wiener index (W)

The Wiener index is the sum of the chemical bonds existing between all pairs of heavy atoms in the model. In graph-theoretical terms, this is the sum of lengths of minimal paths between all pairs of vertices representing heavy atoms. This is easily seen to be equal to half the sum of all D-matrix entries (Wiener 1947, Müller et al. 1987, References):

Eq. 1            

The Zagreb index (Zagreb)

The Zagreb index is defined as the sum of the squares of vertex valencies (Bonchev 1983, References):

Eq. 2            

The Hosoya index (Z)

Let M be the number of edges in the graph. For any integer k, define p(k) to be the number of ways of choosing k nonadjacent edges from the graph. Note that p(k) is zero for k > [M/2] since there is no set of k nonadjacent edges in a graph of M edges if k > [M/2].

The Hosoya index (Z) is the sum of all (nonzero) p(k):

Eq. 3            

with the convention that p(0) = 1.

It is a moderately easy exercise in graph theory to prove that Eq. 3 can also be given in terms of the following recursion (implemented in C2·Diversity). Let G be the graph for which the Hosoya index Z(G) is to be calculated. Remove an edge from G and denote the resulting graph as H. Again, remove the same edge from G, this time removing all the edges adjacent to it as well. Denote the resulting graph as K. Then the following is always true:

Eq. 4             Z(G) = Z(H) + Z(K)

The recursion simplifies the graph until one or both of H and K are empty graphs, for which the index is defined as:

Eq. 5             Z(empty graph) = 1

A handy shortcut for graphs consists of disjoint subgraphs (see the example calculation of Z for benzene). If G consists of disjoint subgraphs H and K, then:

Eq. 6             Z(G) = Z(H) Z(K)

Example

Calculate the Hosoya index of benzene. The hydrogen-suppressed graph representing benzene is a hexagon:

Begin the recursion by removing, say, the right vertical edge and the edges connected to it:

Continue with the first term in a similar manner: remove another vertical edge from it:

Both terms on the right side are disjoint graphs, each consisting of two identical subgraphs.

Thus:

The calculation is almost complete. For a graph G consisting of one edge, the corresponding H and K graphs are both empty, therefore Z( - ) = Z(empty graph) + Z(empty graph) = 1 + 1

Also:

The Hosoya index of benzene is thus: Z(benzene) = 3 x 3 + 2 x 2 + 5 = 18

The index displayed in the study table is the natural logarithm of Z, to handle the rapid growth of the index with model size (see Hosoya 1971; Rouvray 1987, References).

Kier and Hall molecular connectivity index ( )

The Kier and Hall molecular connectivity index, originally defined by Randic (1975, References) and subsequently refined by Kier and Hall (1976, References), is a series of numbers designated by order and subgraph type.

The four subgraph types are: path, cluster, path/cluster, and chain. These types emphasize different aspects of atom connectivity within a model--the amount of branching, ring structures present, and flexibility. Here we refer to these subgraph types as P, C, PC, and CH, respectively. They are defined as follows:

Definition:

Given a connected subgraph G:

1.   If G contains a cycle it is of type CH (chain).

Otherwise:

2.   If all vertex valencies of G (valencies with respect to G, not the entire graph) are either >2 or =1, G is of type C (cluster).

Otherwise:

3.   If all vertex valencies (as in 2) are either =2 or =1, G is of type P (path).

Otherwise:

4.   G is of type PC (path/cluster), which means that valencies >2, =2, and =1 are all present.

Order refers to the number of edges in a subgraph. The allowable orders are 0, 1, ... , M (M = the number of edges in the entire graph).

Note

a.   Subgraphs of order 0 are assigned the class P (path).

b.   Subgraphs of order 1 and 2 are necessarily of type P only.

c.   Subgraphs of order 3 can be of type P, C, or CH only.

The molecular connectivity index of order n corresponding to subgraph type s is denoted ns.

Given an order-n and a subgraph of type s, consider all connected subgraphs of type s consisting of n edges. For each vertex vi in a subgraph, its valence ni (with respect to the entire graph) is calculated and the

partial index nP corresponding to the given subgraph is found according to:

Eq. 7            

Where n = number of subgraph vertices.

Finally, the partial indices are summed over all connected subgraphs of the requested type s (Kier & Hall 1976, Kier & Hall 1985, References):

Eq. 8            

Kier & Hall subgraph count index (SC)

The Kier and Hall subgraph count index is the number of subgraphs of a given type and order (Kier & Hall 1976, References). (See Kier and Hall molecular connectivity index () for definitions.)

Kier's shape indices ( n (n = 1, 2, 3))

The Kier's shape indices compare the model graph with "minimal" and "maximal" graphs, where the meaning of "minimal" and "maximal" depends on the order n. This is intended to capture different aspects of model shape.

Order 1:

The descriptor 1 encodes the count of atoms and the presence of cycles

relative to the minimal and maximal graphs. For N vertices, the maximal graph includes edges between all vertex pairs. For the minimal graph, a linear path of N - 1 edges connecting the vertices is taken.

The shape index of order 1 is then defined as:

Eq. 9             1 = 2PminPmax/P2

where P is the number of edges in the graph (edges are paths of length 1, hence the subscript on the 1), Pmax is the number of edges in the maximal graph--namely N(N - 1)/2. Pmin is the number of edges in the minimal graph, namely N - 1.

By inserting the formulas for Pmax and Pmin we obtain the implemented formula:

Eq. 10             1=N(N-1)2/P2

Order 2:

The descriptor 2 encodes the branching. P, Pmin, and Pmax now denote the number of paths of length 2 in the corresponding graphs. The maximal graph is taken to be the star graph in which all atoms are adjacent to a common atom. Thus Pmax = (N - 1) (N - 2)/2. The linear graph is again taken as the minimal graph, so Pmin = N - 2.

Eq. 9 thus yields:

Eq. 11             k2 = (N - 1) (N - 2)2 / P2

Order 3:

For order 3, the counts of paths of length 3 are considered, and the maximal graph chosen is a twin star (Kier 1990, References) with Pmax = (N - 1) (N - 3)/4 for N odd and Pmax = (N - 2)2/4 for N even. The minimal graph is again linear with Pmin = N - 3.

Eq. 9 is adjusted by another factor of 2 -- in the words of the index designer -- "to bring the values into rough equivalence with the other kappa values" (Kier 1990, Hall & Kier 1991, References):

Eq. 12             3 = (N - 1) (N - 3)2 / P2 for N odd
3 = (N - 3) (N - 2)2 / P2 for N even

Topological descriptors that distinguish heteroatoms

Kier & Hall valence- modified connectivity index (v)

The Kier and Hall valence-modified connectivity index is a refinement of the molecular connectivity index (see Kier and Hall molecular connectivity index ( ) for definitions) where a vertex subgraph valence is enhanced to v to take into account electron configuration of the atom represented by the vertex:

Eq. 13             v = (Zv - h) / (Z - Zv - 1)

where Zv is the number of valence electrons in the atom, Z is its atomic number, and h is the number of hydrogens bound to it. The formula on the right side is designed to reproduce the unmodified molecular connectivity index for saturated hydrocarbons for which v = . However, v distinguishes multiple from single bonds. The denominator introduces further distinction between element rows due to the presence of the atomic number Z (Kier & Hall 1976, 1985, References).

Kier's alpha-modified shape indices ( n (n = 1, 2, 3))

The Kier's alpha-modified shape indices are refinements of the shape index (see Kier's shape indices (n (n = 1, 2, 3))) that take into consideration the contribution that covalent radii and hybridization states make to the shape of the molecule. The indices an are defined by Eq. 10 through Eq. 12, with the atom count N replaced by the modified atom count N + , and with the modifier defined as:

Eq. 14            

where the summation is over all heavy atoms in the model. Here, ri is the radius of the ith heavy atom and rCsp3 is the radius of the sp3 carbon (taken to be 0.77 Å in this implementation). In this calculation these atoms are considered heavy: C, N, O, F, P, S, Cl, Br, and I (Kier 1990, Hall & Kier 1991, References).

Molecular flexibility index ()

The molecular flexibility index is a descriptor based on structural properties that restrict a molecule from being "infinitely flexible", the model for which is an endless chain of C(sp3) atoms. The structural features considered to prevent a molecule from attaining infinite flexibility are: (a) fewer atoms, (b) the presence of rings, (c) branching, and (d) the presence of atoms with covalent radii smaller than those of C(sp3). These features are encoded in the index as follows:

Eq. 15             = 1 2/N

where N = number of vertices (Hall & Kier 1991, References).

The Balaban indices (JX and JY)

The Balaban indices constitute a highly discriminating descriptor whose values do not substantially increase with model size and number of rings present. Its evaluation begins with the D-matrix modified as follows:

Having constructed the modified D-matrix the row sums are calculated:

Eq. 16            

where N is the number of vertices and i = 1, ... , N.

At this stage the contributions based on heteroatom electronegativities and heteroatom covalent radii are included by modifying the si values above. The modifiers are two-parameter approximations of electronegativities and covalent radii relative to those of carbon. Exact formulas used in the index calculations are:

For relative electronegativity:

Eq. 17             X = 0.4196 - 0.0078 i + 0.1567 Gi

For relative covalent radius:

Eq. 18             Y = 1.1191 + 0.0160 i - 0.0537 Gi

where i is the atomic number and Gi is the (short) periodic table group number. These modifiers are used only with nonmetals: B, C, N, O, F, Si, P, S, Cl, As, Se, Br, Te, and I. For other heteroatoms the values are set at X = Y = 1.

Given the values of X and/or Y for each vertex, the numbers si are adjusted as follows:

sai = X si (for the index JX).

sai = Y si (for the index JY)

and the result is inserted in the final formula for the index:

Eq. 19            

where J equals either JX or JY (depending on the modifier type used), M is the number of edges, N is the number of vertices, and the sum is over all pairs (i, j) with adjacent vertices vi and vj (Balaban 1982, Balaban & Ivanciuc 1989, References).

Note

The denominator M - N + 2 is really (number of cycles plus 1) (by the Euler formula) and serves as a normalization against the number of rings present in the molecule.

Information-theoretic descriptors

In this approach molecules are viewed as structures that can be partitioned into subsets of elements that are in some sense equivalent. The notion of equivalence depends on the particular descriptor. Consider a partition of a set of N elements into k subsets each consisting of Nk elements:

equivalence class: 1 2 ... k

number of elements in each: N1 N2 ... Nk

N1 + N2 + ... + Nk = N

Given a partition P as above we use the notation:

P = N (N1, N2, ... , Nk)

A probability distribution can be associated with the partition:

pi = Ni/N

which is the probability for a randomly chosen element to belong to class i.

This degree of uncertainty can be also expressed by the entropy:

Hi = - lb pi (lb is the base-2 logarithm)

The mean entropy of such a probability distribution is then:

Eq. 20            

which, according to Shannon's statistical information theory (Bonchev 1983 and references therein, References), can be viewed as a measure of the mean quantity of information contained in each structure element (in bits per element).

The partition P, the probabilities pi, and the mean quantity of information H form the pattern of calculation for all the information-theoretic descriptors.

Information of atomic composition index (IAC-mean, IAC-total)

The atoms in the molecule are partitioned into equivalence classes corresponding to their atomic numbers. The partition then yields the descriptor IAC-mean as the mean quantity of information H as defined above.

The descriptor IAC-total is defined as N x IAC-mean, where N is the number of atoms in the molecule.

Information indices based on the A-matrix

The two information indices in this category are:

These indices (and several others below) are based on partitioning elements of the A-matrix according to two basic modes:

Eq. 21 and Eq. 22 in the following example illustrate these modes:

Example

Vertex adjacency/equality
The A-matrix consists of zeros and ones, so the partitioning consists of two classes:

P = N2 ( 2M, N2 - 2M)

with M equal to the number of edges (thus 2M equals the number of ones in the A-matrix) and N equal to the number of vertices (N2 - 2M is the number of zeros in the A-matrix).

Therefore:

Eq. 21            

Vertex adjacency/magnitude
Each matrix element aij is now treated as an equivalence class of aij elements. Here, each equivalence class consists of one or zero elements, so the partition is (discarding the classes of zero elements):

P = 2M(1, 1, ... , 1) (2M ones)

The index VADJ_mag is thus a rather simple one:

Eq. 22            

Information indices based on the D-matrix

Two types of indices are based on this matrix:

These descriptors are defined in exactly the same manner as the vertex adjacency indices, except that the distance matrix is used instead of the adjacency matrix.

Information indices based on the E-matrix and the ED-matrix

The indices based on these matrices are:

These are the descriptors based on the edge adjacency and the edge distance matrices, in exact analogy with those given under Information indices based on the A-matrix.

Multigraph information content indices (IC, BIC, CIC, SIC)

To each vertex v, an unordered sequence of ordered pairs is assigned: { (m1, n1), (m2, n2), ... , (mk, nk) }, called a coordinate, such that:

k = the valence of the vertex (there is one ordered pair (mj, nj) for each neighboring vertex vj).

And for every j = 1, ... , k:

Having assigned the coordinates to vertices, the partition of vertices is constructed in the usual way, where two vertices are considered equivalent if their coordinates are the same (as unordered k-tuples, i.e., the repetitions of ordered pairs are not ignored as they would be if we treated the k-tuples purely as sets).

The index corresponding directly to this partition is the index IC (information content).

The following indices are normalizations of IC:

The CIC (complementary information content) measures the deviation of IC from its maximum possible value corresponding to the partition into classes containing one element each:

ICmax = -N x (1/N) x lb(1/N) = lb(N)

and thus the CIC index is defined as (Sarkar et al. 1978, Bonchev et al. 1981, Bonchev 1983, Katritzky et al. 1993, References):

Terms

Path
An alternating sequence of vertices v and edges e beginning and ending with vertices in which each edge is adjacent to the two vertices immediately preceding and following it.

If there is a path of the form:

v0, e0, v1, e1, ... , vn - 1, en - 1, vn

then we say the vertices v0 and vn are connected by this path.

Path length
The number of edges in the path.

Minimal path

A path of minimal length among all paths connecting a given vertex pair.

Cycle
A path v0, e0, ... , vn with v0 = vn.

Connected graph
A graph is connected if for any pair of vertices there is a path connecting them.

Subgraph
A subset of the set of vertices and edges of the original graph which is itself a valid graph (namely, with each edge it contains the vertices adjacent to it).

Vertex valence
Number of edges adjacent to a vertex. The ith vertex valence equals the sum of matrix elements in the ith row (or column) of the A-matrix. It is denoted i.

Adjacency matrix (A-matrix)
A symmetric N x N matrix {aij} defined as follows:

N = number of vertices,

aij = 1 if the vertices vi and vj are connected by an edge,

aij = 0 otherwise.

Edge adjacency matrix (E-matrix)
A symmetric M x M matrix {ekl} that is "complementary" to the A-matrix and is defined as follows:

M = number of edges,

ekl = 1 if the edges ek and el share exactly one common vertex,

ekl = 0 otherwise.

Distance matrix (D-matrix)
A symmetric N x N matrix {aDij}, (N = number of vertices) where aDij is the number of edges in a path of minimal length connecting vi to vj.

Edge distance matrix (ED-matrix)
A symmetric M x M matrix {eDij}, (M = number of edges), where eDij is the number of vertices in a path of minimal length connecting ei to ej (not counting the terminal vertices of the path).

Descriptors based on projections of the molecular surface (shadow indices)

The shadow indices set of geometric descriptors helps to characterize the shape of the molecules. The descriptors are calculated by projecting the model surface on three mutually perpendicular planes, xy, yz, and xz (Rohrbaugh & Jurs 1987, References). These descriptors depend not only on conformation but also on the orientation of the model. To calculate them, the models are first rotated to align the principal moments of inertia with the x, y, and z axes.

Figure 2 . Geometric definition used to calculate shadow indices

A total of 10 descriptors are calculated in this set:

1.   Area of the molecular shadow in the xy plane (Sxy).

2.   Area of the molecular shadow in the yz plane (Syz).

3.   Area of the molecular shadow in the xz plane (Sxz).

4.   Fraction of area of molecular shadow in the xy plane over area of enclosing rectangle (Sxy,f).

5.   Fraction of area of molecular shadow in the yz plane over area of enclosing rectangle (Syz,f).

6.   Fraction of area of molecular shadow in the xz plane over area of enclosing rectangle (Sxz,f).

7.   Length of molecule in the x dimension (Lx).

8.   Length of molecule in the y dimension (Ly).

9.   Length of molecule in the z dimension (Lz).

10.   Ratio of largest to smallest dimension ().

------------------------------------------------------------------------

Descriptors based on partial charges mapped on surface area

This set of descriptors (Stanton & Jurs 1990, References) combines shape and electronic information to characterize the molecules. The descriptors are calculated by mapping atomic partial charges on solvent-accessible surface areas of individual atoms. A total of 30 different descriptors is included in the set:

1.   Partial positive surface area: sum of the solvent-accessible surface areas of all positively-charged atoms (PPSA-1)

2.   Partial negative surface area: sum of the solvent-accessible surface areas of all negatively-charged atoms (PNSA-1)

3.   Total charge-weighted positive surface area: partial positive solvent-accessible surface area multiplied by the total positive charge (PPSA-2)

4.   Total charge-weighted negative surface area: partial negative solvent-accessible surface area multiplied by the total negative charge (PNSA-2)

5.   Atomic charge-weighted positive surface area: sum of the product of solvent-accessible surface area times partial charge for all positively charged atoms (PPSA-3)

6.   Atomic charge-weighted negative surface area: sum of the product of solvent-accessible surface area times partial charge for all negatively charged atoms (PNSA-3)

7.   Difference in charged partial surface areas: partial positive solvent-accessible surface area minus partial negative solvent- accessible surface area (DPSA-1)

8.   Difference in total charge-weighted surface areas: total charge- weighted positive solvent-accessible surface area minus total charge-weighted negative solvent-accessible surface area (DPSA-2)

9.   Difference in atomic charge-weighted surface areas: atomic charge-weighted positive solvent-accessible surface area minus atomic charge-weighted negative solvent-accessible surface area (DPSA-3)

10.   Fractional charged partial surface areas: set of six descriptors obtained by dividing descriptors 1 to 6 by the total molecular solvent-accessible surface area (FPSA-1, FPSA-2, FPSA-3, FNSA-1, FNSA-2, FNSA-3)

16.   Surface-weighted charged partial surface areas: set of six descriptors obtained by multiplying descriptors 1 to 6 by the total molecular solvent-accessible surface area and dividing by 1000 (WPSA-1, WPSA-2, WPSA-3, WNSA-1, WNSA-2, WNSA- 3)

22.   Relative positive charge: charge of most positive atom divided by the total positive charge (RPCG)

23.   Relative negative charge: charge of most negative atom divided by the total negative charge (RNCG)

24.   Relative positive charge surface area: solvent-accessible surface area of the most positive atom divided by descriptor 22 (RPCS)

25.   Relative negative charge surface area: solvent-accessible surface area of most negative atom divided by descriptor 23 (RNCS)

26.   Total hydrophobic surface area: sum of solvent-accessible surface areas of atoms with absolute value of partial charges less than 0.2 (TASA)

27.   Total polar surface area: sum of solvent-accessible surface areas of atoms with absolute value of partial charges greater than or equal to 0.2 (TPSA)

28.   Relative hydrophobic surface area: total hydrophobic surface area divided by the total molecular solvent-accessible surface area (RASA)

29.   Relative polar surface area: total polar surface area divided by the total molecular solvent-accessible surface area (RPSA)

30.   Total molecular solvent-accessible surface area (SASA).

2D and 3D fingerprints metrics

Cosine-coefficient diversity

The cosine-coefficient diversity metric is based on Turner, D. B.; Tyrrel, S. M.; Willett, P. J. Chem. Inf. Comput. Sci. 37, 18-22 (1997).

The cosine-coefficient diversity metric works on 2D fingerprints and numeric descriptors and uses cosine coefficient similarity to compute the mean pairwise intermolecular dissimilarity of a set of N molecules in order(N) time, where N is the number of molecules in the subset:

Eq. 23            

Simi,j is the cosine-coefficient similarity between molecules i and j, the sum is over all pairs of molecules, and N is the number of molecules.


Data-reduction and -visualization techniques

Principal component analysis

Principal component analysis (PCA) is a data-reduction technique commonly used to relieve redundancies in the description of molecules. The transformation makes linear combinations of the original variables that are orthogonal to each other.

Given the various sets of model descriptors, some properties are often highly correlated with one another, and others are poor differentiators. Correlated properties add dimensionalities to the problem but really no new information. Therefore, it is often convenient to reduce the dimensionality of the full set of descriptors and uncover the true dimensionality of the problem with PCA. Principal components then contain most of the information that explains differences across a set of molecules and are used in diversity analysis.

Mathematical description

If we have a set of molecules characterized by p descriptors: X1, X2, ...Xp in a study table, then the principal components (PCs) are given as:

PC1 = W11 x X1 + .... Wp1 x Xp ...

PCm = W1m x X1 + ... Wpm x Xp

Where the loadings {Wi,j} are chosen so as to maximize the variance of the PCs. This is easily expressed in a matrix notation:

PC = X x W

Note

Different normalization schemes are provided to assign similar weights to all the model descriptors, including mean centering, unit-variance scaling, mean centering and unit-variance scaling, and range scaling. Enough principal components should be calculated to cover about 90% of the variance of the information matrix X*X, where X represents the matrix of descriptor values (columns) for all models (rows) and X* is the transpose of X. The number of principal components required is typically 5-10.

For more information on principal component analysis please consult the relevant references (Dillon & Goldstein 1984; Everitt & Dunn 1992, References).

Descriptor and component scaling

Descriptor scaling

Descriptors should be scaled appropriately so that they are given the same chance to contribute in the statistical analysis. In PCA, for example, in the absence of any scaling, descriptors that have large absolute values with large variance would overwhelm the analysis. By the same token, descriptors with small absolute values (and therefore small absolute variance) would make almost no contribution to the overall variance. Such conditions would bias the analysis towards specific descriptors and produce a similarly biassed set of principal components.

The default setting for PCA in Cerius2 (removal of column mean and scaling to unit variance) sets analysis conditions so that descriptors are given equal chances to contribute in the statistical analysis. Usually, descriptors scaled in this way provide the best analysis conditions.

Note

Component scaling

After principal components have been derived, they can also be scaled in several ways (none, mean centering, unit variance, mean centering and unit variance, and range) for the visualization and diversity analysis procedures. One could argue that the absolute amount of variance explained by each component is representative of the dataset and should not be further scaled for visualization or diversity analysis. This would be true if the descriptors in the set used in the analysis had little correlation with each other. Since this seldom occurs, we suggest that proper scaling be applied to the principal components themselves.

For the default set of combinatorial chemistry descriptors, several descriptors represent molecular size in various forms (MW, Vm, MolRef, Area, etc.). Since a specific molecular feature (size) is represented by different descriptors, this feature would therefore be emphasized as many times in the PCA. Therefore, principal component #1 would relate to molecular size and be emphasized (large amount of explained variance) in relation to the number of descriptors that describe this feature. Similarity and diversity selections would then be biassed towards molecular size only, with little contribution from other molecular characteristics.

For this reason, we suggest that principal components be scaled by mean and variance in similarity and diversity applications. Likewise, graphical representations should use the same scaling procedures so as to properly reflect the results of the computations.

Factor analysis

In factor analysis (FA), the original variables are described as linear combinations of presumed common factors (CFs). Since the number of common factors is presumed to be smaller than the number of original variables, the linear problem is thus overdetermined. To make the problem solvable, an error term is allocated.

If we have a set of molecules characterized by p descriptors: X1, X2, ... Xp in a study table, then the descriptors are given as:

X1 = V11 x CF1 + .... Vm1 x CFm + E1 ...

Xp = V1p x CF1 + ... Vmp x CFm + Ep

This is easily expressed in a matrix notation:

X = CF x V + E

The assumption in FA is that we can estimate reasonably well the amount of error or uniqueness (E) present in the original data. Two standard procedures have been commonly used to estimate uniqueness:

After the uniqueness has been estimated, the remaining variance (communality) should be expressed by the common factors. In comparison with PCA, FA makes a distinction between communality and uniqueness, and PCA handles the original variance as a whole (including the "error").

For more information on factor analysis, please consult the relevant references (Dillon & Goldstein 1984; Everitt & Dunn 1992; Harman 1976, References).

Multi-dimensional scaling

To visualize the diversity of the set of models being analyzed, we represent them in 3D using classical multidimensional scaling (MDS) techniques (Everitt & Dunn 1992, References). The intermolecular distance between each pair of models in N-dimensional property space is given by:

Eq. 24            

where Dij2 is the distance between models i and j, and the sum is over all the model descriptors k = 1, P.

We determine the set of model coordinates that satisfy the distances, similar to the embedding step of distance geometry techniques, in which atomic coordinates are obtained from interatomic distances.

Similar concepts have been applied for visualizing conformers in 3D space, where the distance between pairs of conformers corresponds to the root-mean-square (rms) deviation between superimposed conformers (Levitt 1983, Hempel et al. 1995, References). The 3D model coordinates are given by the three largest eigenvectors of the corresponding metric matrix, scaled by the square root of the corresponding eigenvalue. In this plot, the distance between any pair of models approximates the distance in the full N-dimensional space. How good the approximation is depends on how much of the variance of the metric matrix is covered by the three largest eigenvalues.

In most of the data sets we have analyzed, this 3D representation of the models is quite good, since the three largest eigenvalues typically cover 80% or more of the variance. In practice, if all the intermolecular distances in the data set are Euclidean, then MDS is equivalent to principal component analysis (PCA) of the information matrix X*X (Everitt & Dunn 1992, References).

Fast MDS
A faster method of performing MDS with selected similarity coefficients is also available. This method is based on the observation that in certain cases (described below) the metric matrix can be factored into a product XX* of a matrix X and its transpose, thus the eigenvalues and the eigenvectors MDS evaluates may be calculated from the matrix X*X, a potentially smaller matrix.

The following table compares performance of the standard and the fast MDS method using the files dipep_400_fp.dat from the directory Cerius2-Resources/COMBICHEM/demos (using the ISIS public keys as the independent variable column) and a file containing ISIS public and private keys for a benzodiazepine library with 5760 analogs, for which the standard MDS method could not be run. All calculations were done on an SGI R10K machine.

Table 4. comparison of the performance of the standard and the fast MDS method

Dataset Number of Rows Number of Properties Standard MDS Fast MDS
dipep_400_fp   400   166   10 sec   2 sec  
benzo_5760_fp   5760   166   ---   18 sec  
benzo_5760_fp   5760   960   ---   6 min 58 sec  


Diversity metric functions

Distance-based functions

Having calculated the various descriptors and, optionally, their principal components, the next step is to select diverse subsets, that is, subsets of fragments or models that efficiently span the property space of the system. The main approaches that have been taken to date for selecting diverse subsets are D-optimal design (Federov 1972; Martin et al. 1995, References) and various clustering methods (Willet 1995, References). D-optimal design can be computationally intensive for datasets composed of large numbers of models and descriptors and suffers from the weakness that it tends to select outliers while leaving regions near the center or mean values of the data set unsampled, particularly where the number of models far exceeds the dimensionality of the property space. Some clustering methods can be applied to larger datasets, but they often suffer from sensitivity to the presence or absence of neighbors to a given point and from the limitation of having to select one model from each cluster.

Consequently, a set of diversity methods has been specifically designed for the C2·Diversity module, which select diverse subsets of models and which do not suffer from these problems. They include functions of the intermolecular distances in N-dimensional property space (or, alternatively, in the space represented by the principal components) that, when optimized, produce subsets of high diversity. These diversity functions are called the MaxMin, PowerSum, Product, and Max MinSpanTree functions:

Eq. 25            

Eq. 26            

Eq. 27            

where Nd is the number of intermolecular distances [that is, 0.5 M (M - 1), where M is the number of selected models]; and npower is the exponent of the PowerSum function (npower = -1 corresponds to the reciprocal of the sum of reciprocals and is the default value used in C2·Diversity).

Eq. 28            

where a is a parameter (square root of the "Gaussian alpha" parameter), and Di is the length of edge i in the minimum spanning tree. The sum is taken over all the edges in the tree.

Overview of diversity functions

The MaxMin function maximizes the minimum squared distance from each point to all other points in the selected subset of models. The PowerSum function maximizes the inverse of the sum of the reciprocals of the squares of all intermolecular distances between selected points. The Product function maximizes the product of the squares of the intermolecular distances. The value of the MaxMin function is not explicitly dependent on the number of models in the subset. The PowerSum and Product functions are normalized by the number of distances in the subset, to allow comparisons of subsets having different numbers of models.

The Max MinSpanTree function is based on calculating the minimum spanning tree [see Algorithms, R. Sedgewick, Addison-Wesley, 1983 and Computer Algorithms, Introduction to Design and Analysis, S. Baase, Addison-Wesley, 1988, References] for each subset of selected points in the stochastic optimization process, which is then used to compute an error function based on the length of each edge in the tree.

This function has excellent behavior with regard to both selecting models that tend to uniformly sample property space and also properly treating redundant (or nearly redundant) models in the sample set, especially in selecting subsets with combinatorial selection.

The four functions show some differences with respect to the subsets of models they select, but they all share the characteristic that higher values of the function correspond to subsets of models that differ more in the values of the model descriptors, that is, more diverse subsets.

Cell-based functions

These metrics are used for R-group subsetting, which provides the optimum choice of R-groups based on the diversity of the products. In this process, the combinatorial constraint restricts the choice of products to those that can be acheived through a strict matrix of reagents. Several metrics are available as a target function to evaluate the coverage of resulting sub-libraries:

Table 5. Cell-based functions

metric definition
Cell-based fraction:   Optimize fraction of occupied cells
Fraction = Cells occupied by subset / Number of occupied cells  
Cell-based Chi2:   Optimize Chi2 distribution
Chi2 = Sum (Ni - Nave)2  
Cell-based entropy:   Optimize entropy function
Entropy = -Sum (Ni x Log(Ni) )  
Cell-based density:   Optimize density function
Density = -Sum (Ni/Nave x Log(Ni/Nave)  
Where: Ni = Number of compounds in cell i
N
ave = Average number of compounds per cell
Sum = Sum over cell occupied by subset  


Stochastic optimization procedures

A Monte Carlo algorithm is used to optimize the diversity functions. Optimization entails:

Taken together, the calculation of model descriptors and the optimization of diversity functions provide a way to select diverse subsets of models and to quantitate the diversity of each subset. The diverse subsets resulting from the application of this procedure depend strongly, of course, on the descriptors used to characterize the models. Different sets of descriptors usually result in the selection of different subsets of diverse models.

Specific examples of the application of the techniques for visualizing and optimizing model diversity that are described in this section can be found in Hassan et al. (1996, References).

The stochastic optimization technique is fast relative to other methods (such as D-optimal design) and can be applied to large datasets. The CPU requirements are proportional to the product of the number of model descriptors used (or the number of principal components) and the number of models to be selected set. The number of steps required for convergence of the stochastic optimization increases with the total number of ways of selecting (i.e., the total number of combinations of) n models out of a set of M models: M!/(n! (M - n)!). The memory requirements are determined mostly by the size of the X matrix, where the number of rows equals the total number of models in the set, and the number of columns equals the number of model descriptors.


Functional form of the reagents penalty function

Reagent penalty terms are selected using the Reagent Penalties entry in the Restraints menu, available from the LIBRARY ANALYSIS card in the COMBI-CHEM I deck (for more information on reagent penalties, see Specifying reagent penalty restraints).

Any combination of five terms can be defined in the Reagents Penalty Function:

1.   Minimize number of different reagents.

2.   Minimize number of different suppliers.

3.   Minimize total monetary cost of the reagents.

4.   Minimize sum of supplier penalties.

5.   Minimize sum of reagent penalties.

Each of the penalty functions above has an associated user specified penalty. Suppose that of three molecules in the library subset, two use reagent X and the third uses reagent Y. Say the cost of reagent X is 1, and the cost of reagent Y is 2. There are then two types of penalties associated with reagent cost. The first is simply the sum of the costs for each molecule:

P = (1 + 1 + 2) = 4

The second is the sum of the reagent costs for each reagent category, which in this case is simply the cost of reagent X plus the cost or reagent Y, or:

P = (1 + 2) = 3

The total reagent penalty function is the sum of five terms, one for each of the functions described above:

Eq. 29            

where NPen is the number of penalty terms (from 1 to 5) that are being used in a particular run.

Penalty based on number of different reagents (P1)

Eq. 30            

where Nreagents is the number of different reagents selected in the subset and Nmax is the maximum number of different reagents that can be used, which is equal to the minimum of:

the number of molecules in the subset times the number of Rgroups.

the sum of the number of unique fragments in each Rgroup.

This term has a minimum value of 1/Nmax (only one reagent is used) and a maximum value of 1 (Nmax reagents are used).

Penalty based on number of different suppliers (P2)

Eq. 31            

where Nsuppliers is the number of different suppliers selected in the subset and Nmax is the maximum number of different suppliers that can be used, which is equal to the minimum of:

the number of molecules in the subset times the number of Rgroups.

the number of suppliers in suppliers table.

P2 has a minimum value of 1/Nmax (only one supplier is used) and a maximum value of 1 (Nmax suppliers are used).

Penalty based on total monetary cost (P3)

Eq. 32            

where the Maximum reagent cost is the cost of the most expensive reagent among the available reagents.

Note that how the Average cost of reagents in the subset is calculated depends on whether the Consider only presence/absence of reagents flag is checked.

For example, if there are five available reagents with the following properties...

Table 6

Reagent Cost Reagent Penalty Supplier Penalty
reagent1   1.0   0   0  
reagent2   2.0   2   0  
reagent3   7.0   4   0  
reagent4   10.0   8   5  
reagent5   20.0   3   5  

...and a given subset has 3 reagents:

reagent2, reagent4, reagent2

...then P3 will be calculated as follows:

Consider only presence/absence of reagent checked:

Consider only presence/absence of reagent unchecked:

P3 has a minimum value of 0 (Average cost of reagents in the subset = 0) and a maximum value of 1 (all reagents used have a cost equal to the Maximum reagent cost).

Penalty based on reagent penalties (P4)

Eq. 33            

The Maximum reagent penalty is the maximum possible value for a reagent penalty, which we have set to 10.

Note that how the Average reagent penalty of reagents in the subset is calculated depends on whether the Consider only presence/absence of reagents flag is checked.

For example, for a subset of 3 reagents:

reagent2, reagent4, reagent2

...P4 will be calculated as follows:

Consider only presence/absence of reagent checked:

Consider only presence/absence of reagent unchecked:

P4 has a minimum value of 0 (Average reagent penalty of reagents in the subset is 0) and a maximum value of 1 (all reagents used have a reagent penalty equal to the Maximum reagent penalty).

Penalty based on supplier penalties (P5)

Eq. 34            

The Maximum supplier penalty is the maximum possible value for a reagent penalty, which we have set to 10.

Note that how the Average supplier penalty of reagents in the subset is calculated depends on whether the Consider only presence/absence of supplier flag is checked.

For example, for a subset of 3 reagents:

reagent2, reagent4, reagent2

...P5 will be calculated as follows:

Consider only presence/absence of reagent checked:

Consider only presence/absence of reagent unchecked:

P5 has a minimum value of 0 (Average supplier penalty of supplier in the subset is 0) and a maximum value of 1 (all reagents used have a reagent penalty equal to the Maximum supplier penalty).


Conclusions

Combinatorial libraries can be rationally designed to achieve high model diversity. The diversity functions based on the differences in molecular properties (intermolecular distances in property space) and the stochastic optimization technique implemented in C2·Diversity can be used to select and visualize maximum-diversity subsets from a set of fragments to be attached at R-group positions and also to analyze the diversity of the generated library members.

The 3D visualization of the models in property space allows an assessment of the effectiveness of the descriptors in differentiating the models and provides insight into how different sets of descriptors result in the selection of different subsets of diverse models.



MSI Product Previous Next Contents Index Top

Last updated May 19, 2000 at 01:52PM Pacific Daylight Time.
Copyright © 2000, Molecular Simulations Inc. All rights reserved.