MSI Product Previous Next Contents Index Top
Combi-Chem


4h. Library comparison and augmentation

Introduction

The library comparison capabilities found in the C2·LibCompare module allow you to define and compare combinatorial libraries (or any sets of models) in terms of diversity and sampling of property space.

The functionality in the C2·LibCompare module is accessed through the LIBRARY COMPARISON card in the COMBI-CHEM I deck.

Library definition

The library definition capabilities allow you to classify sets of rows in the study table into different groups, or libraries. The Define Libraries control panel is opened by clicking the Define Libraries menu item on the LIBRARY COMPARISON card in the COMBI-CHEM I deck.

Controls in this panel allow you to:

When a library is defined, two new columns are added to the table: Library Name, with a specified name for the library, and Library, which contains a number that identifies the library in the table and can be used to color-code models by library in a 3D plot.

Comparing library diversity

The library comparison capabilities allow you to compare two libraries in their coverage of property or principal component space. You can use library comparison to decide if one library is more diverse than another. It may be valuable to perform library comparisons in library acquisition strategies where compound diversity is critical.

It is often possible to compare two libraries by qualitative visual inspection of the compounds in property or principal component space. The library comparison functionality provides a quantitative measurement that should confirm your visual assessment.

Several methods are available for comparing libraries. To open the control panels for these methods, click the Compare Libraries menu item on the LIBRARY COMPARISON card in the COMBI-CHEM I deck, then select one of the menu items.

The diversity integral method
The diversity integral method of library comparison proceeds as follows:

1.   Generate a set of random points that sample the property space covered by the two libraries and calculate the distance between each random point and the closest models in library 1 and library 2.

2.   For each library, calculate the sum (integral) of the minimum distances between the random points and that library, normalizing by the number of random points.

3.   The library with the lower value of this sum is considered to be more diverse by this method (that is, it better samples the property space occupied by the two libraries).
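The three steps above can be sketched in a few lines of Python. This is only an illustration, not the C2·LibCompare implementation: the function names, the uniform sampling over the bounding box of the two libraries, the Euclidean distance, and the absence of any property scaling are all assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two property vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diversity_integral(library, points):
    """Mean distance from each random point to its nearest library member."""
    return sum(min(euclidean(p, m) for m in library) for p in points) / len(points)


def compare_by_diversity_integral(lib1, lib2, n_points=2000, seed=0):
    """Return the normalized sums for lib1 and lib2; the lower value wins."""
    rng = random.Random(seed)
    combined = lib1 + lib2
    n_props = len(combined[0])
    # Step 1: sample the space covered by both libraries
    # (assumption: uniform sampling inside their joint bounding box).
    lows = [min(m[k] for m in combined) for k in range(n_props)]
    highs = [max(m[k] for m in combined) for k in range(n_props)]
    points = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
              for _ in range(n_points)]
    # Step 2: normalized sum of minimum distances for each library.
    return diversity_integral(lib1, points), diversity_integral(lib2, points)


# Hypothetical two-property libraries: one spread out, one clustered.
spread = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
clustered = [[0.45, 0.45], [0.5, 0.5], [0.55, 0.55], [0.5, 0.45], [0.45, 0.5]]
score_spread, score_clustered = compare_by_diversity_integral(spread, clustered)
# Step 3: the library with the lower score samples the space better.
more_diverse = "spread" if score_spread < score_clustered else "clustered"
```

Using the same set of random points for both libraries keeps the two sums directly comparable.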

You have control over several options that govern the library comparison process:

In the distance calculation, $D_{ij}$ is the distance between model $i$ and random point $j$, the sum is over all the independent variables (properties) $k$ from 1 to $N$, $X_{i,k}$ is the value of property $k$ for model $i$, and $X_{j,k}$ is the value of property $k$ for random point $j$.
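The distance expression that these symbol definitions refer to is not reproduced on this page. One form that is consistent with the definitions, assuming a Euclidean metric (other metrics may be offered by the options), would be:

$$
D_{ij} = \sqrt{\sum_{k=1}^{N} \left( X_{i,k} - X_{j,k} \right)^{2}}
$$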

The similarity method
The similarity method compares each molecule in the Candidate Library with all molecules in the Reference Library, calculating the minimum, average, and maximum distance values.

The Compare Libraries Similarity control panel gives you a number of options for displaying the data in plot, histogram or table formats as well as an option to select rows in the study table which satisfy certain comparison criteria (i.e., user-specified minimum, average and maximum distances).
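As a minimal sketch of this calculation, the following Python computes the minimum, average, and maximum distance for each candidate and then selects the rows that satisfy user-specified criteria. The function and parameter names are hypothetical, and the Euclidean distance is an assumption; the control panel itself may apply the criteria differently.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two property vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_profile(candidates, reference):
    """For each candidate, return (min, average, max) distance to the reference."""
    profile = []
    for c in candidates:
        distances = [euclidean(c, r) for r in reference]
        profile.append((min(distances),
                        sum(distances) / len(distances),
                        max(distances)))
    return profile


def select_rows(profile, min_at_least=0.0, avg_at_least=0.0, max_at_most=math.inf):
    """Indices of candidates whose distances satisfy all three criteria."""
    return [i for i, (dmin, davg, dmax) in enumerate(profile)
            if dmin >= min_at_least and davg >= avg_at_least and dmax <= max_at_most]


# Hypothetical data: the first candidate duplicates a reference molecule.
reference = [[0.0, 0.0], [3.0, 4.0]]
candidates = [[0.0, 0.0], [6.0, 8.0]]
profile = similarity_profile(candidates, reference)
selected = select_rows(profile, min_at_least=1.0)
```

Here only the second candidate is selected, because its closest reference molecule is 5 units away.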

The cell-based comparison method
A cell-based method of comparing libraries has been added to the Library Comparison module [Pearlman, R. S.; Wang, X. C.; Xu, Y.; Green, M. "Novel methods for assessing and comparing the diversity of chemical libraries" 218th ACS meeting, New Orleans, August 22-26 (1999)]. This method compares two libraries by binning the property space occupied by the two libraries and counting the number of molecules from the reference and candidate libraries that occupy each cell. Several metrics may be used to compare the libraries.

When using the option to compare libraries based on counting empty or occupied cells (see below), a cell is considered occupied by reference or by candidate molecules only if the number of molecules is greater than or equal to a user-specified minimum.

The cell-based library comparison method is accessed from the Compare Libraries/Cell Based menu item on the LIBRARY COMPARISON card in the COMBI-CHEM I deck, which opens the Compare Libraries Cell Based control panel.

Cell-based library comparison can be used with data in BDF files or in the QSAR study table. The space occupied by the two libraries can be binned to obtain a specified number of total cells or a specified number of cells occupied by at least one molecule. The optimum binning algorithm, which tries to divide the properties to create cells with sides as similar as possible, is used in both cases.

The comparison metrics can be calculated in two ways (Compare Libraries Based on popup):

For example, assuming that the space is divided into 10 cells with the following number of candidate and reference molecules in each one:

Cell number   Candidate molecules   Reference molecules
1             1                     8
2             5                     3
3             2                     0
4             0                     6
5             4                     10
6             2                     1
7             0                     9
8             0                     0
9             1                     11
10            3                     0

Then the comparison metrics using a) empty or occupied cells with a minimum of 1 molecule per cell to consider it occupied, b) empty or occupied cells with a minimum of 3 molecules per cell to consider it occupied, or c) taking into account the actual number of molecules per cell, are:

Metric                                    a) Empty or    b) Empty or    c) Actual number
                                          occupied       occupied       of molecules
                                          (min = 1)      (min = 3)      in cells
Cells with candidate mols                 7              3              7
Cells with reference mols                 7              6              7
Cells with candidate and reference mols   5              2              5
Cells with candidate or reference mols    9              7              9
Tanimoto coefficient                      0.56           0.29           0.56
Hamming distance                          4              5              46
Distance                                  2.00           2.24           17.89
Percentage overlap                        71.43          33.33          71.43
Carbo index                               0.71           0.47           0.48
Hodgkin index                             0.71           0.44           0.32
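The values in this example can be reproduced with the short Python sketch below. The formulas are inferred from the numbers in the table rather than taken from the module, so treat them as assumptions: in particular, the percentage overlap is taken here as the number of shared cells divided by the number of reference-occupied cells, and the function names are hypothetical.

```python
import math

# Molecule counts per cell, copied from the example table.
candidate = [1, 5, 2, 0, 4, 2, 0, 0, 1, 3]
reference = [8, 3, 0, 6, 10, 1, 9, 0, 11, 0]


def occupancy_metrics(cand, ref, minimum=1):
    """Metrics based on empty/occupied cells (columns a and b)."""
    c = [n >= minimum for n in cand]
    r = [n >= minimum for n in ref]
    n_c, n_r = sum(c), sum(r)
    both = sum(a and b for a, b in zip(c, r))
    either = sum(a or b for a, b in zip(c, r))
    hamming = either - both  # cells occupied by exactly one library
    return {
        "cells_candidate": n_c,
        "cells_reference": n_r,
        "cells_both": both,
        "cells_either": either,
        "tanimoto": both / either,
        "hamming": hamming,
        "distance": math.sqrt(hamming),
        "percent_overlap": 100.0 * both / n_r,  # assumed denominator
        "carbo": both / math.sqrt(n_c * n_r),
        "hodgkin": 2.0 * both / (n_c + n_r),
    }


def count_metrics(cand, ref):
    """Metrics based on the actual number of molecules per cell (column c)."""
    dot = sum(a * b for a, b in zip(cand, ref))
    cc = sum(a * a for a in cand)
    rr = sum(b * b for b in ref)
    return {
        "hamming": sum(abs(a - b) for a, b in zip(cand, ref)),
        "distance": math.sqrt(sum((a - b) ** 2 for a, b in zip(cand, ref))),
        "carbo": dot / math.sqrt(cc * rr),
        "hodgkin": 2.0 * dot / (cc + rr),
    }


metrics_a = occupancy_metrics(candidate, reference, minimum=1)
metrics_b = occupancy_metrics(candidate, reference, minimum=3)
metrics_c = count_metrics(candidate, reference)
```

In column c of the table, the cell counts, Tanimoto coefficient, and percentage overlap match column a, so only the four count-based metrics are recomputed in `count_metrics`. The sketch does not guard against the case where no cell is occupied.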

You can also use the by Reference Molecules popup to select molecules in the candidate library that occupy cells not occupied by reference molecules (different or new molecules) or select candidate molecules that are in cells already occupied by reference molecules (similar molecules).

Cosine-coefficient diversity and similarity
To set up and apply the cosine-coefficient diversity metric (see Theory), use the Compare Libraries Cosine Coeff Diversity and the Rgroup subsetting Diverse Library control panels. Open the Compare Libraries Cosine Coeff Diversity control panel by selecting Compare Libraries/Cosine Coeff Diversity on the LIBRARY COMPARISON card. Open the Rgroup subsetting Diverse Library control panel by selecting Rgroup Subsetting/Diverse Library on the LIBRARY ANALYSIS card and then set the Diversity Metric popup to Cosine-Coeff Div.

Cosine-coefficient similarity

The cosine-coefficient similarity metric is used to compare two libraries, computing the diversity of library A (candidate library), the diversity of library B (reference library), and the change in diversity when library A is added to library B. This metric works with numeric descriptors and with 2D fingerprints.

The control panel (Compare Libraries Cosine Coeff Diversity) that enables you to compare libraries using cosine-coefficient similarity with data in the study table and in BDF files is accessed from the Compare Libraries/Cosine Coeff Diversity menu item on the LIBRARY COMPARISON card.

Fingerprints OnBits metrics
See also 3D Pharmacophore fingerprints (3DKeys). The fingerprints OnBits metric can be used with both 2D and 3D fingerprints. It is based on generating a "modal fingerprint" for a set of N molecules, in which a bit is on if it is on in at least one molecule in the set.

In the functionality accessed by selecting the Rgroup Subsetting/Diverse Library menu item on the LIBRARY ANALYSIS card and setting the Diversity Metric (in the Rgroup subsetting Diverse Library control panel) to Fingerprint OnBits, libraries are designed to maximize the number of on bits in the modal 2D and/or 3D fingerprint of the sublibrary.

In the functionality accessed by selecting the Rgroup Subsetting/Focused Library menu item on the LIBRARY ANALYSIS card and setting the Distance Metric (in the Rgroup subsetting Diverse Library control panel) to Fingerprint OnBits, libraries are designed to maximize the number of common bits between the modal fingerprint of the sublibrary and a target fingerprint.

Fingerprint metrics are also available in Library Comparison. To access the Compare Libraries 3D Fingerprints Onbits control panel, select Compare Libraries/3D Fingerprints Onbits from the LIBRARY COMPARISON card.

3D fingerprint focusing

The modal 3D fingerprint of the candidate library is compared with the modal fingerprint of the reference library, reporting the number of on bits in each library, the number of common bits, the number of on bits in the candidate library not present in the reference library, and the number of on bits in the reference library not present in the candidate library. Options in the Compare Libraries 3D Fingerprints Onbits control panel allow you to list the molecules in the candidate library with on bits present in the reference library and to select the top N molecules from the candidate library (the ones with the highest number of common bits).
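The comparison described above can be illustrated with a small Python sketch in which each fingerprint is represented as a set of on-bit indices. The representation and the function names are assumptions made for the example, not the module's data structures.

```python
def modal_fingerprint(fingerprints):
    """Union of on bits: a bit is on if it is on in at least one molecule."""
    modal = set()
    for fp in fingerprints:
        modal |= fp
    return modal


def compare_modal(candidate_fps, reference_fps):
    """Counts reported when two modal fingerprints are compared."""
    cand = modal_fingerprint(candidate_fps)
    ref = modal_fingerprint(reference_fps)
    return {
        "candidate_on": len(cand),
        "reference_on": len(ref),
        "common": len(cand & ref),
        "candidate_only": len(cand - ref),
        "reference_only": len(ref - cand),
    }


def top_n_by_common_bits(candidate_fps, reference_fps, n):
    """Indices of the n candidates sharing the most on bits with the reference."""
    ref = modal_fingerprint(reference_fps)
    ranked = sorted(range(len(candidate_fps)),
                    key=lambda i: len(candidate_fps[i] & ref),
                    reverse=True)
    return ranked[:n]


# Hypothetical fingerprints given as sets of on-bit positions.
candidate_fps = [{1, 2, 3}, {3, 4}, {7, 8, 9}]
reference_fps = [{1, 2}, {2, 3, 5}]
report = compare_modal(candidate_fps, reference_fps)
top_two = top_n_by_common_bits(candidate_fps, reference_fps, 2)
```

Ranking by `len(fp - ref)` instead of `len(fp & ref)` would favor candidates that contribute new bits rather than common ones.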

The option to Create Histogram of New and Common Pharmacophores plots the frequency of new and common on bits when the molecules in the candidate library are compared with the modal 3D fingerprint of the reference library.

Distance-based library augmentation

The distance-based library augmentation functionality enables you to select a diverse set of models from a specified library to add to a previously defined library. Both libraries must have been defined using the library definition functionality described above.

This functionality is accessed by opening the Complement Library control panel (select the Complement Library menu item on the LIBRARY COMPARISON card in the COMBI-CHEM I deck).

You can control several options:

Clicking the Preferences... pushbutton opens the Analysis Preferences control panel, which allows you to fine-tune other settings for the diversity selection experiment (stochastic optimization of the diversity of the combined set of models).

The stochastic optimization proceeds similarly to a distance-based diverse selection, except that a subset of compounds (the library to add to) is not allowed to vary and is maintained as a fixed selection throughout the optimization.
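The idea of optimizing a combined set while holding part of it fixed can be sketched as follows. This is not the algorithm used by the module: the objective (maximizing the smallest pairwise distance in the combined set), the single-swap move, the acceptance rule, and all names are assumptions chosen to keep the example short.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two property vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_pairwise_distance(points):
    """Smallest distance between any two points (assumed diversity score)."""
    return min(euclidean(a, b)
               for i, a in enumerate(points) for b in points[i + 1:])


def augment_library(fixed, pool, n_add, n_steps=2000, seed=0):
    """Pick n_add pool members to add to `fixed`, which never changes."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(pool)), n_add)
    best = min_pairwise_distance(fixed + [pool[i] for i in chosen])
    for _ in range(n_steps):
        unused = [i for i in range(len(pool)) if i not in chosen]
        if not unused:
            break
        # Trial move: swap one selected pool member for an unused one.
        trial = list(chosen)
        trial[rng.randrange(n_add)] = rng.choice(unused)
        score = min_pairwise_distance(fixed + [pool[i] for i in trial])
        # Accept improvements and ties; ties let the search cross plateaus.
        if score >= best:
            chosen, best = trial, score
    return sorted(chosen), best


# Hypothetical data: two pool members sit almost on top of the fixed library.
fixed = [[0.0, 0.0]]
pool = [[0.1, 0.0], [0.0, 0.1], [10.0, 0.0], [0.0, 10.0]]
added, score = augment_library(fixed, pool, n_add=2)
```

Only the pool indices are varied; the members of `fixed` enter every score evaluation but are never swapped out, which mirrors the fixed selection described above.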

Hole identification and hole filling

The hole identification functionality enables you to find the largest unsampled areas (holes) of the property space covered by a combinatorial library. Those holes can then be filled with compounds to complement the original library.

The hole identification capabilities are accessed by opening the Find Holes control panel. Do this by selecting the Holes in Property Space/Find Holes menu item on the LIBRARY COMPARISON card in the COMBI-CHEM I deck.

The functionality available includes:

Note

A new column containing the sizes of the holes is added to the study table.

After holes are found, they can be filled using the hole-filling functionality. Use the Fill Holes control panel, which is opened by selecting the Holes in Property Space/Fill Holes menu item from the LIBRARY COMPARISON card in the COMBI-CHEM I deck.

For each hole, the model closest to the hole center and within the hole size is selected. If no model within the hole size is found, then none is selected.
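The selection rule can be written as a short Python sketch. It assumes that each hole is described by a center and a size interpreted as a radius, and that distances are Euclidean; the names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two property vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fill_holes(holes, models):
    """For each (center, size) hole, return the index of the closest model
    lying within the hole size, or None if no model is close enough."""
    selected = []
    for center, size in holes:
        best_index, best_distance = None, None
        for i, model in enumerate(models):
            d = euclidean(center, model)
            if d <= size and (best_distance is None or d < best_distance):
                best_index, best_distance = i, d
        selected.append(best_index)
    return selected


# Hypothetical holes and models in a two-property space.
holes = [((0.0, 0.0), 1.0), ((5.0, 5.0), 0.5)]
models = [(0.5, 0.0), (0.2, 0.1), (4.0, 4.0)]
fillers = fill_holes(holes, models)
```

The first hole is filled by the second model, and the second hole stays empty because its nearest model lies outside the hole size.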

3D fingerprint hole finding and filling

The modal 3D fingerprint of the candidate library is compared with the modal fingerprint of the reference library, reporting the number of OnBits in each library, the number of common bits, the number of OnBits in the candidate library not present in the reference library, and the number of OnBits in the reference library not present in the candidate library.

Options allow you to list the molecules in the candidate library with OnBits not present in the reference library and to select the top N molecules from the candidate library (the ones with the highest number of new OnBits).

Distance histogram library comparison

Distance histogram library comparison provides an easy method to examine candidate complement libraries. Libraries can be ranked based on the number of interesting compounds that could complement an existing collection. In addition, the members of the candidate library that do complement an existing set can be isolated from the rest of the offering. Distance histogram library comparison can also be used to identify how well a given subset of a library is able to represent the complete set.

Access distance histogram library comparison by selecting Compare Libraries - Similarity from the LIBRARY COMPARISON card.

Options for Library Comparison include:

Preferences for Library Comparison include:

Note

The library comparison scheme proceeds as follows: for every compound in the Candidate Library, distances to all compounds in the Reference Library are investigated. Depending on the distance measurement option, the minimum, maximum or average of these distances is retained. The set of distances obtained can then be plotted individually or as a histogram plot over the distance range. Results can also be analyzed in a Study Table.

Upon completion, the Compare Libraries - Similarity scheme provides several graphs. The defaults are the Similarity Plot and the Similarity Histogram.

[Figure: default Similarity Plot (left) and Similarity Histogram (right)]

Analysis of distance histograms
Histograms of the minimum distance distribution are probably the most useful for identifying library complements. When two candidate libraries are compared against a single reference library, the candidate library whose distribution of minimum distances is shifted to larger values is the better source of complementary compounds.

In the analysis of a single minimum distance histogram, those compounds that correspond to the left-hand side of the histogram are those for which there exists at least one close (similar) compound in the reference library. These compounds are therefore redundant with the reference library. The compounds that correspond to the right-hand side of the histogram are those for which there are no close (similar) compounds in the reference library. Those compounds are therefore of interest as potential complements to the reference library. A distance threshold can be set in the selection process so as to isolate those compounds from the rest of the candidate library.

The same procedure may be used to determine whether a subset of a library correctly represents the entire set of compounds. In this case, the entire set should be taken as the candidate library and the subset used as the reference library. The library comparison procedure will check that for every compound in the entire library there is an appropriate representative in the subset.

One key aspect of this methodology is the identification of proper distance thresholds. In other words, how different is different enough? Studies on the validation of descriptors conducted by Brown and Martin have suggested thresholds for use with fingerprint descriptors. A Tanimoto similarity coefficient of 0.85 has been suggested, which corresponds to a Tanimoto distance of 0.15 (1 - 0.85).
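To illustrate how such a threshold might be applied, the sketch below keeps the candidates whose minimum Tanimoto distance to the reference library exceeds 0.15. Fingerprints are represented as sets of on-bit indices, and the function names are hypothetical; this is an example of the selection idea, not the module's code.

```python
def tanimoto_similarity(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def complements(candidate_fps, reference_fps, distance_threshold=0.15):
    """Indices of candidates with no close neighbor in the reference library."""
    selected = []
    for i, c in enumerate(candidate_fps):
        min_distance = min(1.0 - tanimoto_similarity(c, r) for r in reference_fps)
        if min_distance > distance_threshold:
            selected.append(i)
    return selected


# Hypothetical fingerprints: two near-duplicates and one distinct molecule.
reference_fps = [set(range(1, 11))]
candidate_fps = [set(range(1, 11)), set(range(1, 10)), {1, 2, 20, 21}]
new_compounds = complements(candidate_fps, reference_fps)
```

The second candidate has a Tanimoto similarity of 0.9 to the reference molecule (distance 0.1), so it is treated as redundant; only the third candidate passes the threshold.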



Last updated May 19, 2000 at 01:52PM Pacific Daylight Time.
Copyright © 2000, Molecular Simulations Inc. All rights reserved.