Combi-Chem


4d. Library analysis techniques


Choosing molecular diversity descriptors

The appropriate model descriptors to apply to a diversity problem depend on several factors:

Using molecular diversity descriptors

Standard descriptor sets

A default set of descriptors has been defined for use in combinatorial chemistry applications. You can switch to this set by selecting Preferences/Defaults Set/COMBICHEM from the menu bar in the Study Table control panel. This set includes descriptors from several descriptor families such as electronic, spatial, structural, thermodynamic, and topological properties.

Note

Some of the descriptors listed above require a C2·Descriptor+ license.

3D field descriptors

3D field descriptors from MFA may be used for diversity applications. They can be found in the FIELD ANALYSIS (MFA) card located in the QSAR deck.

These descriptors can be used to efficiently characterize steric and electrostatic properties of molecules. They can be associated with the diversity tools in C2·Diversity to provide reduced sets of models that provide adequate sampling of 3D space in terms of steric and electrostatic properties. For example, a reduced set of reagents may be selected to represent the full set obtained from database searching. Although computations with field descriptors can be lengthy, columns containing the largest amount of variance can be easily identified and marked (see the Cerius2 QSAR+ documentation). This is particularly useful for speeding up diversity computation with these descriptors.

Note

Fingerprint descriptors

Fingerprint data can be calculated and stored in the study table. Fingerprints are analyzed and used as descriptors for diversity and similarity calculations. The fingerprints are displayed as hexadecimal digit character strings in the table.
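
As an illustration of how hexadecimal fingerprint strings can serve as descriptors for similarity calculations, the following Python sketch decodes two such strings into sets of on-bit positions and computes a Tanimoto coefficient. The function names, the bit ordering, and the sample strings are assumptions made for this example, and the Tanimoto coefficient is only one commonly used measure; the measures actually offered by Cerius2 may differ.

```python
def on_bits(hex_string):
    """Return the set of on-bit positions encoded by a hexadecimal string.

    Assumption: bit 0 is the most significant bit of the first hex digit.
    The bit ordering used in the study table is not stated in this section.
    """
    bits = set()
    for digit_index, digit in enumerate(hex_string):
        value = int(digit, 16)
        for offset in range(4):
            if value & (8 >> offset):
                bits.add(4 * digit_index + offset)
    return bits


def tanimoto(hex_a, hex_b):
    """Tanimoto coefficient: shared on-bits divided by distinct on-bits."""
    a, b = on_bits(hex_a), on_bits(hex_b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Hypothetical 16-bit fingerprints, written as four hex digits each.
fp1 = "F0A0"
fp2 = "F000"
similarity = tanimoto(fp1, fp2)
print(f"similarity = {similarity:.3f}")
```

A distance for diversity work could be taken as one minus this value, again as an assumption rather than a statement about the program.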

Cerius2 release 3.6 provides interfaces to ISIS keys and Daylight fingerprints. They can be accessed by selecting the corresponding entries in the Combichem Descriptors Database:

ISIS Keys
ISIS fingerprint data can be calculated and loaded into the study table using a new ISIS_keys entry in the Descriptors control panel. You can select all 960 keys or the 166-bit subset of public keys or both for calculation and loading.

Select the Descriptors/Select... item on the toolbar in the Study Table control panel to open the Descriptors control panel. From this panel, choose ISIS keys from the Descriptors in family popup and click the ISIS key... pushbutton. This opens the ISIS keys control panel, which contains check boxes for selecting either the full set or the public subset of ISIS keys.

After key selection, add the ISIS_key descriptor column(s) to the study table by clicking the ADD pushbutton in the Descriptors control panel. If the models are already present in the table, ISIS Host starts up and the fingerprints are evaluated and displayed. Alternatively, model addition can be deferred until after the descriptor columns have been added to the table.

Note

Daylight fingerprints
Daylight fingerprints can be calculated and loaded into the study table using the Descriptors control panel.

Select the Descriptors/Select... item from the menu bar in the Study Table control panel to open the Descriptors control panel. From this panel, set the Descriptors in family popup to Daylight and click the action button that's to the left of the Display popup. This shows the list of Daylight descriptors as a table in the Descriptors control panel.

After selecting the Daylight fingerprints row in this table, add the fingerprint column to the study table by clicking the ADD pushbutton in the Descriptors control panel. The Daylight fingerprints are extracted as a fixed-length (1024 bits) string and appear as a new column labeled DYFP-1024.

Note

The Daylight interface for input/output of SMILES strings and for calculating fingerprints and other descriptors is available only for SGI machines running IRIX 6.2 or higher.

Fingerprint columns added as descriptors to the study table (and set as independent variables) can be used in MDS, clustering (except for relocation clustering), selection of diverse models (distance-based), and selection of similar models. Running MDS automatically generates a 3D plot of the first three MDS components.

Fingerprint columns can be used in combination with other fingerprints (ISIS and Daylight) or in combination with numeric descriptors.

Catalyst descriptors

Catalyst HypoFit descriptors
Descriptors based on fit to Catalyst pharmacophores or hypotheses can be calculated and included in the study table by selecting the appropriate entry in the combichem descriptors database.

1.   Create the Catalyst database (.bdb) file if it hasn't been created yet. To do this, you need an .sd file containing the study models. Assuming that example.sd contains the study models, the following Catalyst commands may be used to create the database file:


>	catDB CONFIG example.bdb 

>	catDB sd example.sd example.bdb MaxConfs=20 No1D
2.   With the Study Table control panel open, select Descriptors/ Select... to open the Descriptors control panel. Set the Descriptors in family popup to HypoFit, then click the HypoFit... pushbutton. This opens the Hypothesis Fitting control panel.

The panel contains two file browsers, one for database file (.bdb) selection and one for hypothesis file (.chm) selection. The hypothesis file browser allows more than one file to be selected. When the Flexible fit check box is checked, flexible hypothesis fitting is performed in addition to the default rigid hypothesis fitting.

The selected database file name is displayed on the control panel below the database browser. Selected hypothesis file names are displayed in a list box in the Hypothesis Files control panel, which can be opened by clicking the Selected hypotheses... pushbutton. This panel can be used to adjust hypothesis file selections if you decide to remove some (or all) of the selections you first made. To remove all selections, click the Remove All pushbutton. To remove only some of the selections, highlight the desired filenames by clicking them in the list box and click the Remove Highlighted pushbutton.

3.   Now click the ADD pushbutton in the Descriptors control panel to add the HypoFit descriptors to the study table. If the study table already contains the models, catSearch and hypofitDriver begin executing as separate processes, resulting in an .esp file containing the results of the requested hypothesis fitting. These results are then loaded into the study table. If the models are not in the study table, then the cells are computed (and catSearch/hypofitDriver run) whenever the models are added to the study table. If the output .esp file resulting from a catSearch/hypofitDriver run corresponding to the specified database/hypothesis already exists, its contents are loaded into the study table immediately, and catSearch/hypofitDriver is not run.

The following naming conventions are used. Assuming the database file name is database.bdb and one of the hypothesis file names is hypo.chm:

Column names in the study table have the form:

The hypothesis and the database filenames are used as parts of column names (any leading directories and the .chm and .bdb suffixes are stripped). The extensions :1, :2, etc. are added to assure uniqueness of column names if identical file names from different directories are used.

The output .esp file(s) produced by catSearch/hypofitDriver are named hypo_1_out_rigid.esp for results of a rigid fit, or hypo_1_out_flex.esp for results of a flexible fit, and are saved in the local directory in which Cerius2 is run. The _1 unique extension in the .esp filename matches the :1 unique extension of the corresponding column name.
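
The file-naming convention can be sketched in Python as follows. This is only an illustration of the pattern described here; the function name and its parameters are hypothetical and are not part of Cerius2 or Catalyst.

```python
import os


def esp_filename(hypothesis_path, unique_index, flexible=False):
    """Build the expected output .esp file name for one hypothesis file.

    hypothesis_path: path to a .chm file; leading directories and the
                     .chm suffix are stripped.
    unique_index:    the number from the :1, :2, ... column-name extension.
    flexible:        True for a flexible fit, False for a rigid fit.
    """
    base = os.path.basename(hypothesis_path)
    stem, suffix = os.path.splitext(base)
    if suffix != ".chm":
        stem = base  # leave unexpected names untouched
    mode = "flex" if flexible else "rigid"
    return f"{stem}_{unique_index}_out_{mode}.esp"


print(esp_filename("/home/user/hypos/hypo.chm", 1))
print(esp_filename("/home/user/hypos/hypo.chm", 1, flexible=True))
```

Checking whether the returned name already exists in the run directory (for example with os.path.exists) indicates whether an earlier result file is available for loading.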

If the output descriptor .esp file contains values for models not in the study table, these are ignored. If there are study models for which descriptors are not available in the output .esp file, warning messages are displayed in the text window.

Note

If an output .esp file whose name exactly matches a column name (see the example below) already exists in the Cerius2 run directory, catSearch/hypofitDriver are not rerun, and the descriptors are loaded into the study table directly from that file. Consequently, to load a previously generated output .esp file into the study table, make sure that the name of the .esp file matches the column name exactly (including the unique extension number). Example:


>	mv hypo_5_out_rigid.esp hypo_3_out_rigid.esp
The descriptors are now loaded from hypo_3_out_rigid.esp directly, without running catSearch/hypofitDriver.

Catalyst CatShape descriptors

This release of Cerius2 contains enhancements to QSAR+ that enable users to add CatShape descriptors to the QSAR study table. CatShape descriptors include minimum, maximum, range, and average values for the molecular volume and the extents along three axes aligned with the principal moments of inertia of the conformers of each molecule. The procedure for obtaining the CatShape descriptors is briefly discussed here, but please refer to the Catalyst documentation for further details.
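
To make the four summary values concrete, the Python sketch below reduces a list of per-conformer values (here, molecular volumes) to a minimum, maximum, range, and average. The function name and the sample volumes are hypothetical; the real values come from Catalyst.

```python
def conformer_summary(values):
    """Return minimum, maximum, range, and average of per-conformer values."""
    if not values:
        raise ValueError("at least one conformer value is required")
    lowest = min(values)
    highest = max(values)
    return {
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
        "average": sum(values) / len(values),
    }


# Hypothetical molecular volumes for four conformers of one molecule.
volumes = [210.0, 214.0, 208.0, 212.0]
summary = conformer_summary(volumes)
print(summary)
```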

1.   Create a Catalyst database (.bdb file) from an SD file containing the models for which you want to calculate the catShape descriptors.

2.   In the Study Table control panel, select Descriptors/Select... from the menu bar to open the Descriptors control panel. Set the Descriptors in family popup to CatShape, then click the CatShape... pushbutton. This opens the Shape Descriptors control panel. The top part of the panel is a file browser for choosing the .bdb file created using Catalyst in Step 1. The check boxes in the bottom part of the panel are used to select and deselect descriptors to be added to the table. There are two volume descriptors and 15 principal axes descriptors. The two check boxes in this panel select both volume descriptors and all 15 principal axes descriptors, respectively. If you want to select specific descriptors, do so by clicking the Detailed Selection... pushbutton to open the Principal Axes control panel, which enables detailed selection of specific catShape descriptors.

3.   Now click the ADD pushbutton in the Descriptors control panel to add the catShape descriptors to the study table. If the study table already contains models, then catShape descriptor cells are evaluated and the values are displayed. If the models are not in the study table, then the cells are computed whenever the models are added to the study table. If the descriptor file specified contains values for models not in the study table, these are ignored. If there are study models for which descriptors are not available in the descriptor file, warning messages are displayed in the text window.

Daylight descriptors

In addition to Daylight fingerprints, this release gives access to these Daylight descriptors:

To use any of these descriptors, open the COMBICHEM defaults in the study table. To do this, select Preferences/Defaults Set/COMBICHEM from the Study Table control panel. Now, select the Descriptors/Select... menu item. The descriptors appear in the descriptor list as the Daylight family.

Note

Some of the functionality of the Cerius2 3.6 release (Daylight descriptors, Daylight fingerprints, and SMILES export) requires Daylight licensing. Other functionality, such as SMILES import, does not require a Daylight license, but the Daylight SMILES file-reader is available along with the MSI SMILES file-reader if it is licensed.

Kier and Hall E-state descriptors

Kier and Hall electrotopological descriptors (E-state) are accessed by setting the Descriptors in family popup in the Descriptors control panel to E_state_keys and then clicking the E_state_keys... pushbutton. The E-state Fingerprints control panel (shown below) allows you to select the type of descriptors: E-state type sums (sum of the electrotopological descriptors for each atom type), E-state type counts (counts of each atom type in the model), and E-state type indicators (presence or absence of each atom type). You can also specify which elements are to be taken into account.
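
The three descriptor types differ only in how per-atom values are aggregated by atom type. The Python sketch below shows this aggregation for a list of (atom type, E-state value) pairs. The function name, the atom-type labels, and the numeric values are illustrative assumptions; computing the per-atom E-state values themselves is outside the scope of this sketch.

```python
from collections import defaultdict


def estate_descriptors(atoms, atom_types):
    """Compute three descriptor vectors over a fixed list of atom types.

    atoms:      list of (atom_type, estate_value) pairs for one model.
    atom_types: ordered list of atom types defining the vector positions.
    Returns (sums, counts, indicators), each aligned with atom_types.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for atom_type, value in atoms:
        sums[atom_type] += value
        counts[atom_type] += 1
    sum_vector = [sums[t] for t in atom_types]
    count_vector = [counts[t] for t in atom_types]
    indicator_vector = [1 if counts[t] > 0 else 0 for t in atom_types]
    return sum_vector, count_vector, indicator_vector


# Illustrative atom-type labels and per-atom values (not computed here).
types = ["sCH3", "ssCH2", "sOH"]
atoms = [("sCH3", 2.0), ("sCH3", 1.5), ("sOH", 8.25)]
type_sums, type_counts, type_indicators = estate_descriptors(atoms, types)
print(type_sums, type_counts, type_indicators)
```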

For more information on Kier and Hall E-state descriptors, please consult Hall & Kier 1995 and Hall et al. 1992 (see the References chapter).

Calculating molecular diversity descriptors

To select a database, select the Descriptors/Databases... menu item in the Study Table control panel. This opens the Descriptor Database control panel, which lets you change the currently installed database set by setting a popup (the choices are QSAR, COMBICHEM, QSPR, and Other...) and then clicking the OPEN DATABASE pushbutton. If you are performing a diversity analysis, make sure the currently selected database is COMBICHEM.

To select specific sets of descriptors, open the Descriptors control panel by selecting the Descriptors/Select... menu item on the Study Table control panel. The Descriptors control panel contains a list of the descriptors in the currently selected database and offers options for choosing which descriptors are listed and selected and whether they are to be included in the study. Once you have made your selections, add them to the study table by clicking the ADD pushbutton. Alternatively, you can quickly add all of them by clicking the Add default descriptors icon on the Study Table control panel's toolbar.

Note

Descriptor calculation runs faster if done row-wise, because the model can be cached. Caching is done when you add the descriptors to the table before you generate the analogs (if possible).

Descriptor calculation for combinatorial libraries (C2·LibEngine)

Introduction to LibEngine

LibEngine produces fingerprints and other (Lipinski) descriptors and optionally runs clustering on large combinatorial libraries. It does this without enumerating the library structures, yet the fingerprints and descriptors it produces are identical to those obtained from enumerated structures.

Input is an MDL RG file, which may be exported from the Analog Builder or obtained from other software such as MDL's Project Library or Central Library. Fingerprints and, optionally, cluster numbers are output to a BDF file specified in the output file dialog.

The fingerprints are similar to the ISIS Keys in that they index predefined fragments stored in a dictionary. Three dictionaries are provided with the software; one should be selected using the dictionary file dialog.
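
The idea of indexing predefined fragments can be sketched in Python as follows: each dictionary entry owns one bit position, and a bit is set when the fragment occurs in the molecule. The fragment names and the function are hypothetical; the format of the dictionaries shipped with the software and the fragment-matching step are not shown.

```python
def dictionary_fingerprint(fragments_present, dictionary):
    """Return a bit string with one position per dictionary fragment."""
    present = set(fragments_present)
    return "".join("1" if fragment in present else "0" for fragment in dictionary)


# Hypothetical four-entry fragment dictionary.
dictionary = ["amide", "phenyl", "carboxylic acid", "primary amine"]

# Fragments found in one molecule; "nitrile" is not in the dictionary,
# so it does not contribute a bit.
found = ["phenyl", "primary amine", "nitrile"]
bits = dictionary_fingerprint(found, dictionary)
print(bits)
```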

By default, the fingerprints are generated and then processing stops. To continue on to clustering, specify a number of clusters between 1 and the total number of molecules contained in the RG file in the Clusters required box. K-means relocation clustering is then run and a cluster number for each molecule in the library is added to the RG file.
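
For readers unfamiliar with k-means relocation clustering, the following Python sketch shows a generic textbook version applied to fingerprint bit vectors. It is not LibEngine's implementation: the seeding rule (the first k points), the squared Euclidean distance, and all names are assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    """Cluster points into k groups; returns one cluster number per point.

    Assumptions: the first k points seed the centroids, and squared
    Euclidean distance is used. Cluster numbers start at 1.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [[float(x) for x in p] for p in points[:k]]
    assignment = [None] * len(points)
    for _ in range(max_iterations):
        # Relocation step: move each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [a + 1 for a in assignment]


# Four hypothetical 4-bit fingerprints.
fingerprints = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 0]]
clusters = k_means(fingerprints, 2)
print(clusters)
```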

Accessing the tools

Select COMBI-CHEM II from the list of menu decks, and click LIB ENGINE to bring the LibEngine card forward. Click Setup and Run to launch the Run LibEngine control panel.

Using LibEngine

1.   Enter an input MDL RG file in the Input file dialog, either by selecting it from the file browser or typing it into the text box.

2.   Enter a name for the Output BDF file in the Output BDF file dialog, either by selecting it from the file browser or typing it into the text box.

3.   Select a Dictionary file, either by selecting it from the file browser or typing it into the text box.

4.   Optionally, select an Output SMILES file, either by selecting it from the file browser or typing it into the text box.

5.   Select one or more of the following:

Generate Fingerprints Generate fingerprint descriptors from the specified library.

Generate Structural Descriptors To modify the descriptor definitions, click the Preferences... button next to Generate Structural Descriptors to bring up the LibEngine Structural Descriptors control panel, then specify the Rotatable bonds, Hbond donors, and Hbond acceptors definitions you wish to change.

Output SMILES Although descriptor calculation does not require enumeration of structures, LibEngine does provide an option to enumerate a library to SMILES. The SMILES notation used by LibEngine differs from conventional Daylight SMILES; however, these SMILES can be canonicalized automatically using the Canonicalise SMILES button (requires the Daylight toolkit).

Number of Clusters Specify the number of Clusters to be generated (0 for fingerprints only).

6.   Click RUN LIB_ENGINE to run the fingerprint/descriptor/cluster generation program.

Loading molecular diversity descriptors

Diversity assessment can also be initiated by loading previously calculated descriptor data into a study table.

If you have a table of data that you want to analyze, you can select the File/Open... menu item in the Study Table control panel to open the Open Study Table control panel. Use this panel to find and load the desired table file.

If you have a table of data that you want to combine with current data or models in an open study table, you need to import the data. Select the File/Import... menu item in the Study Table control panel to open the Import Table control panel. This panel allows you to specify the features of the ASCII file of tabular data that you want to import. Importing brings the data into the Table Manager control panel, where you can edit the data or decide to re-import it if you specified the incorrect format.

Once the information is in the table manager, you can add it to the information already in the study table by using the Edit/Paste Special... menu item in the Study Table control panel. By default, the information is pasted into new rows and columns, but you can specify otherwise. For example, if the rows of the table to import are from the same models as the information currently in the study table, check Join using Row Index before pasting, so as to add the new data to the corresponding rows in the study table.
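
The effect of joining on the row index can be pictured with the Python sketch below, in which each table is a mapping from row index to a dictionary of column values. The data structure, the function name, and the handling of rows that exist only in the imported table are assumptions of this example, not a description of the Table Manager's internals.

```python
def join_on_row_index(study_table, imported_table):
    """Add imported columns to the rows that share the same row index.

    Both tables map a row index to a dict of column name -> value.
    In this sketch, rows present only in the imported table are
    appended as new rows.
    """
    joined = {index: dict(row) for index, row in study_table.items()}
    for index, new_columns in imported_table.items():
        joined.setdefault(index, {}).update(new_columns)
    return joined


# Hypothetical values for two models already in the study table.
study = {1: {"MW": 180.2}, 2: {"MW": 151.2}}
imported = {1: {"AlogP98": 1.2}, 2: {"AlogP98": 0.5}}
combined = join_on_row_index(study, imported)
print(combined)
```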

Managing molecular diversity descriptors

Several tools are available to help you manage the set of descriptors to be used for characterizing molecular diversity. The descriptors that are used for most analysis procedures in Cerius2 (QSAR regressions, principal component analysis, factor analysis, multidimensional scaling, and cluster analysis) are defined as independent variables and labeled with an X in their column headings in the study table.

The Study Table control panel offers several options for specifying dependent and independent variables:

The Manage descriptors control panel facilitates the management of descriptors present in the study table. Open it by selecting the Descriptors/Manage Descriptors... item from the menu bar in the Study Table control panel.

With the Manage descriptors control panel, you can:

Fast calculation of 2D descriptors

A faster way of calculating most of the 2D descriptors available in Cerius2 has been implemented. The descriptors for which fast calculation is enabled are:

The fast descriptor calculation is about 100 times faster than the standard descriptor calculation in Cerius2. This speedup is attained by bypassing the creation of Cerius2 models and the posting of descriptor values in the QSAR study table before exporting them to BDF or DAT files:

Comparison of standard and fast calculation

For each molecule, the standard descriptor calculation requires five steps, two of which require much time:

1.   Create Cerius2 model from MOL or SMILES (slow!).

2.   Cache Cerius2 model into local_mol structure.

3.   Calculate descriptors.

4.   Post results to study table row (slow!).

5.   Export row to BDF and/or datafile.

For each molecule, the fast descriptor calculation requires only three steps, omitting the slow steps above:

1.   Cache molecule information in local_mol structure.

2.   Calculate descriptors.

3.   Export values to BDF and/or datafile.

Performing fast descriptor calculation

The Fast Descriptors control panel can be accessed in several ways:

Input molecules are read from an SD or SMILES file (no charge calculation or minimization can be performed in this mode), all or selected descriptors are calculated, and the results are output to a binary datafile (BDF file) and/or an ASCII datafile (DATA).

To input molecules from an SD file, select the SD option in the File popup (in the Fast Descriptors control panel) and click the Select Molecules from button. The Select SD file for Fast Descriptors control panel appears.

To input molecules from a SMILES file, select the SMILES option in the File popup (in the Fast Descriptors control panel) and click the Select Molecules from button. The Select SMILES file for Fast Descriptors control panel appears.

These control panels for input of molecules enable you to select all or specific molecules from the files, to calculate descriptors for the largest fragment only or for all atoms in the molecule, and to choose either the Daylight or the internal Cerius2 reader to parse SMILES strings.

The descriptors to be calculated are selected by clicking the Select Descriptors button in the Fast Descriptors control panel, which opens the Select Fast Descriptors control panel.

The table on the left side of the Select Fast Descriptors control panel contains all the descriptors that can be calculated by the fast method. Selecting one or more rows in this table and clicking the arrow between the table and the list box to its right enters the corresponding descriptors into the list box. Some descriptors, such as MW and Rotbonds, consist of only one value and therefore create only one entry in the list box. Other descriptors, such as AlogP types and Chi indices, are actually descriptor groups and create multiple entries in the list box.

The actual number of descriptors associated with multiple descriptors is controlled by the preferences for the corresponding descriptor family. Set these preferences by choosing the desired family (Structural, Topological, Information, Thermodynamic, E_State_keys, or Substructure) from the Set Preferences for popup and clicking the Set Preferences for button.

The Load Fast Descriptors Set and Save Fast Descriptors Set control panels allow you to load and save specific sets of descriptors for later use. Several predefined sets are provided in the Cerius2-Resources/COMBICHEM/demos directory, including combi_fast.fds (44 structural, thermodynamic, and topological descriptors), fastdesc_structural.fds (MW, Rotbonds, HBA, HBD, AlogP98), fastdesc_topological.fds (37 topological descriptors), and fastdesc_atomtypes.fds (154 AlogP atom type and E-state key descriptors).

ISIS keys and Daylight fingerprints

You can create BDF files directly from ISIS keys or Daylight fingerprint files, which enables faster calculation of ISIS keys and Daylight fingerprints and faster loading into the QSAR study table. This new functionality is available in the 2D Fingerprints ISIS keys and 2D Fingerprints Daylight control panels, which are opened by clicking the ISIS keys and Daylight Fingerprints pushbuttons in the Select Fast Descriptors control panel.

3D Pharmacophore fingerprints (3DKeys)

See also Fingerprints OnBits metrics.

A 3D fingerprint for a molecule is defined as the collection of all combinations of three features (triplets) or four features (quadruplets) in 3D space for all conformers. Each multiplet is characterized by a set of feature types and the corresponding inter-feature distances. The program CatFeatures, in the Catalyst environment, is used to identify the features present in the molecules. The possible features considered are:

Once the features are identified, Cerius2 is used to construct the 3D Fingerprints for all molecules. The diagram below illustrates the steps and programs required to calculate 3D fingerprints for a library or any set of molecules in an SD file.

Creating feature (FEA) file from a BDB file

Accessing the tools

In the Study Table, select Descriptors/3D Fingerprints... to bring up the 3D Fingerprints control panel.

Creating the features file within Cerius2

If a feature file does not exist, it must be created using CatFeatures. Although this program is not a part of Cerius2 and can be executed independently, the catFeatures command can be issued from the Create Features File panel.

1.   On the 3D Fingerprints panel, click the Create Features File button to bring up the Create Features File control panel.

2.   Under Select features, select the features to be included in the feature dictionary.

3.   Select the Catalyst database file (.bdb file) to be converted.

4.   Enter an appropriate name for the features file in the .fea file name text box, or accept the default.

5.   Click CREATE FEATURES FILE.

Creating the features file outside Cerius2

Make sure that you have sourced the script <C2_install_dir>/cat400/cshrc to set up the required Catalyst environment.

1.   Build a Catalyst database (.bdb file) from an SD file.

2.   Create the features file. The features file should contain just one line for each feature. The possible features are:

NEG CHARGE

POS CHARGE

NEG IONIZABLE

POS IONIZABLE

HB ACCEPTOR

HB DONOR

RING AROMATIC

HYDROPHOBIC

You can omit some of the features, but the ones you keep should preserve the order shown above.

3.   Enter the following command after the Unix prompt (all on one line):


>	$CATALYST_BIN/catFeatures <name>.bdb -getMapping -allHitConfs 
-featuresFile features_file -maxhits 9999999 -mappedOutputFile <name>.fea
where:

<name>.bdb is the name of the .bdb file (for example: mymols.bdb)

<name>.fea is the name of the output features file (for example: mymols.fea)

features_file is a file that contains the features you want to consider, as described in step 2.
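
Steps 2 and 3 can also be scripted. The Python sketch below writes the features file contents in the required order and assembles the catFeatures argument list using only the options shown above. The function names are hypothetical, and the sketch only builds the command; it does not run it.

```python
import os

# Feature names in the order listed in step 2.
FEATURE_ORDER = [
    "NEG CHARGE", "POS CHARGE", "NEG IONIZABLE", "POS IONIZABLE",
    "HB ACCEPTOR", "HB DONOR", "RING AROMATIC", "HYDROPHOBIC",
]


def features_file_text(selected):
    """Return features-file contents: one feature per line, in list order."""
    unknown = set(selected) - set(FEATURE_ORDER)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    kept = [name for name in FEATURE_ORDER if name in selected]
    return "\n".join(kept) + "\n"


def cat_features_command(name, features_file="features_file"):
    """Return the catFeatures argument list for <name>.bdb -> <name>.fea."""
    executable = os.path.join(os.environ.get("CATALYST_BIN", ""), "catFeatures")
    return [
        executable, f"{name}.bdb", "-getMapping", "-allHitConfs",
        "-featuresFile", features_file, "-maxhits", "9999999",
        "-mappedOutputFile", f"{name}.fea",
    ]


text = features_file_text(["HYDROPHOBIC", "HB DONOR"])
command = cat_features_command("mymols")
print(text, end="")
print(" ".join(command))
```

The argument list could be passed to subprocess.run once the Catalyst environment has been set up as described above.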

Creating 3D fingerprint (3PF, 4PF) file from a feature (FEA) file

In this step a binary 3D fingerprint file is obtained from the feature file.

1.   On the 3D Fingerprints panel, click the Create Fingerprint File button to open the Create Binary Fingerprint File control panel.

2.   Under Select features, select the features to be included in the fingerprint file.

3.   Select the feature file (.fea file) to be converted.

4.   Enter an appropriate name for the 3D Fingerprints file in the 3pf/4pf file name text box, or accept the default.

5.   Optional: To create 4-point pharmacophores (instead of the default 3-point pharmacophores), check the Create 4-features Pharmacophores check box.

6.   Optional: Modify the grid size (default 10 x 10), spacing (default 2.0 Å), and separation (default 0.3 Å).

7.   Click CREATE 3D FINGERPRINTS FILE.

Both the input FEA and the output 3PF/4PF files can be specified in the usual manner. Only those features selected under Select features may be included in the 3PF/4PF fingerprint file. Any other features will be ignored. Note, however, that selected features that are not present in the FEA file will also be ignored (with a warning message in the textport).

The remaining options on the panel specify the parameters of the planar grid to which all the feature positions are "snapped" before writing them out as fingerprints to the 3PF/4PF file.

Grid size: number of grid points along the side. Size N describes a square grid of N x N points.

Grid spacing: distance between adjacent grid points (in Ångstroms).

Minimum Separation: features in a triplet have to be separated by at least this distance (in Ångstroms) to be considered a pharmacophore. If any of the three distances is smaller than this threshold, the triplet is ignored.
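
The minimum-separation rule can be illustrated with a short Python sketch that enumerates all feature triplets of one conformer and skips any triplet with a distance below the threshold. The function name, the coordinates, and the point at which the test is applied (here, before any grid snapping) are assumptions of this example.

```python
from itertools import combinations
from math import dist


def feature_triplets(features, min_separation=0.3):
    """Yield (types, distances) for each triplet that passes the threshold.

    features: list of (feature_type, (x, y, z)) for one conformer.
    A triplet is skipped if any of its three distances is below
    min_separation.
    """
    for (t1, p1), (t2, p2), (t3, p3) in combinations(features, 3):
        d12, d13, d23 = dist(p1, p2), dist(p1, p3), dist(p2, p3)
        if min(d12, d13, d23) < min_separation:
            continue
        yield (t1, t2, t3), (d12, d13, d23)


# Hypothetical feature positions (in Angstroms) for one conformer.
features = [
    ("HBA", (0.0, 0.0, 0.0)),
    ("HBD", (4.0, 0.0, 0.0)),
    ("HYD", (0.0, 3.0, 0.0)),
    ("POS", (0.1, 0.0, 0.0)),  # only 0.1 from HBA, below the threshold
]
triplets = list(feature_triplets(features))
for types, distances in triplets:
    print(types, [f"{d:.2f}" for d in distances])
```

In this example the two triplets that contain both HBA and POS are dropped, leaving two of the four possible triplets.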

Adding a 3D fingerprint column to the study table

3D fingerprint columns with data corresponding to different 3PF/4PF files can be added to the study table by the standard procedure of selecting and adding descriptors from the Descriptors control panel.

1.   From the Study Table, select Descriptors/Select... to bring up the Descriptors control panel.

2.   Set the Descriptors in family popup to 3D_Fingerprints and click the 3D_Fingerprints... pushbutton.

3.   Back on the Descriptors panel, click ADD to add the new column to the table. The name of the column consists of the characters 3DFP3- for 3-point fingerprints and 4DFP4- for 4-point fingerprints, followed by the 3PF/4PF file name.

The 3D fingerprints are evaluated and stored in a binary file (the 3PF/4PF file). Because of their size they are not loaded into the study table but read from the disk as needed for calculations. However, a memory buffer of adjustable size is provided to reduce excessive disk input/output. The study table cells are filled with numbers of pharmacophores present in the compounds.

If the 3PF/4PF file does not exist yet, two pushbuttons linked to the panels shown in the two subsections above are provided for convenience.

Selecting similarity coefficients and displaying pharmacophores

Similarity coefficients are selected by using the Similarity Coefficient popup on the 3D Fingerprints preference panel. First open the preference panel by selecting Descriptors/3D Fingerprints... from the study table toolbar.

Note that the panel contains three pushbuttons for displaying pharmacophores in the textport:

List 3-Point Pharmacophore
Shows the pharmacophore that would correspond to the bit position in the text box. Information is displayed for all independent 3D fingerprint study table columns. You can highlight one or more study table columns to obtain pharmacophore information only for those columns.

List 4-Point Pharmacophore
Shows the pharmacophore that would correspond to the bit position represented by the integer pair in the text boxes (First index, Second index). As above, you may limit the information to selected columns.

For these two buttons, displayed pharmacophore information is presented in the following format (example is of a 3-point pharmacophore listing):


File name: ./test.3df
size: 10, grid spacing: 2.00
Features present in the file: NEG POS NEGI POSI HBA HBD RING HYD
HBA POS NEG 8.00 15.23 8.49

List Pharmacophores for Selected Rows
Shows pharmacophores present in the 3PF/4PF data files for compounds contained in highlighted study table rows.

For this button, displayed pharmacophore information is presented in the following formats.

3-point:


File name: ./mao_1634.3pf 
Grid size: 10, grid spacing: 2.00
Features present in the file: NEG POS NEGI POSI HBA HBD RING HYD
Compound name: 92 Pharmacophores d12 d13 d23
1. HYD HBA HBA 2.00 4.47 2.83
2. HBA HBA HYD 2.00 2.83 2.00
3. HYD HBA HYD 2.00 4.47 2.83
4. HYD HBA HYD 2.00 4.47 4.00

4-point:


File name: ./monopep.4pf
Grid size: 10, grid spacing: 2.00, min. separation: 1.00
Features present in the file: NEG POS HBA HBD
Compound name: monopeptide-lys Pharmacophores d12 d13 d23 d14 d24 d34 sign
1. HBD HBA HBA HBD 4.00 4.47 2.00 6.63 3.46 2.83 +
2. HBD HBA HBA HBD 4.00 6.32 2.83 6.63 3.46 2.00 -
3. HBD HBA HBA HBD 6.00 6.32 2.00 6.32 2.00 2.83 +
4. HBD HBA HBA HBD 6.00 6.32 2.00 6.63 2.83 2.00 +

By convention, features in the first column are positioned at the grid origin, features in the second column along the x-axis, features in the third column on the xy-plane, and, in the 4-point scheme, the fourth feature falls on a grid point in the xyz-space.

Distances between features in first and second column are labelled "d12", and similarly for remaining distances. In each row in the 4-point scheme, the "sign" column refers to the sign of the z coordinate of the fourth feature: "-" for z < 0 and "+" otherwise.
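
The placement convention for a 3-point pharmacophore can be sketched in Python as follows: the first feature is put at the origin, the second on the x-axis, the third in the xy-plane, and the coordinates are then moved to the nearest grid point before the distances are recomputed. The snapping rule (rounding each coordinate to the nearest multiple of the grid spacing) and all names are assumptions of this example; the exact procedure used by Cerius2 is not described here.

```python
from math import dist, sqrt


def snap(value, spacing):
    """Move a coordinate to the nearest multiple of the grid spacing."""
    return round(value / spacing) * spacing


def snapped_triplet_distances(p1, p2, p3, spacing=2.0):
    """Return (d12, d13, d23) after placing and snapping a triplet.

    The first feature goes to the origin, the second onto the x-axis,
    and the third into the xy-plane (y >= 0); coordinates are then
    snapped to a grid with the given spacing.
    """
    d12, d13, d23 = dist(p1, p2), dist(p1, p3), dist(p2, p3)
    if d12 == 0:
        raise ValueError("the first two features must not coincide")
    # Planar coordinates of the third feature from the three distances.
    x3 = (d12**2 + d13**2 - d23**2) / (2 * d12)
    y3 = sqrt(max(d13**2 - x3**2, 0.0))
    a = (0.0, 0.0)
    b = (snap(d12, spacing), 0.0)
    c = (snap(x3, spacing), snap(y3, spacing))
    return dist(a, b), dist(a, c), dist(b, c)


# Hypothetical feature positions (in Angstroms).
distances = snapped_triplet_distances(
    (0.0, 0.0, 0.0), (2.1, 0.0, 0.0), (4.0, 1.9, 0.0)
)
print(" ".join(f"{d:.2f}" for d in distances))
```

Because every feature ends up on a grid point, the resulting distances take only a limited set of values, which is why the listings above show recurring distances.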

You can also obtain pharmacophore information corresponding to the "key" entered in the two integer data entry boxes under the List 4-Point Pharmacophore pushbutton.

Additional options on the panel include:

Maximum Memory 3D Fingerprints (Mb)
This sets the maximum size of the memory buffer for 3D fingerprint-based similarity/distance calculations. Since the fingerprint bit information is stored in files and not in the study table, disk access times may become prohibitive when setting up similarity and distance information. A buffer can be used in those cases to perform this step in memory instead.

Pharmacophore visualization in Cerius2

Pharmacophore visualization allows you to view the grid-bound pharmacophores in the model window.

A concrete example

1.   On the 3D Fingerprints control panel, open the Visualize Pharmacophores control panel by clicking Visualize Pharmacophores.

2.   Set the browser filter under Include only Pharmacophores Containing.

Suppose you set the browser filter to (at least) two HBDs and one NEGI (see figure above).

3.   Click Browse Pharmacophores for Selected Row and Column.

In our example, say this resulted in selecting 96 out of 762 pharmacophores present in row 3 of the study table (see figure above).

4.   Under Select Row, choose a pharmacophore for viewing.

In our example, say you select the eighth pharmacophore for viewing (see figure above).

5.   The index pair representing the selected pharmacophore appears in the data entry boxes under Plot 4-Point Pharmacophore (see figure above).

The pharmacophore plot for the browser example above might look like:

Features represented by Name:

Features represented by Atom (Assuming BALL & STICK atom display style):

In a 4-point pharmacophore display, the first three features (in the xy-plane) are connected by yellow edges. White edges connect these features to the remaining feature (in the xyz-space). The default orientation plots the first feature (at the grid origin) in the lower left, the second feature (along the x-axis) horizontally to the right, and the third feature (in the xy-plane) above the first two.

The three remaining pushbuttons near the top of the panel display pharmacophores that would correspond to fingerprint "keys" (or "bits") you enter into the data entry boxes. (These "keys" may or may not be present in the actual binary fingerprint file.)

File conversion utilities

Three binary file conversion utilities can be accessed from the File Conversion Utilities panel:

convert 3DF binary fingerprint file to the new 3PF format

convert 3PF binary fingerprint file to ASCII (a "3TX" file)

convert 4PF binary fingerprint file to ASCII (a "4TX" file)

To bring up the File Conversion Utilities control panel, click File Conversion Utilities... on the 3D Fingerprints control panel.

Type in the input file name in the supplied data entry box and click the corresponding pushbutton.

The name of the output file is automatically generated by stripping the existing .3df/.3pf/.4pf suffix and replacing it with .3pf/.3tx/.4tx, respectively. If the input file name does not end in these suffixes, the new suffix is merely appended.
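The naming rule above can be sketched as follows. This is an illustration only: the function name, the conversion labels, and the case-sensitive suffix match are assumptions, not part of Cerius2.

```python
# Suffix pairs taken from the text above; the dictionary keys are
# hypothetical labels for the three conversion utilities.
CONVERSIONS = {
    "3df_to_3pf": (".3df", ".3pf"),
    "3pf_to_3tx": (".3pf", ".3tx"),
    "4pf_to_4tx": (".4pf", ".4tx"),
}

def output_name(input_name: str, conversion: str) -> str:
    """Derive the output file name for one of the conversion utilities."""
    old_suffix, new_suffix = CONVERSIONS[conversion]
    if input_name.endswith(old_suffix):
        # Strip the existing suffix and replace it with the new one.
        return input_name[: -len(old_suffix)] + new_suffix
    # Otherwise the new suffix is merely appended.
    return input_name + new_suffix

print(output_name("./mao_1634.3pf", "3pf_to_3tx"))
print(output_name("library.dat", "4pf_to_4tx"))
```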

If the automatically generated output file name would result in overwriting an existing file, a warning message appears and the user has the option either to proceed or to rename the input (or the existing output) file in order to avoid losing a file.

Creating a Catalyst query from a single pharmacophore

To create a Catalyst query file (.chm file) from a single 3-point (triplet) or 4-point (quadruplet) pharmacophore, click the Create Catalyst Query from Single Pharmacophore button to bring up the Catalyst Query from Single Pharmacophore control panel.

The pharmacophore can be defined by entering the feature types and inter-feature distances, or by entering the pharmacophore index from an existing 3D fingerprint file. The resulting .chm query file can then be used in a 3D database search.

Creating a 3D Fingerprint file (.3pf, .4pf) from a single pharmacophore

To create a Cerius2 3D fingerprint file (.3pf, .4pf) from a single 3-point (triplet) or 4-point (quadruplet) pharmacophore, click the Create 3D Fingerprint from Single Pharmacophore button to bring up the 3DFP from Single Pharmacophore control panel.

The pharmacophore can be defined by entering the feature types and inter-feature distances, or by entering the pharmacophore index from an existing 3D fingerprint file.

Merging binary fingerprint files

To bring up the Merging of Binary 3D Fingerprint Files control panel, click Merge 3D Fingerprint Files.

This tool combines two binary fingerprint files of the same type (either 3PF or 4PF) into a single file of that type. This eliminates the necessity of recalculating the fingerprints from FEA files when adding compounds and their fingerprints to existing binary files.

Files to be merged must satisfy the following criteria:

To merge two binary files:

1.   Find the first binary file in the browser on the left-hand side of the control panel and click Select first binary 3D fingerprint file, or simply type the file name in the text box.

2.   Find the second binary file in the browser and click Select second binary 3D fingerprint file, or simply type the file name in the text box.

3.   Find the new file name and click Select as new binary 3D fingerprint file (if overwriting an existing file) or simply type the new name in the text box.

4.   Click MERGE 3PF/4PF FILES to merge the two files.

Note

If the same compound name is present in both files, it is assumed the corresponding fingerprints are identical (since the names and all of the grid/feature parameters match). In this case, the resulting merged output file will contain both compound names with duplicated fingerprint information.

Loading binary fingerprint files into the study table

You can load single 3PF/4PF fingerprint files into the study table directly, without using the SD file. This operation overwrites any existing information in the table. The models themselves are not loaded into Cerius2, but the statistical analysis tools accessible from the table can be applied to the data.

To load a binary fingerprint file into the table, click the Load 3D Fingerprint File to Study Table button to bring up the Loading Binary 3D Fingerprint File to Study Table control panel.

1.   Select the 3D fingerprint file either through a browser (opened by clicking Browse...) or by typing the 3PF/4PF file name into the text box.

2.   You may supply an SD file name in the Name of SD file text box. This is used to store information needed to recover the molecules in the study table (by using Molecules/Recover Molecules).

Statistical analyses and data-mining techniques

Analysis of property distribution

Property histograms
A histogram representation of property distribution can easily be obtained for any column in the study table. Simply select the columns of interest, which are then highlighted in black, and click the Histograms icon in the Study Table control panel's toolbar.

The histograms and integrals for the desired properties appear in the Cerius2 Graphs window. This is especially useful for determining the distribution of specific properties (such as LogP) in compound collections and combinatorial libraries. For example, the design of libraries may be tuned to mimic the distribution of specific properties in known drug databases.

Fingerprint histograms
Histogram representation is also available for fingerprint data. The Fingerprints control panel provides access to a histogram representation of the population of the fingerprint bins across the collection of compounds. To open the Fingerprints control panel, select the Descriptors/Fingerprints... item from the menu bar in the Study Table control panel.

To display the histogram(s), define the desired fingerprint column as the independent data column and click the Histogram action button in the Fingerprints control panel.

The example below shows histograms corresponding to a data set of 250 models with one histogram for each of:

In each histogram, the key number N corresponds to a vertical bar situated between N-1 and N on the x-axis. The height of the bar shows the number of models with the corresponding key set.
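The quantity plotted in such a histogram (for each key, the number of models with that key set) can be sketched as below. This is a minimal illustration, assuming all fingerprints are hexadecimal strings of equal length and that key 1 is the most significant bit of the first hexadecimal digit; the actual bit ordering used by Cerius2 is not stated here, and the function name is hypothetical.

```python
def key_counts(hex_fingerprints):
    """Count, for each key (bit position), how many fingerprints set it."""
    n_bits = 4 * len(hex_fingerprints[0])       # 4 bits per hex digit
    counts = [0] * n_bits
    for fp in hex_fingerprints:
        # Expand the hex string to a fixed-width string of '0'/'1'.
        bits = bin(int(fp, 16))[2:].zfill(n_bits)
        for position, bit in enumerate(bits):
            if bit == "1":
                counts[position] += 1
    return counts

# counts[N - 1] is the bar height for key N, drawn between N-1 and N.
counts = key_counts(["A0", "80", "FF"])
print(counts)
```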

Color distribution plots
A Color Distribution Plot control panel, for visualizing large datasets, is opened by selecting the Tools/Graphics/Color plots... item from the menu bar in the Study Table control panel or by clicking the Color Plot icon on the toolbar. The distribution, or color, plots provide 2D representations of the study table by color coding individual cells according to a specific property that you specify. You can set the Color by popup in the Color Distribution Plot control panel to any of these properties:

The Color Distribution Plot control panel provides controls for selecting and/or scrolling to the corresponding column/row upon clicking in the plot and also controls for zooming and scrolling the plot.

Analysis of descriptor correlation

The statistical tools in Cerius2 enable you to analyze sets of descriptors and check for possible correlations between descriptors. This information can be obtained across the full set of descriptors contained in a study table.

To perform this analysis, select the Tools/Statistical/Correlation Matrix item from the menu bar in the Study Table control panel or click the Correlation matrix icon in the toolbar. The correlation coefficient between any pair of descriptors in the study table is output in the Correlation Matrix control panel.

You can include both principal components and descriptors in this analysis. Principal components are orthogonal to each other (correlation coefficient is zero). You can also find out which descriptor(s) best represent the first component and select a meaningful representative that is quick to compute. This may be an efficient way to reduce the number of descriptors used in an analysis.
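To illustrate how a single descriptor can stand in for the first principal component, the sketch below computes PC1 scores by power iteration on standardized descriptor columns and then picks the descriptor with the largest absolute correlation to those scores. It is not the Cerius2 implementation; the function names, the standardization step, and the example data are assumptions made for illustration, and constant columns are not handled.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardize(col):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]

def first_pc_scores(columns, iterations=200):
    """Scores on PC1 of the standardized columns, via power iteration."""
    z = [standardize(c) for c in columns]
    p, n = len(z), len(z[0])
    w = [1.0] * p                               # initial loading vector
    for _ in range(iterations):
        scores = [sum(w[j] * z[j][i] for j in range(p)) for i in range(n)]
        w = [sum(z[j][i] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return [sum(w[j] * z[j][i] for j in range(p)) for i in range(n)]

def best_representative(names, columns):
    """Name of the descriptor most correlated with PC1."""
    pc1 = first_pc_scores(columns)
    corr = {nm: abs(pearson(col, pc1)) for nm, col in zip(names, columns)}
    return max(corr, key=corr.get)

# Hypothetical data: "b" is nearly proportional to "a"; "c" is unrelated.
a = [1, 2, 3, 4, 5, 6]
b = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
c = [5, 1, 4, 2, 6, 3]
print(best_representative(["a", "b", "c"], [a, b, c]))
```

In practice the choice among near-equivalent candidates (here "a" and "b") would also weigh how quickly each descriptor can be computed, as noted above.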

Advanced binning

An enhanced binning functionality assists in choosing a representative subset from a set of compounds characterized by various molecular properties.

This is done by first dividing each molecular property range into a specified number of ranges (bins) and assigning the molecules to the resulting cells in property space, where each cell is characterized by the range (bin) that it covers for each property. Bin boundaries are also referred to as thresholds.

Cell population can then be visualized as a histogram, and the contents of each cell can be browsed by interactive queries. Representative compounds can be selected from each cell manually or through the Select Molecules/Diverse/Cell-based menu item on the COMBI-CHEM I/LIBRARY ANALYSIS card.

It is also possible to add compounds to an existing binning from a secondary library and to have both the histogram and the interactive browser distinguish them visually.

Binning

Binning functionality is accessed through the ADVANCED BINNING card in the COMBI-CHEM I card deck:

The data for binning are loaded into the study table first. Property space is partitioned into cells by dividing each property into bins or ranges as defined on the Define Binning control panel:

Clicking the Load properties button makes the independent numeric study table columns available for binning and lists the corresponding property names. Each property can now be selected and binned separately (possibly according to different criteria) or binned all at once. Available binning criteria are:

Number of stddevs = 0, two bins:
minimum to mean, and
mean to maximum.

Number of stddevs = n, where n > 0, three bins:
minimum to mean - (n × stddev),
mean - (n × stddev) to mean + (n × stddev), and
mean + (n × stddev) to maximum.
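The standard-deviation criterion, together with uniformly spaced bins, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the sample standard deviation is used, and a value equal to a threshold is placed in the lower bin. How Cerius2 treats boundary values, or thresholds that fall outside the minimum-maximum range, is not specified here.

```python
import statistics

def stddev_thresholds(values, n_stddevs):
    """Bin edges (minimum, thresholds..., maximum) for the stddev criterion."""
    lo, hi = min(values), max(values)
    mean = statistics.mean(values)
    if n_stddevs == 0:
        return [lo, mean, hi]                   # two bins, split at the mean
    sd = statistics.stdev(values)               # sample stddev (assumption)
    return [lo, mean - n_stddevs * sd, mean + n_stddevs * sd, hi]

def uniform_thresholds(values, n_bins):
    """Bin edges for n_bins uniformly spaced bins."""
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * i / n_bins for i in range(n_bins + 1)]

def bin_index(value, edges):
    """1-based bin number of value; ties go to the lower bin (assumption)."""
    return 1 + sum(1 for threshold in edges[1:-1] if value > threshold)

values = [1, 2, 3, 4, 5]
edges = stddev_thresholds(values, 1)
print(edges, [bin_index(v, edges) for v in values])
```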

The manual binning method uses the binning table in the Binning Thresholds control panel, which should be opened first by clicking the Display binning table button. The binning table displays the minimum, maximum, and threshold values for the current binning. If available, it also displays the mean and standard deviation.

If the property to be manually adjusted is already present in the binning table and is divided into the required number of bins, the adjustment of minimum, maximum, and/or threshold values is done as follows:

1.   Enter the new value directly into the binning table cell.

2.   Highlight the corresponding property in the Properties list box on the Define Binning control panel.

3.   Click the Bin current property button on the Define Binning control panel.

If the property to be manually binned is not yet in the binning table or is divided into a different number of bins than desired, an extra step is necessary before entering new cell values:

4.   Highlight the property in the Properties list box on the Define Binning control panel, enter the desired number of bins in the Number of bins entry box, and click the Add thresholds to binning table button.

The highlighted property thresholds are readjusted and filled in with the default values corresponding to uniformly spaced bins. These values can now be adjusted as described above.

Saving and loading binnings

Binnings can be saved and retrieved for reuse. Minimum, maximum, and threshold values, as well as property names, are saved as an ASCII .bin file. Clicking Load binning or Save binning in the Define Binning control panel opens the Load Binning or Save Binning control panel, containing a file browser. For this functionality to interface properly with the subsequent analysis step (see below), the order of the independent column names in the study table (the X-column order) must match the order of properties listed in the Define Binning control panel Properties list box.

If you want to load an existing binning only to adjust it and save it again, the study table need not be present.

Analysis

At this stage, the data in the study table are examined as each compound is assigned an appropriate bin. The resulting cell contents are then displayed in the 3D model window as a clickable histogram. To display it, use the Bin Analysis control panel first (accessed by clicking the Analyze binning button on the Define Binning control panel or the Analyze Binning item on the ADVANCED BINNING card) to specify these parameters:

Browser display means that selected cells are shown in a Cell Browser control panel and is possible only if the study table has been loaded from a single SD file whose name appears in the Filename column in the study table.

The histogram labels are of the form


   M  (b1, b2, ..., bN)  (K)   

where M is a cell id used internally to keep track of the cells. It encodes the bin assignments for each property, which are listed next as an N-tuple of integers (b1, b2, ..., bN). For example, cell id 3 in the illustration corresponds to bins (3,1,1,1,1,1,1,1,1), that is, the third bin of the first property, the first bin of the second property, and so on. The terms "first property", "second property", and so on refer to the property ordering shown in the Properties list box in the Define Binning control panel. The number K shows the number of compounds in the cell.
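One encoding that is consistent with the example above (cell id 3 for bins (3,1,1,1,1,1,1,1,1)) is a 1-based mixed-radix number in which the first property varies fastest. The sketch below implements that scheme purely as an illustration; the internal encoding used by Cerius2 is not documented here, and the function names and the choice of three bins per property are assumptions.

```python
def cell_id(bins, bins_per_property):
    """Encode 1-based bin numbers as a 1-based cell id (first property fastest)."""
    cid, stride = 1, 1
    for b, n in zip(bins, bins_per_property):
        cid += (b - 1) * stride
        stride *= n
    return cid

def cell_bins(cid, bins_per_property):
    """Decode a 1-based cell id back into the tuple of 1-based bin numbers."""
    rest = cid - 1
    bins = []
    for n in bins_per_property:
        bins.append(rest % n + 1)
        rest //= n
    return tuple(bins)

def histogram_label(cid, bins_per_property, count):
    """Build a label of the form  M  (b1, b2, ..., bN)  (K)."""
    bins = ", ".join(str(b) for b in cell_bins(cid, bins_per_property))
    return f"{cid}  ({bins})  ({count})"

layout = [3] * 9            # hypothetical: nine properties, three bins each
print(histogram_label(3, layout, 12))
```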

The right-hand endpoints of the histogram bars are shown by cyan dots. Clicking them lists the cell contents in the text window (always) and displays the cell contents in a Cell Browser control panel (if Browse models was checked in the Bin Analysis control panel):

You can cycle through the molecules shown in the display window on the right side of the Cell Browser by clicking the triangular step-forward and -back buttons above the window. You can also select compounds for display by clicking the cells in the Name column of the table on the left side of the Cell Browser.

The Selected column in this table consists of No/Yes toggles. Compounds in rows marked Yes are added to existing row selections in the study table when Select Rows In Study Table is clicked.

Another library can be added to an existing histogram. On the Bin Analysis control panel, select the library to add and click the associated Add from library button. The new additions are marked yellow on the histogram, and the Cell Browser control panel displays a line of text showing the new library information.

The remaining controls on the Bin Analysis control panel are:

Handling large libraries

Handling large libraries of several tens to several hundreds of thousands of structures presents specific challenges. Several tools are available to help you deal with this situation. However, please be aware that dealing with very large libraries may require significant amounts of memory, even if proper guidelines are followed.

Handling structures

Cerius2 enables you to delete compounds after they have been added to a study table while still maintaining a link to the structural data for later analysis. To take advantage of these capabilities, proceed as follows:

1.   Start with an empty study table.

2.   Select the descriptors you want to use.

3.   Add these descriptors to the empty study table.

4.   Import the compounds from an SD file (Molecules/From SD File... item in the Study Table control panel's menu bar).

5.   In the Add Molecules from SD File control panel, click the Preferences... pushbutton.

6.   In the SD File Preferences control panel, check the Delete Model After Adding and Add SD File Name, Type and Index to Table check boxes.

This prevents structures from being maintained in memory.

The remaining limit is the size of the final study table. In our experience, an Indigo2 R10000 with 128 MB of memory easily handles a table of 50,000 rows and 50 descriptors.

File-based system

A file-based system for diversity analysis and QSAR methods facilitates working with large combinatorial libraries or datasets. Molecular descriptors are calculated and saved in compact, binary data files (BDF), which can be accessed directly by Diversity and QSAR methods in Cerius2 without having to load all data into memory at once.

The BINARY DATA FILES card in the COMBI-CHEM I card deck consolidates functionality that enables you to select and use binary data files (BDF). This card contains the following menu items:

Select BDF menu item

The Binary Data File control panel is used to select a BDF file, select the properties to be used as independent or dependent variables, and run a QSAR method that operates on BDF files (PCA, clustering, MDS, FA, LDA, RP). If no properties are set as independent variables, the selected (highlighted) properties are taken as independent variables.

Create BDF menu item

The Create BDF menu has three items: From Study Table, From Data File, and Merge BDF.

Export BDF to Table menu item

The Export BDF data to Study Table control panel is used to import selected columns and rows from the BDF file to the study table.

3D Plot from BDF menu item

The 3Dplot from BDF control panel is used to create a 3D plot in the Cerius2 Model Manager from the selected columns and rows in the BDF file.

BDF Preferences menu item

The BDF Preferences control panel is used to specify options to use with BDF files.

Generation of bdf files by Cerius2 processes

Bdf files are generated by the following Cerius2 processes:

Use of bdf files

The following Diversity and QSAR methods can make use of bdf files:

How to generate bdf files

You can generate bdf files in several ways:

1.   From the Analog Builder

The Analog Builder Preferences control panel allows you to specify the name of a bdf file to be generated as analogs in the combinatorial library are enumerated and sent to the study table to calculate molecular descriptors. All other Analog Builder preferences are in effect when generating a bdf file. Thus, to create a bdf file for a large library, turning on the options to delete the analog and the study table row after properties are calculated for the current analog results in the most efficient use of memory.

2.   Importing molecules from an .sd file

The SD File Preferences control panel allows you to specify the name of a bdf file to be generated when molecules are read from an .sd file into the study table and molecular descriptors are calculated. To minimize the memory requirements, the options to delete the model and the row after adding the molecule to the study table should be used.

Information about the original sd file is included in the bdf file. This information can then be used to try to recover the molecular structures from the sd file when importing data from the bdf file into the study table.

Merging files

Row names are used in file merge operations. This avoids duplicate rows in extend-row merges and avoids matching the wrong values from new columns in extend-column merges. In addition, merging bdf files generated from different SD files does not result in the SD indices being discarded. Each row in the merged file has its own SD index and SD filename.

3.   Exporting data from the study table

You can export existing Cerius2 study tables to bdf files. Columns marked as independent (X) variables in the QSAR study table can be exported directly to a bdf file by selecting the File/Export to BDF menu item from the study table.

When the study table is exported to a bdf file, the column derivations are saved. These are not available for execution while in BDF form, but they are restored to the study table when the bdf file is later imported. The PCA loadings generated with file-based methods are translated to derivation form when imported.

Selecting the File/Export to BDF menu item from the study table opens the Create Binary Data File control panel, which you can use to specify the name of the bdf file you want to create.

4.   Exporting data from an existing ASCII .dat file

You can create bdf files from existing .dat datafiles. Go to the BINARY DATA FILES card in the COMBI-CHEM I card deck and select the Create BDF/From Data File menu item to open the Create Binary Data File control panel. You can use it to create a binary bdf file from an ASCII .dat file. All the columns in the .dat file are exported to the bdf file.

How to use binary data files (BDF)

You can use bdf files for several purposes:

1.   Diversity or QSAR analysis (see 1. below)

2.   PCA (see 2. below)

3.   Select diverse and select similar methods (see 3. below)

4.   Importing data into the study table (see 4. below)

1.   Selecting bdf files for Diversity or QSAR analysis

Once a bdf file has been generated, it can be selected for Diversity or QSAR methods by selecting the Select BDF menu item on the BINARY DATA FILES card, which opens the Binary Data File control panel.

When you select a .bdf file using the file browser on the left side of the control panel (by double-clicking the filename or by selecting the file and then clicking SELECT), the file information box at the top of the control panel is automatically updated to show the filename, the number of rows and descriptors in the file, and information about Rgroups (number of Rgroups and number of fragments in each one), if any. Also, the names of the descriptors present in the file are shown in the list box to the right of the file browser, and the Get Data from Binary Data File checkbox is checked. You can now select which descriptors from the bdf file to use for analysis, by selecting them manually or by using the Select all and Deselect all buttons.

The selected bdf file is now ready to be used. The text area below the toolbar in the study table indicates that data will be read from a bdf file and not from the study table.

Note

If data will be read from the study table (the Get Data from Binary Data File checkbox is unchecked), then the text area reads Data from Study Table.

2.   PCA with bdf files

To run principal component analysis (PCA) using the selected bdf file, just click the RUN button next to the PCA popup in the study table. All the PCA options (set in the Statistical Method Preferences control panel) are valid when using bdf files, except for the option to create 3D plots for the samples (scores) and the descriptors (loadings). Upon successful completion of the PCA run, a new bdf file containing all the rows of the original bdf file, but with the original descriptors replaced by the principal components, is created.

PCA on multiple bdf files

Column selection is performed on a single bdf file. However, you can select additional files as combined input to one PCA run by using the file list box in the BDF Preferences control panel. The output loadings can then be applied to each input bdf file in turn. You do not need to generate actual PC columns (with the associated disk usage) to use the PCs: PC1, PC2, etc., appear in the column-selection box when the name of the .dep file (see below) is input. These can be selected and used in any subsequent analysis, such as select-diverse.

Use of bdf with dependent parameters and derivations

Bdf-based PCA results are available as a set of loadings that can be applied to any other bdf file. These are in a file with the extension .dep. Click the Browse pushbutton near the bottom of the Binary Data File control panel to open the BDF Dependent File control panel, which enables you to load the dependent parameters and derivations. PCA loadings resulting from one bdf file can be applied to any other bdf file without performing any merge. Simply select the alternative bdf file in the Binary Data File control panel and the desired PCA output (.dep file) in the BDF Dependent File control panel.
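Conceptually, applying stored loadings to another data file amounts to projecting each centered row onto the loading vectors, one row at a time, so the whole file never has to be in memory. The sketch below shows that projection only; it does not read .bdf or .dep files, whose formats are not described here, and the function name, the centering convention, and the example numbers are assumptions.

```python
def apply_loadings(rows, means, loadings):
    """Yield principal-component scores for each row, one row at a time.

    rows     : iterable of descriptor rows (e.g. streamed from a file)
    means    : per-descriptor means saved with the PCA result (assumption)
    loadings : list of loading vectors, one per principal component
    """
    for row in rows:
        centered = [value - mean for value, mean in zip(row, means)]
        yield [sum(w * c for w, c in zip(vector, centered))
               for vector in loadings]

# Hypothetical loadings for two descriptors and two components.
means = [1.0, 2.0]
loadings = [[0.6, 0.8], [-0.8, 0.6]]
other_file_rows = [[1.0, 2.0], [2.0, 2.0]]
for scores in apply_loadings(other_file_rows, means, loadings):
    print(scores)
```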

3.   Select diverse and select similar with bdf files

To run select diverse (distance-based and cell-based) and select similar methods using bdf files, just make sure that the bdf file is selected and open the corresponding control panel (Select Diverse, Cell-Based Selection, and Select Similar, respectively) to carry out the selection. Options that apply to bdf files appear in the control panels.

Checking the Create New BDF File with Selected Rows check box creates a new bdf file with all the descriptors that were present in the original bdf file, but only the selected (diverse or similar) rows.

Checking the Import Selected BDF Rows into Study Table check box allows you to input data for selected rows into the study table for further analysis or visualization.

Checking the Import BDF Extremes into Study Table check box selects the rows that correspond to the minimum and maximum values of each descriptor (a maximum of 2P rows, where P is the number of descriptors) and imports them into the study table for analysis and visualization.

Checking the Try to Recover Molecules check box tries to recover the original molecule structures corresponding to the selected rows or extremes when they are imported into the study table. Molecules are recovered from a .sd file that is optionally associated with the bdf file when the bdf file is generated.
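The "extremes" selection described above can be sketched as follows: for each descriptor, take the row holding its minimum and the row holding its maximum, which gives at most 2P distinct rows. The function name, the 0-based row indices, and the tie-breaking rule (first occurrence) are assumptions for illustration.

```python
def extreme_rows(table):
    """Return sorted 0-based indices of rows holding a column min or max.

    table is a list of rows, each a list of P descriptor values.
    """
    n_descriptors = len(table[0])
    picked = set()
    for j in range(n_descriptors):
        column = [row[j] for row in table]
        picked.add(column.index(min(column)))   # first row with the minimum
        picked.add(column.index(max(column)))   # first row with the maximum
    return sorted(picked)

table = [[1, 5], [2, 9], [0, 7], [3, 6], [2, 8]]
print(extreme_rows(table))      # at most 2 * P = 4 rows
```

Fewer than 2P rows result whenever one row holds the extreme value of more than one descriptor.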

4.   Importing data from bdf files into the study table

The Export BDF to Table pushbutton near the bottom of the Binary Data File control panel opens the Export BDF data to Study Table control panel, which allows you to move data from the selected bdf file directly into the study table. Use the radio buttons to input all rows from the file, a range of rows, every Nth row, rows selected at random from the file, or rows specified in a bdf_rows file (see below).
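The row-selection modes offered by the radio buttons can be expressed as a small helper. This is a sketch only: the function name, the mode labels, the 1-based row numbering, and the choice to start "every Nth row" at row 1 are assumptions, not documented Cerius2 behavior.

```python
import random

def select_rows(n_rows, mode, start=1, stop=None, step=1, count=0, seed=None):
    """Return 1-based row numbers for one of the selection modes."""
    if mode == "all":
        return list(range(1, n_rows + 1))
    if mode == "range":
        last = n_rows if stop is None else stop
        return list(range(start, last + 1))
    if mode == "every_nth":
        return list(range(1, n_rows + 1, step))     # rows 1, 1+N, 1+2N, ...
    if mode == "random":
        rng = random.Random(seed)                   # seed for repeatability
        return sorted(rng.sample(range(1, n_rows + 1), count))
    raise ValueError(f"unknown mode: {mode}")

print(select_rows(10, "range", start=3, stop=5))
print(select_rows(10, "every_nth", step=3))
print(select_rows(10, "random", count=4, seed=0))
```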

Only the descriptors selected in the list are imported into the study table. If Recover Molecules is checked and there is an .sd file associated with the selected bdf file, the molecule structures corresponding to the imported rows are recovered and placed in the Cerius2 Model Manager.

Data imported from bdf files into the study table can be visualized and analyzed like data from any other source. The only difference is that, when creating 3D plots in which the data are normalized, the normalization factors should correspond to all the molecules (rows) in the bdf file, not to only the rows that were imported into the study table. To facilitate this, the 3D Plot Samples control panel has a Normalization Data from Current BDF File check box to indicate that the normalization factors should not be calculated from the data in the study table, but that they should be obtained from the currently selected bdf file.
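The effect of taking normalization factors from the whole file rather than from the imported rows can be seen in a small example. The sketch assumes min-max scaling to the range 0 to 1; the normalization actually used for 3D plots is not specified here, and the function names and data are hypothetical.

```python
def minmax_factors(rows):
    """Per-column (minimum, maximum) pairs."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def normalize(rows, factors):
    """Scale each column to 0..1 using the supplied (min, max) factors."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi != lo else 0.0
            for value, (lo, hi) in zip(row, factors)
        ])
    return scaled

full_file = [[0, 10], [5, 20], [10, 30]]    # all rows in the data file
imported = full_file[:2]                    # only the rows in the study table

print(normalize(imported, minmax_factors(full_file)))   # consistent with file
print(normalize(imported, minmax_factors(imported)))    # distorted by subset
```

With factors from the full file, the imported points keep the positions they would have in a plot of all rows, which is what makes plots of different subsets comparable.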

Plotting selected rows

If you want to plot only certain rows, you can select them by using a file of row numbers. Select the 3D Plot from BDF pushbutton on the Binary Data File control panel to open the 3Dplot from BDF control panel. Then choose the Rows from File radio button and enter a filename in the File entry box. The file should contain only row numbers, one per line. A file with the selected BDF rows is automatically created when you run any of these methods: select diverse (distance and cell based), select similar, R-group subsetting. This file is named bdf_rows and is placed in the run directory.
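A row-number file of the kind described above (one row number per line) is straightforward to read or to prepare by hand. The helpers below are a minimal sketch; the function names are hypothetical, and skipping blank lines is a convenience assumption rather than documented behavior.

```python
def parse_row_numbers(lines):
    """Read row numbers, one per line, from any iterable of text lines."""
    rows = []
    for line in lines:
        text = line.strip()
        if text:                    # tolerate blank lines (assumption)
            rows.append(int(text))
    return rows

def format_row_numbers(rows):
    """Produce file contents with one row number per line."""
    return "".join(f"{row}\n" for row in rows)

contents = format_row_numbers([3, 17, 42])
print(parse_row_numbers(contents.splitlines()))
```

Because parse_row_numbers accepts any iterable of lines, an open file object can be passed to it directly.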

Selecting rows according to specifications in the bdf_rows file allows you to visualize selected subsets after a Diversity calculation, by following these steps:

1.   Before running the Diversity calculation (select diverse, select similar, R-group subsetting) create a 3D plot for all the rows in the bdf file.

2.   When the diversity calculation is done, a file with the selected BDF rows, called bdf_rows, is automatically created and used to generate a new 3D plot, named BDF Selection, with only the selected rows displayed in red. You can now overlay the two 3D plots to visualize the selection among all points in the bdf file.

Color coding by property

A selected property can be used to color code the 3D plot by clicking the Color using Selected Property button in the 3Dplot from BDF control panel, then reselecting the three columns you want to plot.

Handling descriptors

You may be able to reduce the set of descriptors used in the analysis and therefore allow more compounds to be handled in the Study Table. The following procedure may be applied to select a reduced set of descriptors for analysis:

1.   Start from an empty Study Table.

2.   Select the descriptors you want to use.

3.   Add these descriptors to the empty Study Table.

4.   Select a subset of the models to be imported. (The Add Molecules from SD File control panel enables you to select a range of compounds to be imported rather than the complete SD file. Select the Range button on the Add Molecules from SD File control panel and specify the desired range of models to be imported.)

5.   Perform the PCA on the compound subset.

6.   Calculate the correlation matrix between the descriptors and the principal components.

7.   Eliminate some descriptors from your choice, based on their correlation with other descriptors. Alternatively, you can choose to represent each of the principal components by one descriptor (based on its correlation coefficient with the principal component and speed of computation).

8.   Restart from an empty Study Table.

9.   Add the reduced set of descriptors to the empty Study Table.

10.   Import the full set of models for analysis.
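Step 7 of the procedure above (eliminating descriptors that are strongly correlated with others) can be sketched as a greedy filter on the pairwise correlation coefficients. This is one simple way to do it, not the method prescribed by Cerius2; the function names, the 0.9 threshold, and the example data are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(names, columns, threshold=0.9):
    """Keep a descriptor only if it is not strongly correlated with any
    descriptor already kept (greedy, in the order given)."""
    kept_names, kept_columns = [], []
    for name, column in zip(names, columns):
        if all(abs(pearson(column, other)) < threshold
               for other in kept_columns):
            kept_names.append(name)
            kept_columns.append(column)
    return kept_names

# Hypothetical descriptor values for a subset of six models.
a = [1, 2, 3, 4, 5, 6]
b = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]     # nearly proportional to "a"
c = [5, 1, 4, 2, 6, 3]
print(drop_correlated(["a", "b", "c"], [a, b, c]))
```

Listing the cheaper-to-compute descriptors first makes the greedy filter prefer them when two descriptors carry nearly the same information.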




Last updated May 19, 2000 at 01:51PM Pacific Daylight Time.
Copyright © 2000, Molecular Simulations Inc. All rights reserved.