| Combi-Chem |

Choosing molecular diversity descriptors
The appropriate model descriptors to apply to a diversity problem depend on several factors:
Some of the descriptors listed above require a C2·Descriptor+ license.
3D field descriptors
3D field descriptors from MFA may be used for diversity applications. They can be found in the FIELD ANALYSIS (MFA) card located in the QSAR deck.
Fingerprint descriptors
Fingerprint data can be calculated and stored in the study table. Fingerprints are analyzed and used as descriptors for diversity and similarity calculations. The fingerprints are displayed as hexadecimal digit character strings in the table.
Select the Descriptors/Select... item on the toolbar in the Study Table control panel to open the Descriptors control panel. From this panel, choose ISIS keys from the Descriptors in family popup and click the ISIS key... pushbutton. This opens the ISIS keys control panel, which contains check boxes for selecting either the full set or the public subset of ISIS keys.
After key selection, add the ISIS_key descriptor column(s) to the study table by clicking the ADD pushbutton in the Descriptors control panel. If the models are already present in the table, ISIS Host starts up and the fingerprints are evaluated and displayed. Alternatively, model addition can be deferred until after the descriptor columns have been added to the table.
Daylight fingerprints
Daylight fingerprints can be calculated and loaded into the study table using the Descriptors control panel.
The Daylight interface for input/output of SMILES strings and for calculating fingerprints and other descriptors is available only for SGI machines running IRIX 6.2 or higher.
Fingerprint columns added as descriptors to the study table (and set as independent variables) can be used in MDS, clustering (except for relocation clustering), selection of diverse models (distance-based), and selection of similar models. Running MDS automatically generates a 3D plot of the first three MDS components.
Catalyst descriptors
Catalyst HypoFit descriptors
Descriptors based on fit to Catalyst pharmacophores or hypotheses can be calculated and included in the study table by selecting the appropriate entry in the combichem descriptors database.
1. Create the Catalyst database (.bdb) file if it hasn't been created
yet. To do this, you need an .sd file containing the study models.
Assuming that example.sd contains the study models, the
following Catalyst commands may be used to create the database
file:
> catDB CONFIG example.bdb
> catDB sd example.sd example.bdb MaxConfs=20 No1D
2. With the Study Table control panel open, select Descriptors/ Select... to open the Descriptors control panel. Set the Descriptors in family popup to HypoFit, then click the HypoFit... pushbutton. This opens the Hypothesis Fitting control panel.
The panel contains two file browsers, one for database file (.bdb) selection and one for hypothesis file (.chm) selection. The hypothesis file browser allows more than one file to be selected. When the Flexible fit check box is checked, flexible hypothesis fitting is performed in addition to the default rigid hypothesis fitting.
3. Now click the ADD pushbutton in the Descriptors control
panel to add the HypoFit descriptors to the study table. If the
study table already contains the models, catSearch and hypofitDriver
begin executing as separate processes, resulting in an
.esp file containing the results of the requested hypothesis fitting.
These results are then loaded into the study table. If the
models are not in the study table, then the cells are computed
(and catSearch/hypofitDriver run) whenever the models are
added to the study table. If the output .esp file resulting from a
catSearch/hypofitDriver run corresponding to the specified
database/hypothesis already exists, its contents are loaded into
the study table immediately, and catSearch/hypofitDriver is
not run.
The following naming conventions are used. Assuming the database file name is database.bdb and one of the hypothesis file names is hypo.chm:
The output .esp file(s) produced by catSearch/hypofitDriver are named hypo_1_out_rigid.esp for results of rigid fit, or hypo_1_out_flex.esp for results of flexible fit, and are saved in the local directory in which Cerius2 is run. The _1 unique extension in the .esp filename matches the :1 unique extension of the corresponding column name.
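The naming convention for the output files can be sketched in Python as follows. This is only an illustration of the pattern described above; the helper names are hypothetical and are not part of Cerius2 or Catalyst.

```python
from pathlib import Path


def esp_output_name(hypothesis_file: str, index: int, flexible: bool = False) -> str:
    """Build the .esp name expected for a hypothesis file and a column index.

    hypothesis_file: name of the .chm file, for example "hypo.chm"
    index: the unique extension shared by the file name (_1) and the column name (:1)
    """
    stem = Path(hypothesis_file).stem
    fit = "flex" if flexible else "rigid"
    return f"{stem}_{index}_out_{fit}.esp"


def existing_result(hypothesis_file: str, index: int, flexible: bool = False, run_dir: str = "."):
    """Return the path of an already computed .esp file in run_dir, or None."""
    candidate = Path(run_dir) / esp_output_name(hypothesis_file, index, flexible)
    return candidate if candidate.is_file() else None


print(esp_output_name("hypo.chm", 1))
print(esp_output_name("hypo.chm", 1, flexible=True))
```

Checking for the file before a run mirrors the behavior described earlier, where an existing .esp file is loaded instead of rerunning catSearch/hypofitDriver.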
If the output descriptor .esp file contains values for models not in the study table, these are ignored. If there are study models for which descriptors are not available in the output .esp file, warning messages are displayed in the text window.
> mv hypo_5_out_rigid.esp hypo_3_out_rigid.esp
The descriptors are now loaded from hypo_3_out_rigid.esp directly, without running catSearch/hypofitDriver.
Catalyst CatShape descriptors
This release of Cerius2 contains enhancements to QSAR+ that enable users to add CatShape descriptors to the QSAR study table. CatShape descriptors include minimum, maximum, range, and average values for the molecular volume and the extents along three axes aligned with the principal moments of inertia of the conformers of each molecule. The procedure for obtaining the CatShape descriptors is briefly discussed here, but please refer to the Catalyst documentation for further details.
1. Create a Catalyst database (.bdb file) from an SD file containing the models for which you want to calculate the CatShape descriptors.
Daylight descriptors
In addition to Daylight fingerprints, this release gives access to these Daylight descriptors:
Kier and Hall E-state descriptors
Kier and Hall electrotopological descriptors (E-state) are accessed by setting the Descriptors in family popup in the Descriptors control panel to E_state_keys and then clicking the E_state_keys... pushbutton. The E-state Fingerprints control panel (shown below) allows you to select the type of descriptors: E-state type sums (sum of the electrotopological descriptors for each atom type), E-state type counts (counts of each atom type in the model), and E-state type indicators (presence or absence of each atom type). You can also specify which elements are to be taken into account.
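The three descriptor types differ only in how the per-atom values are aggregated for each atom type. A minimal Python sketch follows, assuming the per-atom E-state values and atom-type assignments are already available; the numeric values below are invented for illustration, and the function is hypothetical rather than a Cerius2 routine.

```python
from collections import defaultdict


def estate_type_descriptors(atoms, atom_types):
    """Aggregate per-atom E-state values into sums, counts, and indicators.

    atoms: iterable of (atom_type, estate_value) pairs for one model
    atom_types: the atom types to report, in column order
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for atom_type, value in atoms:
        sums[atom_type] += value
        counts[atom_type] += 1
    return {
        t: {
            "sum": sums.get(t, 0.0),                 # E-state type sum
            "count": counts.get(t, 0),               # E-state type count
            "indicator": int(counts.get(t, 0) > 0),  # presence (1) or absence (0)
        }
        for t in atom_types
    }


# Invented values for a small model with two methyl carbons and one carbonyl oxygen.
atoms = [("sCH3", 1.5), ("sCH3", 2.0), ("dO", 10.25)]
result = estate_type_descriptors(atoms, ["sCH3", "dO", "sOH"])
for atom_type, values in result.items():
    print(atom_type, values)
```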
Calculating molecular diversity descriptors
To select a database, select the Descriptors/Databases... menu item in the Study Table control panel. This opens the Descriptor Database control panel, which lets you change the currently installed database set by setting a popup (the choices are QSAR, COMBICHEM, QSPR, and Other...) and then clicking the OPEN DATABASE pushbutton. If you are performing a diversity analysis, make sure the currently selected database is COMBICHEM.
The descriptors calculation runs faster if done row-wise, because the model can be cached. Caching is done when you add the descriptors to the table before you generate the analogs (if possible).
Descriptor calculation for combinatorial libraries (C2·LibEngine)
LibEngine produces fingerprints and other (Lipinski) descriptors and optionally runs clustering on large combinatorial libraries. This is done without having to enumerate the library structures, but produces fingerprints and descriptors which are identical to those obtained on enumerated structures.
Select COMBI-CHEM II from the list of menu decks, and click LIB ENGINE to bring the LibEngine card forward. Click Setup and Run to launch the Run LibEngine control panel.
5. Select one or more of the following:
6. Click RUN LIB_ENGINE to run the fingerprint/descriptor/cluster generation program.
Loading molecular diversity descriptors
Diversity assessment can also be initiated by loading previously calculated descriptor data into a study table.
Managing molecular diversity descriptors
Several tools are available to help you manage the set of descriptors to be used for characterizing molecular diversity. The descriptors that are used for most analysis procedures in Cerius2 (QSAR regressions, principal component analysis, factor analysis, multidimensional scaling, and cluster analysis) are defined as independent variables and labeled with an X in their column headings in the study table.
With the Manage descriptors control panel, you can:
1. Create Cerius2 model from MOL or SMILES (slow!).
2. Cache Cerius2 model into local_mol structure.
4. Post results to study table row (slow!).
5. Export row to BDF and/or datafile.
For each molecule, the fast descriptor calculation requires only three steps, omitting the slow steps above:
1. Cache molecule information in local_mol structure.
3. Export values to BDF and/or datafile.
Performing fast descriptor calculation
The Fast Descriptors control panel can be accessed in several ways:
To input molecules from an SD file, select the SD option in the File popup (in the Fast Descriptors control panel) and click the Select Molecules from button. The Select SD file for Fast Descriptors control panel appears.
To input molecules from a SMILES file, select the SMILES option in the File popup (in the Fast Descriptors control panel) and click the Select Molecules from button. The Select SMILES file for Fast Descriptors control panel appears.
These control panels for input of molecules enable you to select all or specific molecules from the files, to calculate descriptors for the largest fragment only or for all atoms in the molecule, and to choose either the Daylight or the internal Cerius2 reader to parse SMILES strings.
The descriptors to be calculated are selected by clicking the Select Descriptors button in the Fast Descriptors control panel, which opens the Select Fast Descriptors control panel.
The table on the left side of the Select Fast Descriptors control panel contains all the descriptors that can be calculated by the fast method. Selecting one or more rows in this table and clicking the arrow between the table and the list box to its right enters the corresponding descriptors into the list box. Some descriptors, such as MW and Rotbonds, consist of only one value and therefore create only one entry in the list box. Other descriptors, such as AlogP types and Chi indices, are actually descriptor groups and create multiple entries in the list box.
The number of descriptors generated for each descriptor group is controlled by the preferences for the corresponding descriptor family. Set these preferences by choosing the desired family (Structural, Topological, Information, Thermodynamic, E_State_keys, or Substructure) from the Set Preferences for popup and clicking the Set Preferences for button.
The Load Fast Descriptors Set and Save Fast Descriptors Set control panels allow you to load and save specific sets of descriptors for later use. Several predefined sets are provided in the Cerius2-Resources/COMBICHEM/demos directory, including combi_fast.fds (44 structural, thermodynamic, and topological descriptors), fastdesc_structural.fds (MW, Rotbonds, HBA, HBD, AlogP98), fastdesc_topological.fds (37 topological descriptors), and fastdesc_atomtypes.fds (154 AlogP atom type and E-state key descriptors).
In the Study Table, select Descriptors/3D Fingerprints... to bring up the 3D Fingerprints control panel.
If a feature file does not exist, it must be created using CatFeatures. Although this program is not a part of Cerius2 and can be executed independently, the catFeatures command can be issued from the Create Features File panel.
1. On the 3D Fingerprints panel, click the Create Features File
button to bring up the Create Features File control panel.
2. Under Select features, select the features to be included in the
feature dictionary.
3. Select the Catalyst database file (.bdb file) to be converted.
5. Click CREATE FEATURES FILE.
Creating the features file outside Cerius2
Make sure that you have sourced the script <C2_install_dir>/cat400/cshrc to set up the required Catalyst environment.
1. Build a Catalyst database (.bdb files) from an SD file.
3. Enter the following command after the Unix prompt (all on one line):
> $CATALYST_BIN/catFeatures <name>.bdb -getMapping -allHitConfs
-featuresFile features_file -maxhits 9999999 -mappedOutputFile <name>.fea
1. On the 3D Fingerprints panel, click the Create Fingerprint File button to open the Create Binary Fingerprint File control panel.
2. Under Select features, select the features to be included in the
fingerprint file.
3. Select the feature file (.fea file) to be converted.
7. Click CREATE 3D FINGERPRINTS FILE.
Both the input FEA and the output 3PF/4PF files can be specified in the usual manner. Only those features selected under Select features may be included in the 3PF/4PF fingerprint file. Any other features will be ignored. Note, however, that selected features that are not present in the FEA file will also be ignored (with a warning message in the textport).
1. From the Study Table, select Descriptors/Select... to bring up the Descriptors control panel.
The 3D fingerprints are evaluated and stored in a binary file (the 3PF/4PF file). Because of their size they are not loaded into the study table but read from the disk as needed for calculations. However, a memory buffer of adjustable size is provided to reduce excessive disk input/output. The study table cells are filled with numbers of pharmacophores present in the compounds.
Selecting similarity coefficients and displaying pharmacophores
Similarity coefficients are selected using the Similarity Coefficient popup on the 3D Fingerprints preference panel. First open the preference panel by selecting Descriptors/3D Fingerprints... from the study table toolbar:
List 3-Point Pharmacophore
File name: ./test.3df
size: 10, grid spacing: 2.00
Features present in the file: NEG POS NEGI POSI HBA HBD RING HYD
HBA POS NEG 8.00 15.23 8.49
File name: ./mao_1634.3pf
Grid size: 10, grid spacing: 2.00
Features present in the file: NEG POS NEGI POSI HBA HBD RING HYD
Compound name: 92
Pharmacophores d12 d13 d23
1. HYD HBA HBA 2.00 4.47 2.83
2. HBA HBA HYD 2.00 2.83 2.00
3. HYD HBA HYD 2.00 4.47 2.83
4. HYD HBA HYD 2.00 4.47 4.00
File name: ./monopep.4pf
Grid size: 10, grid spacing: 2.00, min. separation: 1.00
Features present in the file: NEG POS HBA HBD
Compound name: monopeptide-lys
Pharmacophores d12 d13 d23 d14 d24 d34 sign
1. HBD HBA HBA HBD 4.00 4.47 2.00 6.63 3.46 2.83 +
2. HBD HBA HBA HBD 4.00 6.32 2.83 6.63 3.46 2.00 -
3. HBD HBA HBA HBD 6.00 6.32 2.00 6.32 2.00 2.83 +
4. HBD HBA HBA HBD 6.00 6.32 2.00 6.63 2.83 2.00 +
Maximum Memory 3D Fingerprints (Mb)
2. Set the browser filter under Include only Pharmacophores Containing.
3. Click Browse Pharmacophores for Selected Row and Column.
4. Under Select Row, choose a pharmacophore for viewing.
5. The index pair representing the selected pharmacophore appears in the data entry boxes under Plot 4-Point Pharmacophore (see figure above).
The pharmacophore plot for this browser example above might look like:
File conversion utilities
Three binary file conversion utilities can be accessed from the File Conversion Utilities panel:
Type in the input file name in the supplied data entry box and click the corresponding pushbutton.
The name of the output file is automatically generated by stripping the existing .3df/.3pf/.4pf suffix and replacing it with .3pf/.3tx/.4tx, respectively. If the input file name does not end in these suffixes, the new suffix is merely appended.
If the automatically generated output file name would overwrite an existing file, a warning message appears and the user has the option either to proceed or to rename the input (or the existing output) file to avoid losing a file.
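The output-name rule and the overwrite check described above can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that each utility is identified by the input suffix it expects.

```python
import os

# Input suffix expected by each conversion utility, and the suffix of its output.
SUFFIX_MAP = {".3df": ".3pf", ".3pf": ".3tx", ".4pf": ".4tx"}


def converted_name(input_name: str, expected_suffix: str) -> str:
    """Return the automatically generated output name for one conversion."""
    new_suffix = SUFFIX_MAP[expected_suffix]
    if input_name.endswith(expected_suffix):
        # Strip the existing suffix and replace it.
        return input_name[: -len(expected_suffix)] + new_suffix
    # Otherwise the new suffix is merely appended.
    return input_name + new_suffix


def would_overwrite(input_name: str, expected_suffix: str) -> bool:
    """True if the generated output name already exists on disk."""
    return os.path.exists(converted_name(input_name, expected_suffix))


print(converted_name("test.3df", ".3df"))
print(converted_name("monopep", ".4pf"))
```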
The pharmacophore can be defined by entering the feature types and inter-feature distances, or by entering the pharmacophore index from an existing 3D fingerprint file. The resulting .chm query file can then be used in a 3D database search.
The pharmacophore can be defined by entering the feature types and inter-feature distances, or by entering the pharmacophore index from an existing 3D fingerprint file.
This tool combines two binary fingerprint files of the same type (either 3PF or 4PF) into a single file of that type. This eliminates the need to recalculate the fingerprints from FEA files when adding compounds and their fingerprints to existing binary files.
Files to be merged must satisfy the following criteria:
1. Find the first binary file in the browser on the left hand side of the control panel and click Select first binary 3D fingerprint file or simply type the new name in the text box.
4. Click MERGE 3PF/4PF FILES to merge the two files.
Loading binary fingerprint files into the study table
You can load single 3PF/4PF fingerprint files into the study table directly, without using the SD file. This operation overwrites any existing information in the table. The models themselves are not loaded into Cerius2, but the statistical analysis tools accessible from the table can be applied to the data.
1. Select the 3D fingerprint file either through a browser (opened
by clicking Browse...) or by typing the 3PF/4PF file name into
the text box.
Statistical analyses and data-mining techniques
Analysis of property distribution
Property histograms
A histogram representation of property distribution can easily be obtained for any column in the study table. Simply select the columns of interest, which then are highlighted in black, and click the Histograms icon in the Study Table control panel's toolbar.
Fingerprint histograms
Histogram representation is also available for fingerprint data. The Fingerprints control panel provides access to a histogram representation of the population of the fingerprint bins across the collection of compounds. To open the Fingerprints control panel, select the Descriptors/Fingerprints... item from the menu bar in the Study Table control panel.
To perform this analysis, select the Tools/Statistical/Correlation Matrix item from the menu bar in the Study Table control panel or click the Correlation matrix icon in the toolbar. The correlation coefficient between any pair of descriptors in the study table is output in the Correlation Matrix control panel.
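For readers who want to check a value from the matrix by hand, the following Python sketch computes pairwise correlation coefficients between descriptor columns. It assumes the Pearson coefficient, which the text does not state explicitly; the column values are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric columns.

    Raises ZeroDivisionError if either column is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """columns: dict mapping descriptor name to its list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented study-table columns for four models.
columns = {
    "MW": [100.0, 200.0, 300.0, 400.0],
    "HBA": [1.0, 2.0, 3.0, 4.0],
    "Rotbonds": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```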
You can include both principal components and descriptors in this analysis. Principal components are orthogonal to each other (correlation coefficient is zero). You can also find out which descriptor(s) best represent the first component and select a meaningful representative that is quick to compute. This may be an efficient way to reduce the number of descriptors used in an analysis.
This is done by first dividing each molecular property range into a specified number of ranges (bins) and assigning the molecules to the resulting cells in property space, where each cell is characterized by the range (bin) that it covers for each property. Bin boundaries are also referred to as thresholds.
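The assignment of molecules to cells can be sketched in Python as follows. This is an illustration only: the data are invented, and the choice that a value lying exactly on a threshold falls into the higher bin is an assumption of this sketch, not documented behavior.

```python
from bisect import bisect_right
from collections import Counter


def bin_index(value, thresholds):
    """Return the 1-based bin number of value.

    thresholds: ascending interior bin boundaries of one property.
    Assumption: a value equal to a threshold goes into the higher bin.
    """
    return bisect_right(thresholds, value) + 1


def assign_cells(rows, thresholds_per_property):
    """Map each row of property values to its cell, a tuple of bin numbers."""
    return [
        tuple(bin_index(v, t) for v, t in zip(row, thresholds_per_property))
        for row in rows
    ]


# Invented data: two properties, each divided into three bins.
rows = [(150.0, 1.2), (320.0, 4.8), (310.0, 4.1)]
thresholds = [[200.0, 300.0], [2.0, 4.0]]
cells = assign_cells(rows, thresholds)
population = Counter(cells)
print(cells)
print(population)
```

The Counter holds the cell populations, that is, the number of molecules per occupied cell.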
Cell population can then be visualized as a histogram, and the contents of each cell can be browsed by interactive queries. Representative compounds can be selected from each cell manually or through the Select Molecules/Diverse/Cell-based menu item on the COMBI-CHEM I/LIBRARY ANALYSIS card.
It is also possible to add compounds to an existing binning from a secondary library and to have both the histogram and the interactive browser distinguish them visually.
The data for binning are loaded into the study table first. Property space is partitioned into cells by dividing each property into bins or ranges as defined on the Define Binning control panel:
Clicking the Load properties button makes the independent numeric study table columns available for binning and lists the corresponding property names. Each property can now be selected and binned separately (possibly according to different criteria) or binned all at once. Available binning criteria are:
mean, and
maximum.
If the property to be manually adjusted is already present in the binning table and is divided into the required number of bins, the adjustment of minimum, maximum, and/or threshold values is done as follows:
1. Enter the new value directly into the binning table cell.
3. Click the Bin current property button on the Define Binning
control panel.
If the property to be manually binned is not yet in the binning table or is divided into a different number of bins than desired, an extra step is necessary before entering new cell values:
4. Highlight the property in the Properties list box on the Define
Binning control panel, enter the desired number of bins in the
Number of bins entry box, and click the Add thresholds to
binning table button.
The highlighted property thresholds are readjusted and filled in with the default values corresponding to uniformly spaced bins. These values can now be adjusted as described above.
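Default thresholds for uniformly spaced bins follow directly from the minimum, the maximum, and the number of bins. A minimal Python sketch, with a hypothetical function name:

```python
def uniform_thresholds(minimum, maximum, n_bins):
    """Interior thresholds that split [minimum, maximum] into n_bins equal bins."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    width = (maximum - minimum) / n_bins
    # n_bins bins are separated by n_bins - 1 interior thresholds.
    return [minimum + i * width for i in range(1, n_bins)]


print(uniform_thresholds(0.0, 10.0, 5))
```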
Saving and loading binnings
Binnings can be saved and retrieved for reuse. Minimum, maximum, and threshold values, as well as property names, are saved as an ASCII .bin file. Clicking Load binning or Save binning in the Define Binning control panel opens the Load Binning or Save Binning control panel, containing a file browser. For this functionality to interface properly with the subsequent analysis step (see below), the order of the independent column names in the study table (the X-column order) must match the order of properties listed in the Define Binning control panel Properties list box.
Analysis
At this stage, the data in the study table are examined as each compound is assigned an appropriate bin. The resulting cell contents are then displayed in the 3D model window as a clickable histogram. To display it, use the Bin Analysis control panel first (accessed by clicking the Analyze binning button on the Define Binning control panel or the Analyze Binning item on the ADVANCED BINNING card) to specify these parameters:
The histogram labels are of the form
M (b1, b2, ..., bN) (K)
where M is a cell id used internally to keep track of the cells. It encodes the bin assignments for each property, which are listed next as N-tuples of integers (b1, b2, ..., bN). For example, cell id 3 in the illustration corresponds to bins (3,1,1,1,1,1,1,1,1), that is, the third bin of the first property, the first bin of the second property, and so on. The terms "first" property, "second" property, and so on refer to the property ordering shown in the Properties list box in the Define Binning control panel. The number K shows the number of compounds in the cell.
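One way a single integer can encode a tuple of bin numbers is a mixed-radix scheme in which the first property varies fastest. The Python sketch below uses that scheme purely as an assumption: it reproduces the single documented example (cell id 3 for bins (3,1,...,1)), but the actual internal encoding is not described in the text.

```python
def cell_id(bins, bins_per_property):
    """Encode 1-based bin numbers as a 1-based cell id (assumed scheme)."""
    cid, stride = 1, 1
    for b, n in zip(bins, bins_per_property):
        cid += (b - 1) * stride
        stride *= n
    return cid


def bins_from_cell_id(cid, bins_per_property):
    """Invert cell_id: recover the tuple of 1-based bin numbers."""
    rest = cid - 1
    bins = []
    for n in bins_per_property:
        bins.append(rest % n + 1)
        rest //= n
    return tuple(bins)


def histogram_label(cid, bins, count):
    """Format a label of the form M (b1, b2, ..., bN) (K)."""
    return f"{cid} ({', '.join(str(b) for b in bins)}) ({count})"


sizes = [3] * 9  # nine properties with three bins each (invented)
example = (3, 1, 1, 1, 1, 1, 1, 1, 1)
print(cell_id(example, sizes))
print(histogram_label(cell_id(example, sizes), example, 12))
```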
The right-hand endpoints of the histogram bars are shown by cyan dots. Clicking them lists the cell contents in the text window (always) and displays the cell contents in a Cell Browser control panel (if Browse models was checked in the Bin Analysis control panel):
You can cycle through the molecules shown in the display window on the right side of the Cell Browser by clicking the triangular step-forward and -back buttons above the window. You can also select compounds for display by clicking the cells in the Name column of the table on the left side of the Cell Browser.
The Selected column in this table consists of No/Yes toggles. Compounds in rows marked Yes are added to existing row selections in the study table when Select Rows In Study Table is clicked.
Another library can be added to an existing histogram. On the Bin Analysis control panel, select the library to add and click the associated Add from library button. The new additions are marked yellow on the histogram, and the Cell Browser control panel displays a line of text showing the new library information.
The remaining controls on the Bin Analysis control panel are:
1. Start with an empty study table.
2. Select the descriptors you want to use.
3. Add these descriptors to the empty study table.
5. In the Add Molecules from SD File control panel, click the Preferences...
pushbutton.
This prevents structures from being maintained in memory.
File-based system
A file-based system for diversity analysis and QSAR methods facilitates working with large combinatorial libraries or datasets. Molecular descriptors are calculated and saved in compact, binary data files (BDF), which can be accessed directly by Diversity and QSAR methods in Cerius2 without having to load all data into memory at once.
Bdf files are generated by the following Cerius2 processes:
The following Diversity and QSAR methods can make use of bdf files:
You can generate bdf files in several ways:
The Analog Builder Preferences control panel allows you to specify the name of a bdf file to be generated as analogs in the combinatorial library are enumerated and sent to the study table to calculate molecular descriptors. All other Analog Builder preferences are in effect when generating a bdf file. Thus, to create a bdf file for a large library, turning on the options to delete the analog and the study table row after properties are calculated for the current analog results in the most efficient use of memory.
The SD File Preferences control panel allows you to specify the name of a bdf file to be generated when molecules are read from an .sd file into the study table and molecular descriptors are calculated. To minimize the memory requirements, the options to delete the model and the row after adding the molecule to the study table should be used.
Row names are used in file merge operations. This avoids duplicate rows in extend-row merges and avoids matching the wrong values from new columns in extend-column merges. In addition, merging bdf files generated from different SD files does not result in the SD indices being discarded. Each row in the merged file has its own SD index and SD filename.
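The role of row names in the two kinds of merge can be sketched in Python, using dictionaries keyed by row name in place of bdf files. The function names are hypothetical, and padding unmatched rows with None in the extend-column case is an assumption made for this sketch.

```python
def extend_rows(table, other):
    """Extend-row merge: add rows from other, skipping names already present."""
    merged = dict(table)
    for name, values in other.items():
        merged.setdefault(name, values)
    return merged


def extend_columns(table, new_columns):
    """Extend-column merge: append new column values to the row of the same name.

    Rows with no match in new_columns are padded with None (assumption).
    """
    width = len(next(iter(new_columns.values()), []))
    return {
        name: values + new_columns.get(name, [None] * width)
        for name, values in table.items()
    }


table_a = {"mol1": [1.0], "mol2": [2.0]}
table_b = {"mol2": [9.0], "mol3": [3.0]}
print(extend_rows(table_a, table_b))

# The new columns arrive in a different row order; matching is by name.
print(extend_columns(table_a, {"mol2": [20.0], "mol1": [10.0]}))
```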
You can export existing Cerius2 study tables to bdf files. Columns marked as independent (X) variables in the QSAR study table can be exported directly to a bdf file by selecting the File/Export to BDF menu item from the study table.
You can create bdf files from existing .dat datafiles. Go to the BINARY DATA FILES card in the COMBI-CHEM I card deck and select the Create BDF/From Data File menu item to open the Create Binary Data File control panel. You can use it to create a binary bdf file from an ASCII .dat file. All the columns in the .dat file are exported to the bdf file.
You can use bdf files for several purposes:
1. Diversity or QSAR analysis (1.)
3. Select diverse and select similar methods (3.)
4. Importing data into the study table (4.)
1. Selecting bdf files for Diversity or QSAR analysis
Once a bdf file has been generated, it can be selected for Diversity or QSAR methods by selecting the Select BDF menu item on the BINARY DATA FILES card, which opens the Binary Data File control panel.
If data will be read from the study table (the Get Data from Binary Data File checkbox is unchecked), then the text area reads Data from Study Table:
To run principal component analysis (PCA) using the selected bdf file, just click the RUN button next to the PCA popup in the study table. All the PCA options (set in the Statistical Method Preferences control panel) are valid when using bdf files, except for the option to create 3D plots for the samples (scores) and the descriptors (loadings). Upon successful completion of the PCA run, a new bdf file is created that contains all the rows of the original bdf file, but with the original descriptors replaced by the principal components.
Column selection is performed on a single bdf file. However, you can select additional files as collected input to one PCA run by using the file list box in the BDF Preferences control panel. The output loadings can then be applied to each input bdf file in turn. You do not need to generate actual PC columns (with associated disk usage) to use the PCs: PC1, PC2, etc., appear in the column-selection box when the name of the .dep file (below) is input. These can be selected and used in any subsequent analysis, such as select-diverse.
Bdf-based PCA results are available as a set of loadings that can be applied to any other bdf file. These are in a file with the extension .dep. Click the Browse pushbutton near the bottom of the Binary Data File control panel to open the BDF Dependent File control panel, which enables you to load the dependent parameters and derivations. PCA loadings resulting from one bdf file can be applied to any other bdf file without performing any merge. Simply select the alternative bdf file in the Binary Data File control panel and the desired PCA output (.dep file) in the BDF Dependent File control panel.
To run select diverse (distance-based and cell-based) and select similar methods using bdf files, just make sure that the bdf file is selected and open the corresponding control panel (Select Diverse, Cell-Based Selection, and Select Similar, respectively) to carry out the selection. Options that apply to bdf files appear in the control panels.
The Export BDF to Table pushbutton near the bottom of the Binary Data File control panel opens the Export BDF data to Study Table control panel, which allows you to move data from the selected bdf file directly into the study table. Use the radio buttons to input all rows from the file, a range of rows, every Nth row, rows selected at random from the file, or rows specified in a bdf_rows file (see below).
If you want to plot only certain rows, you can select them by using a file of row numbers. Click the 3D Plot from BDF pushbutton on the Binary Data File control panel to open the 3Dplot from BDF control panel. Then choose the Rows from File radio button and enter a filename in the File entry box. The file should contain only row numbers, one per line. A file with the selected BDF rows is automatically created when you run any of these methods: select diverse (distance and cell based), select similar, R-group subsetting. This file is named bdf_rows and is placed in the run directory.
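The row-selection options (a range of rows, every Nth row, random rows, and rows listed in a file) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that rows are numbered from 1 and that every-Nth selection starts with the first row; neither detail is stated in the text.

```python
import random


def row_range(first, last):
    """Rows first..last, inclusive."""
    return list(range(first, last + 1))


def every_nth(n_rows, step):
    """Every step-th row, starting with row 1 (assumption)."""
    return list(range(1, n_rows + 1, step))


def random_rows(n_rows, count, seed=None):
    """count distinct rows chosen at random, returned in ascending order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_rows + 1), count))


def parse_row_numbers(text):
    """Parse the contents of a row file: one row number per line."""
    return [int(line) for line in text.splitlines() if line.strip()]


print(row_range(2, 4))
print(every_nth(10, 3))
print(parse_row_numbers("3\n17\n42\n"))
```

To read an actual bdf_rows file, pass the text returned by `open("bdf_rows").read()` to `parse_row_numbers`.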
1. Before running the Diversity calculation (select diverse, select
similar, R-group subsetting) create a 3D plot for all the rows in
the bdf file.
A selected property can be used to color code the 3D plot by clicking the Color using Selected Property button in the 3Dplot from BDF control panel, then reselecting the three columns you want to plot.
Handling descriptors
You may be able to reduce the set of descriptors used in the analysis and therefore allow more compounds to be handled in the Study Table. The following procedure may be applied to select a reduced set of descriptors for analysis:
1. Start from an empty Study Table.
2. Select the descriptors you want to use.
3. Add these descriptors to the empty Study Table.
5. Perform the PCA on the compound subset.
6. Calculate the correlation matrix between the descriptors and
the principal components.
8. Restart from an empty Study Table.
9. Add the reduced set of descriptors to the empty Study Table.
10. Import the full set of models for analysis.