Generate Hypothesis Workbench
Contents:
Descriptions of menu items used in hypothesis generation:
Overview of hypothesis generation
Catalyst allows you to use structure and activity data for a set of lead compounds to create a hypothesis characterizing the activity of the lead set. A good hypothesis can increase your understanding of the activity of your lead compounds and can be useful for evaluating other similar compounds, while providing insight into new avenues of drug discovery. A successful hypothesis can help you understand the relative importance of different features of a set of leads and provide ideas for future research.
With a hypothesis you can predict the activities of other compounds having the same receptor binding mechanism and score how well the hypothesis explains the activities of each molecule in your training set. You can display and evaluate in three dimensions how the significant features of your lead compounds fit your hypothesis.
Catalyst includes two primary methods for creating hypotheses. The first method is to interactively build a hypothesis by one of several procedures in the View Hypothesis workbench, based upon your knowledge of the significant chemical functions and fragments constituting your target compound. These procedures are explained under To build a hypothesis.
The second method is for Catalyst to automatically generate a hypothesis from a diverse set of lead compounds that have activity data from the same assay. This section describes the automatic tools and the procedures you use in the Generate Hypothesis workbench.
You can run automatic hypothesis generation as a batch job in the background on your computer or another computer on the network. This allows you to use Catalyst for other activities while the background job is running. If you run a background job on a computer that is also being used interactively, memory and processing power must be shared, so operations within the interactive Catalyst session will be slower. If you run the background job on another computer that is not being used for an interactive session, the interactive Catalyst session is not affected.
You can request that Catalyst consider up to five types of functions from the Feature Dictionary for generating a hypothesis. The eleven predefined chemical functions for automatically generating hypotheses are the same ones used in the View Hypothesis workbench for constructing hypotheses:
- Hydrogen-bond acceptor
- Hydrogen-bond acceptor (lipid)
- Hydrogen-bond donor
- Hydrophobic
- Hydrophobic (aliphatic)
- Hydrophobic (aromatic)
- Negative charge
- Negative ionizable
- Positive charge
- Positive ionizable
- Ring aromatic
Please see Description of Catalyst's predefined chemical functions.
You can modify the definitions of the predefined chemical functions and use them in generating hypotheses. However, database searching is much slower when a hypothesis includes user-defined functions.
In general, the best hypotheses have five features, but only two or three types of features. If possible, you should use what you know about the characteristics of your lead-compound set to reduce the number of types of features considered during hypothesis generation.
Automatic generation of hypotheses includes three phases:
- Prepare your set of lead compounds and generate conformers.
- Prepare your input spreadsheet by entering the lead compounds and their activity data.
- Set up a background process for hypothesis generation.
Preparation for hypothesis generation
For a general description of how to use hypotheses, see Introduction to hypotheses and constraints. Once you understand how you can use hypotheses, you can prepare to generate a hypothesis automatically from a training set of compounds, as follows:
- Create a lab. Use the Create Lab menu item in the Stockroom to create a place to store your training set. Give the lab a unique name. See Stockroom and labs for more information.
- Get training set compounds into Catalyst. Use one of these methods:
- Build the structures in the View Compound workbench (see To build a molecule).
- If you have converted your company database to Catalyst format and your training set compounds are included in this database, search for the compounds either by name with the Data/Find menu item or with the Tools/Fast Flexible Search Databases/Spreadsheets menu item in the View Database workbench (see Introduction for a comprehensive list of topics on building and maintaining databases).
- Save lead compounds from other database programs as MOL-format files. Or look for lead compounds previously exported from Catalyst (Export menu item). Then use the Import menu item to bring the compounds into Catalyst (see To import objects one at a time).
- Run Catalyst on an X-terminal emulation on a Macintosh computer also running ChemDraw Plus™. Copy SMILES strings for each lead molecule in ChemDraw (or another Mac application), then use the Catalyst Paste menu item to bring them into the View Compound workspace (see Displaying Catalyst on a Macintosh X terminal).
- Save each of the lead compounds in the lab you created.
- Check functions and ionization state. The standard functions listed in the Feature Dictionary and used by default for generating hypotheses assume a pH of 7.0. Use the Show Function Mapping menu item in View Hypothesis on each training set molecule to make sure that each function you intend to use maps the training set molecules properly and that the ionization state is appropriate. If the function is not identified properly, you can edit the function definition in the View Hypothesis workbench (see To build a hypothesis).
If a lead molecule has a different ionization state in its active form, you can create the ionized molecule by editing in the View Compound workbench (see Ionizing and deionizing atoms).
- Display appearance. Use the Tools/2D Beautify and Tools/3D Minimize menu items to correct the display of each molecule. Be sure to save any changes if you want them to be permanent.
- Structure and stereochemistry. Check that the 3D structure and stereochemistry are correct for each compound before proceeding to the next step. If the molecule is distorted or has incorrect stereochemistry, make the necessary corrections. Use the Set Stereochemistry tool in the View Compound toolbox (see How to use the toolbox). Save each modified compound to the shelf.
- Generate a conformational model. Use the Tools/Generate Conformational Model menu item of the View Compound workbench. You should use Best conformer-generation mode, which for 80 conformers could take approximately 30 min for compounds of about 80 atoms. The conformer-generation program automatically creates the number of conformers needed to cover the conformational space of the molecule (see Introduction to conformer generation).
Upon completing conformer generation in Catalyst, the new conformers become part of the compound and can be viewed and evaluated with the Tools/Show Conformational Model menu item. If conformers are generated in a background process, they can be brought into Catalyst using the Data/Process Information menu item.
- Now you are ready to set up your Generate Hypothesis workbench (see Introduction to the tabular report/spreadsheet).
To set up and generate a hypothesis automatically
After preparing your lead set of molecules, generating a conformational model for each one (see Preparation for hypothesis generation), and putting activity data into your spreadsheet, you choose the functions to be used and then set up a background process to generate a hypothesis:
- In the Generate Hypothesis workbench, check that your spreadsheet lists all compounds that you want to use for generation and that their tested activities and uncertainties are correct. This input spreadsheet should be saved to the shelf.
The compounds in your lead set do not need to be on the shelf in the Generate Hypothesis workbench, but they must be in a lab or the Stockroom.
- Select (from the shelf) the saved spreadsheet containing the lead set compounds and their activity data.
- Select the Tools/Generate Hypothesis menu item. A control panel appears.
- Enter the name you want to give to the generated hypothesis in the Output Hypothesis entry box.
The next step is to choose the set of chemical functions that you want to be considered during hypothesis generation. Use the Feature Selection portion of the control panel (shown below) to choose a maximum of five types.

- If you know which types of functions are likely to be a significant part of your hypothesis, select them one at a time from the Dictionary list box and then click the Add button.
The hypothesis-generation program will analyze various combinations of function types (within the maximum and minimum limits set) and choose the combination of functions that best accounts for the structures and activities in the lead set. The time required to run the generation program increases substantially as the number of function types increases, so choose as few function types as possible that can still provide a good description.
One way of identifying the "important" functions is to drag the most active training set compound into the View Hypothesis workbench and use the Show Function Mapping menu item to indicate what functions are represented in the molecule. This can be used as an indicator of which functions may be important in contributing to the activity of this and other molecules (see the Show Function Mapping menu item description).
Recommendations for choosing functions for the hypothesis. Choose the NEG CHARGE function if you have full negative charges in your training set molecules and the NEG IONIZABLE function if you have protonated acidic functions in your training set, but not both types. In a similar manner, use either POS CHARGE or POS IONIZABLE function, but not both. Also, use either HB ACCEPTOR or HB ACCEPTOR (lipid) function, but not both.
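The selection rules above (at most five function types, and no more than one member of each paired alternative) can be checked mechanically. The following Python sketch is illustrative only: the function name and constants are hypothetical and are not part of Catalyst, and the function labels are simply the names quoted in the text.

```python
# Illustrative only: these names mirror the text, not a Catalyst API.
MAX_FEATURE_TYPES = 5

# Pairs of function types that the text recommends not selecting together.
EXCLUSIVE_PAIRS = [
    ("NEG CHARGE", "NEG IONIZABLE"),
    ("POS CHARGE", "POS IONIZABLE"),
    ("HB ACCEPTOR", "HB ACCEPTOR (lipid)"),
]


def check_feature_selection(selected):
    """Return a list of warning strings for a proposed set of function types."""
    chosen = set(selected)
    warnings = []
    if len(chosen) > MAX_FEATURE_TYPES:
        warnings.append(
            f"{len(chosen)} types selected; the maximum is {MAX_FEATURE_TYPES}"
        )
    for first, second in EXCLUSIVE_PAIRS:
        if first in chosen and second in chosen:
            warnings.append(f"use either {first} or {second}, not both")
    return warnings
```

An empty list means the selection is consistent with the recommendations. For example, a selection containing both NEG CHARGE and NEG IONIZABLE produces one warning.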
As you select each function, its name appears in the Selected Function Definitions list box. Next to the name are default values for the minimum and maximum number of instances of the function allowed in the hypothesis.
- If you want to delete a function from the selection list, select its name on the Selected Function Definitions list and then click Remove.
- To change the default maximum or minimum number of instances for a function that you want to include in the hypothesis, select the Edit button.
The Feature Editor control panel appears, with the present values of Minimum count and Maximum count.
A minimum count of 1 means that you want at least one of a particular function. You can limit the number of a given type in the generated hypothesis. For example, you could specify a maximum count of 2 for HB DONOR functions.
(Generally the defaults are appropriate for the initial hypothesis.)
Limit on location constraints in hypothesis generation. It is important to understand that you may have up to five features in a generated hypothesis and that each feature must have at least one location constraint. But the maximum number of location constraints in a generated hypothesis is seven. In addition, to properly characterize vector features such as hydrogen-bond donors and acceptors, they must have a location constraint at both the heavy atom location and projected point location. Therefore, if you have two vector features with a total of four location constraints, you can have only three other nonvector features to stay within the limit of seven location constraints.
Be aware also that the simplest generated hypotheses are 1) a null hypothesis, which consists of an average activity estimate and no functions, and 2) a hypothesis with four location constraints (two vector functions or four nonvector functions, for example).
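The arithmetic behind these limits can be written out as a short Python sketch. It assumes that each vector feature uses exactly two location constraints and each nonvector feature exactly one; the function names are hypothetical and chosen only for illustration.

```python
MAX_FEATURES = 5               # most features allowed in a generated hypothesis
MAX_LOCATION_CONSTRAINTS = 7   # most location constraints allowed


def location_constraints(n_vector, n_nonvector):
    """Count location constraints: two per vector feature, one per nonvector."""
    return 2 * n_vector + n_nonvector


def fits_limits(n_vector, n_nonvector):
    """True if the feature mix respects both limits described in the text."""
    n_features = n_vector + n_nonvector
    n_constraints = location_constraints(n_vector, n_nonvector)
    return n_features <= MAX_FEATURES and n_constraints <= MAX_LOCATION_CONSTRAINTS
```

With two vector and three nonvector features the count is exactly seven, so the mix is allowed. Three vector and two nonvector features would need eight location constraints and is therefore over the limit.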
- Specify the Total Features in the hypothesis by entering the minimum (Min) and maximum (Max) limits. The hypothesis generated will not have fewer than the Min value (range 1-5) nor more than the Max value (range 1-5).
- Hypothesis generation has the following parameters that can be reviewed and modified after selecting the More Hypothesis Options button:
- WeightVariation. The default value of this parameter is 0.302, representing the expected standard deviation of the feature weights for "good" hypotheses. Higher values for this parameter favor hypotheses with feature weights that differ widely from the ideal value, which is currently 2.0 for each feature weight. Lower values favor hypotheses with feature weights that differ little from the ideal value of 2.0. If you set this parameter too high, hypothesis generation could produce hypotheses with feature weights that are too large to be chemically reasonable and more likely to overfit. In effect, this parameter represents the penalty for a feature's weight deviating from the expected weight and restricts an "unreasonable" contribution to estimated activity by any particular feature. Feature weights represent the orders of magnitude of activity to which any hypothesis feature contributes. All features have the same weight in generated hypotheses.
- MappingCoeff. The default value of this parameter is 0, which means that it is turned off and is not used in hypothesis generation. The MappingCoeff is a penalty for hypotheses for which training set molecules that are topologically similar map in different ways to the hypothesis. Increasing the MappingCoeff to values greater than 0 favors hypotheses for which topologically similar molecules map to the hypothesis in similar ways. We recommend that you use the default value.
- Spacing. The default value of this parameter is 297 pm (picometers). The Spacing parameter lets you specify the minimum distance between actual feature locations in training-set molecules to identify candidate hypotheses. For example, for a negative-charge feature this location is the position of the charged atom in a conformer. To be considered a candidate, a hypothesis must fit the most active molecules in the training set at least partially. Only configurations of features in the molecules with at least this distance between actual feature locations are considered when identifying candidate hypotheses.
The default of 297 pm (2.97 Å) prevents both oxygens in a carboxyl group from being simultaneously assigned as hydrogen-bond acceptors. The default value works well for most medium-to-large molecules, but is not good for small molecules that do not have many features. If you are interested in hypotheses where features are close together, you should set this parameter to a small number such as 5 pm.
- MinPoints. The default value of this parameter is 4, meaning a minimum of 4 individual feature components for a generated hypothesis. For example, a hypothesis with two hydrophobes and two negative ionizables has four points. A hypothesis with two hydrophobes and one hydrogen-bond acceptor also has four points, since the hydrogen-bond acceptor feature has both an atom and a projected point. The default value of 4 is the minimum value for an enantio-selective hypothesis. However, for very small rigid molecules that have few chemical features, you might need to set MinPoints to 3.
If MinPoints or MinSubsetPoints (below) is set too high for your training set, the hypothesis-generation job will terminate within 30 min, returning only the null hypothesis in the log file. If this happens, you should re-examine your training set and experiment with a lower setting for MinPoints or MinSubsetPoints. For example, if the hypothesis-generation run gives only the null hypothesis when both MinPoints and MinSubsetPoints are set to 4, you should try changing the setting for MinSubsetPoints to 3. If one of the most active compounds in your training set is a small rigid molecule with no more than 4 features, hypotheses cannot be generated unless the value of the MinSubsetPoints parameter is decreased to 3. You might even have to lower MinPoints to 3 and MinSubsetPoints to 2 in order to generate hypotheses from a training set in which the most active lead has only 3 features. To generate hypotheses from such a training set, all other compounds among the top ten most active must have at least 2 features in common.
- MinSubsetPoints. The default value for this parameter is 4. To be considered, a hypothesis must fit the most active molecules in the lead set at least partially. Only configurations of features in lead set molecules with at least the number of points specified by this parameter are considered when identifying a candidate hypothesis. The default value works well for most training sets, but if the molecules are small, rigid, and have few features, you might need to change this value to 3.
- VariableWeight. This parameter is used to select the variable weight mode of HypoGen. The default is 0 or standard mode. Set this value to 1 to use the variable weight mode. In this mode, HypoGen will allow the individual feature weights to vary during the optimization.
- VariableTolerance. This parameter is used to select the variable tolerance mode of HypoGen. The default is 0 or standard mode. Set this value to 1 if you wish to use the variable tolerance mode. In this mode, HypoGen allows the individual feature tolerances to vary during optimization.
All hypothesis-generation parameters have default values optimized for an initial hypothesis generation process and should not be changed until you review at least one set of results.
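As a compact summary, the default values quoted above can be collected in one place, and the effect of the Spacing parameter can be illustrated with a simple distance check. This Python sketch is not Catalyst code: the class, the field names, and the function are hypothetical, and the check assumes feature locations are given as Cartesian coordinates in angstroms.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist


@dataclass
class HypoGenOptions:
    """Defaults as quoted in the text; the class itself is hypothetical."""
    weight_variation: float = 0.302
    mapping_coeff: float = 0.0
    spacing_pm: float = 297.0
    min_points: int = 4
    min_subset_points: int = 4
    variable_weight: int = 0      # 0 = standard mode, 1 = variable weight
    variable_tolerance: int = 0   # 0 = standard mode, 1 = variable tolerance


def satisfies_spacing(locations_angstrom, spacing_pm):
    """True if every pair of feature locations is at least spacing_pm apart."""
    spacing_angstrom = spacing_pm / 100.0  # 100 pm = 1 angstrom
    return all(
        dist(a, b) >= spacing_angstrom
        for a, b in combinations(locations_angstrom, 2)
    )
```

For example, two acceptor locations placed 2.2 Å apart (an illustrative separation for the two oxygens of a carboxyl group, not a value taken from this manual) fail the check at the default of 297 pm but pass it when Spacing is reduced to 5 pm.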
- The next part of the setup process is to choose the computer on which to run the background task and the time to start, using the Job Options part of the control panel:

Keep in mind that a simple hypothesis-generation run can take more than 12 hours on an Indigo-class computer. If you generate a hypothesis in the background on a computer while it is being used interactively, the demands of hypothesis generation will noticeably slow other operations.
Choose to run on your computer by selecting the Locally button. Or choose to run on a different computer on the network by selecting the Remotely on button. The names of the available computers on the network are listed in the list box.
When you select a host computer, its name is highlighted on the list and displayed in the Remote Host text box.
- Enter values for Start Time, Queue After, Process Name, and Local Directory, as described below:
- Start Time. You can specify a time for the process to begin, by entering a starting time or by selecting a process name in the Queue After list. To specify a start time, enter the value as a time, optionally followed by a date.
Format for Start Time. You can give the time as an hour or as an hour and minutes. Give just the hour as a one- or two-digit number. Give the time in hours and minutes as a four-digit number. You can also separate the hours and minutes by a colon, using as many digits as needed. You can optionally add am or pm after the time. If you do not do so, the time is assumed to be in 24-hour clock time. You can also give the time as now, noon, or midnight.
Format for date. A date in the Start Time box is optional. If you do not give it, Catalyst interprets the date to be the next time the specified time occurs. The date can be any month and day combination, a day of the week, today, or tomorrow.

Both the month and the day of the week can be spelled out in full or as a three-letter abbreviation. The month and day are not case sensitive; that is, you can type either upper- or lowercase characters when you specify months and days.
- Queue After. To schedule a background process to run after a previously scheduled background job has finished, select it by clicking its name in the Queue After list. If you do not select a process name in the Queue After list, Catalyst will run the job at the time you specify in the Start Time text box.
The Queue After list shows each job that has been scheduled to start at a particular time or has been queued to start after some other job and which has no job scheduled to run after it. It does not list scheduled jobs that already have jobs scheduled to follow them. To set the current scheduling, the correct process must be selected and highlighted in the Queue After list when you click the OK button at the end of the setup process. (When a task is properly selected to queue after another job, any time entered in the Start Time text box is ignored.)
To deselect a selected process in the Queue After list, click the blank space at the end of the list.
- Process Name. Catalyst displays a unique default name for the process, but you can change it. Click in the text box to make it active, press the <Backspace> or <Delete> key to remove unwanted characters, and enter your name for the background process.
- Temporary Directory. Catalyst creates a new subdirectory in your current directory and gives it the name in the Temporary Directory box. If you select Run Remotely, Catalyst also creates a subdirectory in the specified remote directory and gives it the same name. Catalyst places all the files necessary for running the background process and then stores the results in the temporary directory on the host computer. Change the name if you want a different designation.
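The time-of-day forms accepted in the Start Time box, as described above, can be summarized in a short Python sketch. It is an illustration of the written rules only, not the parser Catalyst uses; the function name is hypothetical, and the optional date and the keyword now are deliberately left out.

```python
import re

# One- or two-digit hour, four-digit hhmm, or h:m, with optional am/pm.
_TIME_RE = re.compile(
    r"^(?:(?P<h>\d{1,2})|(?P<hhmm>\d{4})|(?P<ch>\d{1,2}):(?P<cm>\d{1,2}))"
    r"\s*(?P<ampm>am|pm)?$"
)


def parse_time_of_day(text):
    """Return (hour, minute) on a 24-hour clock for the forms in the text."""
    word = text.strip().lower()
    if word == "noon":
        return (12, 0)
    if word == "midnight":
        return (0, 0)
    match = _TIME_RE.match(word)
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    if match["hhmm"] is not None:
        hour, minute = int(match["hhmm"][:2]), int(match["hhmm"][2:])
    elif match["ch"] is not None:
        hour, minute = int(match["ch"]), int(match["cm"])
    else:
        hour, minute = int(match["h"]), 0
    suffix = match["ampm"]
    if suffix is not None:
        if not 1 <= hour <= 12:
            raise ValueError(f"hour out of range for am/pm: {text!r}")
        # 12 am is midnight (0), 12 pm is noon (12).
        hour = hour % 12 + (12 if suffix == "pm" else 0)
    if hour > 23 or minute > 59:
        raise ValueError(f"time out of range: {text!r}")
    return (hour, minute)
```

Under these rules, 8 and 1430 are read on the 24-hour clock, while 8:30 pm is converted to 20:30.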
- When the parameters are correctly entered, select the OK button. If you want to close the control panel without doing anything, select the Cancel button.
- Recheck that all parameters are correctly set in the Generate Hypothesis control panel, then select the Generate button. The necessary directory and files are set up to run the HypoGen program. During the setup time, even if you sent the job to a remote computer, the Catalyst interface is grayed out and the cursor is displayed as a clock symbol. When the setup process successfully completes, an Alert message informs you that the setup is completed and the process will be started at the requested time. (See When a background process starts for additional information.)
To check the status of the background process and manage the data generated, see the Process Information menu item.
- HypoGen calculates several cost parameters that can be used to estimate the likelihood of generating valid hypothesis models. These cost values are generated within the first 15 min of the job and can be viewed in the *.log and *.full files found in the run directory.
Quitting Catalyst when a background process is scheduled or running
After setting up a background process, you can exit Catalyst and even log off your computer. That is, terminating your Catalyst session or logging out does not terminate the background process, nor does the background process require you to be logged on or to have an active Catalyst session for it to run as scheduled.
Items in the Generate Hypothesis workbench
The Generate Hypothesis workbench provides tools that can generate and analyze hypotheses and compare them with sets of lead compounds. The functions of the parts of the workbench are described here.
Shelf
The shelf in the Generate Hypothesis workbench holds hypotheses, compounds, and spreadsheets needed in hypothesis-generation operations.
Menu bar
The Generate Hypothesis workbench's menu bar provides functions appropriate for generating and analyzing hypotheses:
Status area
The status area of the Generate Hypothesis workbench provides brief reports on the status and results of operations in the workbench, including reports on the number of entries in the tabular report/spreadsheet.
3D workspace
The 3D workspace in the Generate Hypothesis workbench is a display area for viewing and analyzing compounds and hypotheses. You can also do some editing of hypotheses in the workspace.
Toolbox
The toolbox of the Generate Hypothesis workbench provides tools relevant to working with hypotheses: Tether, Measure, Erase, Fit to Window, Tile Objects.
QuickTool area
The QuickTool area of the Generate Hypothesis workbench enables you to use the View Hypothesis QuickTool. See Introduction to the QuickTools.
Entry boxes
The Edit, Set Activity, and Current Hypothesis entry boxes are located between the 3D workspace and the tabular report, as is the Set Activity button. They are used as follows:
Edit entry box
Use the Edit entry box for entering and modifying data in the tabular report/spreadsheet, as follows:
- To modify data cells in tabular reports, click the data cell that you want to change. Its current value is shown in the Edit entry box.
- Click in the Edit entry box. The cursor changes from a caret to a blinking vertical bar.
- Delete the displayed data in the Edit entry box and type in the new data.
- Press the <Enter> key on the keyboard to enter the new data in the cell.
Set Activity entry box and button
The Set Activity controls allow you to select and use a different activity property for which you have training-set data. By default, the value in the Activ box is used for Score, Regress, and Generate Hypothesis actions. To use a different value for an activity or a different name:
- First add the new activity property to your Property Dictionary, using the Databases/Edit Property Dictionary menu item in the Stockroom.
- Then, to change the activity property to be used, click the Set Activity button in the Generate Hypothesis workbench.
The Set Activity Property control panel appears.
- Click the name of the new activity property in the list box.
- Select the Set button. The control panel is closed and the name of the new activity property is shown in the text box as the current activity property.
Current Hypothesis
The Current Hypothesis entry box displays the name of the hypothesis to be used in fit operations, if one has been selected. The current hypothesis selection is used by the Tools/Show Selected Compounds/Mappings menu item, which can also be called by double-clicking one of the rows in the spreadsheet. (Other actions that require a hypothesis selection look for a selected hypothesis on the shelf.)
To select a current hypothesis, drag a saved hypothesis and drop it into the entry box. Its name is displayed, and it now can be used for quickly performing fits to compounds in your spreadsheet.
To remove a current hypothesis designation, use the Edit/Clear Current Hypothesis menu item.
Tabular report/spreadsheet
The unique features of the Generate Hypothesis workbench are related to its tabular report and spreadsheet format for input and output of compound and hypothesis information:

The data cells are arranged in columns in the Generate Hypothesis tabular report/spreadsheet. Some columns contain fixed cells, which means they always display the same property and you cannot modify them (such as Row and Name).
Some columns contain variable cells, which you can change to display a different property (such as Activ, Uncert, and Mol Wt). To display a different property in a variable-cell column, double-click the column header. If the cell is variable, a Change Report Property control panel appears. Select the new property to be displayed in the column from the list of properties and then click the Change button. The control panel is closed and the new property name appears in the column header in the tabular report.
The default cells in the tabular report are defined and used as follows:
- Row. Number of the row in the report (automatically entered with compound). To select a row, click the row number.
To display the compound in the 3D workspace, double-click the appropriate row number (if no current hypothesis is selected). If a current hypothesis is selected (its name appears in the Current Hypothesis text box), double-click the row number to perform a quick Compare/Fit on the compound with the current hypothesis. See Comparing a compound and a hypothesis.
- Name. Compound name (automatically entered when the compound is dragged into the report).
- Activ. The biological activity of the compound (automatically entered from StockroomDB, if available, or typed in as input data).
- Uncert. The uncertainty in the value of the compound activity, a ratio of the reported value to the minimum and maximum values (automatically entered as default of 3.00).
- Color. Unique color applied to the compound for identification when displayed during fits with hypotheses in 3D workspace (automatically assigned when the compound is entered, but the color can be changed by clicking a color cell and using the color wheel control panel).
- Estimate. The estimated activity of the compound based upon the generated hypothesis (output from program).
- Error. A measure of how accurately the hypothesis estimates the activity of the compound. Computed as the ratio of the tested activity to the activity estimated by the hypothesis, or the inverse if Estimate is greater than Activ (output from program).
- Mol Wt. Molecular weight of compound (automatically computed and entered).
- Mol Formula. Chemical formula of compound (automatically computed).
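The Uncert and Error cells described above are both simple ratios. The Python sketch below shows one reading of those definitions: it assumes that an uncertainty of 3.00 means the true activity lies between one third of and three times the reported value, and that activities are positive. The function names are hypothetical and are not part of Catalyst.

```python
def activity_range(activ, uncert=3.0):
    """Return the (minimum, maximum) activity implied by an uncertainty ratio."""
    return (activ / uncert, activ * uncert)


def error_ratio(activ, estimate):
    """Ratio of tested to estimated activity, inverted when the estimate is
    larger, so the result is never less than 1."""
    return max(activ / estimate, estimate / activ)
```

For a compound with a tested activity of 30 and the default uncertainty, the implied range is 10 to 90. A tested activity of 10 with an estimate of 40 gives an error of 4, as does a tested activity of 40 with an estimate of 10.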
How to use the workbench
To open a Generate Hypothesis workbench
To open a Generate Hypothesis workbench, click the Generate Hypothesis button in the toolbar.
Setting up for and generating hypotheses
Using Hypotheses
Other Tasks
Last updated April 2000.
Copyright © 1997-2000 Molecular Simulations, Inc. All rights
reserved.