
By default, QSAR+ displays QSAR-related and statistical information for all entries in the study table. You can limit the information that is displayed in tables and graphs to data in selected rows of the study table. To do this, use the procedure for selecting observations described in Chapter 11, Working with Variables and Observations.
This section describes the procedures for selecting one of the statistical methods available in QSAR+ and setting the associated parameters. Each method is described briefly. For more information about the statistical methods, see Chapter 3, Theory: Statistical Methods; Chapter 14, Using the Equation Viewer; and the statistics papers cited in the References appendix.
Selecting a statistical method
QSAR+ provides several different statistical methods for calculating QSAR equations:
The statistical method that is used to calculate a QSAR equation is determined by the entry that appears in the Method popup next to the RUN button on the study table or on the Statistical Method Preferences control panel. A change in the selection made in either location is reflected in the other.
Genetic function approximation
The genetic function approximation (GFA) algorithm can be used as an alternative to standard regression analysis for constructing QSAR equations. This method provides multiple models that are created by evolving random initial models using a genetic algorithm. Models are improved by performing a crossover operation that recombines terms of better-scoring models. The method is well suited to generating QSAR equations when you are dealing with a large number of descriptors.
GFA can build linear and higher-order polynomials, splines, and other nonlinear equations. Using spline-based terms, GFA can perform a form of automatic outlier removal and classification. The GFA algorithm is packaged as a separate Cerius2 module, C2·Genetic Analysis (GA). For a complete description of this module, see Chapter 13, Genetic Function Approximation.
2. Indicate the number of generations to which the equations are to be evolved.
3. If you want to adjust other parameters, click Configure GFA to open the Configure GFA control panel. For detailed information on preferences, see Chapter 13, Genetic Function Approximation.
The default values for adjustable GFA parameters are appropriate for most situations. However, if you want to make changes before generating a QSAR equation:
Select Configure on the GENETIC ANALYSIS (GFA) card or select Configure GFA from the Statistical Method Preferences control panel. The Configure GFA control panel appears, which allows you to specify the types of terms to be used in equations, mutation probabilities, and other preferences for running the algorithm.
The genetic method produces a graph that shows the frequency with which each term is used in all equations in the final population. For more information on this graph, see Displaying statistical information for exploratory data analysis on page 228.
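The evolve-and-recombine idea described in this section can be sketched in a few lines of Python. This is an illustration only, not the C2·Genetic Analysis implementation: the function names, the 5% per-term penalty used for scoring, the fixed number of terms per model, and the replacement rule are all assumptions made for the example. The sketch also counts how often each descriptor appears in the final population, which is the quantity the term-usage graph displays.

```python
import numpy as np

def score(X, y, terms):
    """Penalized squared error of a least-squares model built from `terms`."""
    A = np.column_stack([np.ones(len(y)), X[:, sorted(terms)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((y - A @ coef) ** 2))
    return sse * (1.0 + 0.05 * len(terms))  # hypothetical size penalty

def random_model(n_desc, n_terms, rng):
    return frozenset(rng.choice(n_desc, size=n_terms, replace=False).tolist())

def crossover(a, b, n_terms, rng):
    """Recombine the terms of two parent models into one child."""
    pool = sorted(a | b)
    return frozenset(rng.choice(pool, size=n_terms, replace=False).tolist())

def mutate(model, n_desc, rng):
    """Swap one term for a descriptor the model does not use yet."""
    unused = [k for k in range(n_desc) if k not in model]
    if not unused:
        return model
    dropped = int(rng.choice(sorted(model)))
    return frozenset((model - {dropped}) | {int(rng.choice(unused))})

def evolve(X, y, n_terms=2, pop_size=20, generations=200, p_mutate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]
    pop = [random_model(n_desc, n_terms, rng) for _ in range(pop_size)]
    pop.sort(key=lambda m: score(X, y, m))
    for _ in range(generations):
        # Parents are drawn from the better-scoring half of the population.
        i, j = rng.choice(pop_size // 2, size=2, replace=False)
        child = crossover(pop[i], pop[j], n_terms, rng)
        if rng.random() < p_mutate:
            child = mutate(child, n_desc, rng)
        # The child replaces the worst model only if it is new and scores better.
        if child not in pop and score(X, y, child) < score(X, y, pop[-1]):
            pop[-1] = child
            pop.sort(key=lambda m: score(X, y, m))
    usage = np.zeros(n_desc, dtype=int)  # term usage in the final population
    for model in pop:
        for k in model:
            usage[k] += 1
    return pop, usage

# Synthetic example: activity depends on descriptors 0 and 3 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=30)
population, usage = evolve(X, y)
print("best model terms:", sorted(population[0]), "usage:", usage.tolist())
```

Because the worst model is the only one ever replaced, the best score in the population cannot get worse from one generation to the next.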
Genetic partial least squares (G/PLS)
Genetic partial least squares (G/PLS) is a variation of GFA that is derived from two methods: genetic function approximation (GFA) and partial least squares (PLS). Both GFA and PLS are valuable analytical tools for datasets that have more descriptors than samples.
The G/PLS algorithm is packaged as part of the Cerius2 module, C2·Genetic Analysis (GA). For a complete description of this module, see Chapter 13, Genetic Function Approximation.
2. If you want to adjust the default G/PLS parameters, click Configure GFA. The Configure GFA control panel appears. For detailed information on preferences, see Setting genetic analysis preferences on page 240.
After selecting G/PLS from the study table Method popup, select Configure on the GENETIC ANALYSIS (GFA) card or select Configure GFA from the Statistical Method Preferences control panel.
Multiple linear regression
The multiple linear regression method calculates QSAR equations by performing standard multivariable regression calculations using multiple variables in a single equation. When you use multiple linear regression, you assume that the variables are independent (not correlated). Also, to minimize the possibility of chance correlations, the number of independent variables initially considered should not be more than one-fifth the number of compounds in the training set; a warning message box appears if this limit is exceeded. When the number of independent variables is greater than the number of observations (rows), multiple linear regression cannot be applied.
To select the multiple linear regression method
Select LINEAR from the Statistical Method popup on the Statistical Method Preferences panel or select LINEAR from the Method popup at the top of the study table.
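As a rough illustration of the constraints described above, the following Python sketch fits a multiple linear regression by ordinary least squares with NumPy. The function name `fit_mlr` and the way the one-fifth guideline is reported are assumptions for the example; they do not reflect how QSAR+ is implemented.

```python
import warnings
import numpy as np

def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... ; returns [b0, b1, ...]."""
    n_obs, n_vars = X.shape
    if n_vars > n_obs:
        # More independent variables than rows: the method cannot be applied.
        raise ValueError("more independent variables than observations")
    if n_vars > n_obs / 5:
        # Guideline from the text: at most one-fifth as many variables as compounds.
        warnings.warn("number of variables exceeds one-fifth of the observations")
    A = np.column_stack([np.ones(n_obs), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Ten compounds, two descriptors, noise-free activity for clarity.
X = np.array([[1., 2.], [2., 1.], [3., 5.], [4., 3.], [5., 8.],
              [6., 2.], [7., 9.], [8., 4.], [9., 7.], [10., 6.]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
coef = fit_mlr(X, y)
print(coef)
```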
Partial least squares
The partial least squares (PLS) regression method carries out regression using latent variables from the independent and dependent data that lie along their axes of greatest variation and are most highly correlated. PLS can be used with more than one dependent variable. It is typically applied when the independent variables are correlated or when the number of independent variables exceeds the number of observations (rows). Under these conditions, it gives a more robust QSAR equation than multiple linear regression. For more detailed information, see the paper of Glen, Dunn, and Scott.
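To make the idea of latent variables concrete, here is a minimal sketch of single-response PLS using the textbook NIPALS procedure in NumPy. It is not the QSAR+ implementation; the function names and the choice of two components are assumptions for illustration. The example deliberately uses more descriptors than rows, the situation in which multiple linear regression cannot be applied.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """NIPALS PLS for one dependent variable; returns (x_mean, y_mean, coef)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                  # direction of greatest covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                     # latent variable (scores)
        tt = t @ t
        p = Xk.T @ t / tt              # X loadings
        c = (yk @ t) / tt              # regression of y on the latent variable
        Xk = Xk - np.outer(t, p)       # deflate X and y before the next component
        yk = yk - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return y_mean + (X - x_mean) @ coef

# Six observations, ten descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=6)
model = pls1_fit(X, y, n_components=2)
fitted = pls1_predict(model, X)
```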
To select the partial least squares method
Select PLS from the Statistical Method popup on the Statistical Method Preferences panel or select PLS from the Method popup at the top of the study table.
Before generating a QSAR equation using the PLS method, set the appropriate parameters:
2. Check one or more of the following checkboxes to indicate the operations that you want QSAR+ to perform:
Principal components analysis
The principal components analysis (PCA) method does not create a model but searches for relationships among the independent (X) variables. It then creates new variables (the principal components) which represent most of the information contained in the independent variables.
This method also creates two new models in the Model Manager: PCA Samples Plot and PCA Descriptor Plot. These display the interrelationships among the samples and descriptors in a visually intuitive manner. The samples plot places each sample at the XYZ coordinates given by its values in the first three principal components. The descriptors plot places each descriptor at the XYZ coordinates given by its contributions to the first three principal components. In these plots, samples or descriptors that lie close together are likely to carry little unique information, while samples or descriptors that lie far from all others may contain unique information.
To select the principal components analysis method: Select PCA from the Statistical Method popup of the Statistical Method Preferences panel or select PCA from the Method popup at the top of the study table.
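The coordinates used by the two PCA plots can be illustrated with a short NumPy sketch based on the singular value decomposition. This is an assumption-laden example, not the QSAR+ code: the function name `pca` is hypothetical, and the data are only mean-centered because the text does not say how descriptors are scaled.

```python
import numpy as np

def pca(X, n_components=3):
    """Return sample scores, descriptor loadings, and explained-variance ratios."""
    Xc = X - X.mean(axis=0)                          # centering only (assumption)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # XYZ of each sample
    loadings = Vt[:n_components].T                   # XYZ of each descriptor
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return scores, loadings, explained

# Six samples described by four descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
scores, loadings, explained = pca(X)
print(scores.shape, loadings.shape)
```

Each row of `scores` gives one sample's position in the samples plot, and each row of `loadings` gives one descriptor's position in the descriptors plot.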
Before running a calculation using the PCA method, set the appropriate parameters:
2. Check one or more of these checkboxes to indicate the operations that you want QSAR+ to execute:
Principal components regression
This method creates two new models in the Model Manager: PCA Samples Plot and PCA Descriptor Plot. These display the interrelationships among the samples and descriptors in a visually intuitive manner. The samples plot places each sample at the XYZ coordinates given by its values in the first three principal components. The descriptors plot places each descriptor at the XYZ coordinates given by its contributions to the first three principal components. In these plots, samples or descriptors that lie close together are likely to carry little unique information, while samples or descriptors that lie far from all others may contain unique information.
To select the principal components regression method: Select PCR from the Statistical Method popup in the Statistical Method Preferences control panel or select PCR from the Method popup at the top of the study table.
Before generating a QSAR equation using the PCR method, set the appropriate parameters:
2. Check one or more of these checkboxes to indicate the operations that you want QSAR+ to execute:
Simple linear regression
The simple linear regression method performs a standard linear regression calculation to generate a set of QSAR equations that includes one equation for each independent variable. Each equation contains one variable from the descriptor set. This method is good for exploring simple relationships between structure and activity. The standard assumptions applied to multiple linear regression also should be satisfied when this method is used (see Multiple linear regression on page 210).
To select simple linear regression
Select SIMPLE from the Statistical Method popup on the Statistical Method Preferences control panel or select SIMPLE from the Method popup at the top of the study table.
Before generating a QSAR equation using the simple linear regression method, make sure that the Plot Regression Equations checkbox is checked if you want all equations to be graphed. When calculations are complete, the graphs are displayed in a window that opens over the model window.
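The "one equation per independent variable" behavior can be sketched in plain Python as follows. The descriptor names and values are made up for the example, and ranking the equations by r² is an assumption about what "best-scoring" means; the software may use a different score.

```python
def simple_fit(x, y):
    """Least-squares line y = slope * x + intercept, with its r-squared."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy * sxy / (sxx * syy)

def simple_equations(descriptors, activity):
    """One equation per descriptor, best r-squared first."""
    eqs = [(name, *simple_fit(col, activity)) for name, col in descriptors.items()]
    return sorted(eqs, key=lambda e: e[3], reverse=True)

# Hypothetical study table: two descriptors and one activity column.
descriptors = {
    "logP": [1, 2, 3, 4, 5],
    "MW": [100, 150, 120, 180, 160],
}
activity = [2.1, 3.9, 6.2, 7.8, 10.1]
equations = simple_equations(descriptors, activity)
for name, slope, intercept, r2 in equations:
    print(f"activity = {slope:.3f} * {name} + {intercept:.3f}  (r2 = {r2:.3f})")
```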
Stepwise multiple linear regression
The stepwise multiple linear regression method calculates QSAR equations by adding one variable at a time and testing each addition for significance. Only variables found to be significant are used in the QSAR equation. This regression method is especially useful when the number of variables is large and when the key descriptors are not known.
If the number of variables exceeds the number of structures, this method should not be used.
To select stepwise multiple linear regression
Select STEPWISE from the Statistical Method popup on the Statistical Method Preferences control panel or select STEPWISE from the Method popup at the top of the study table.
Before generating a QSAR equation using the stepwise multiple linear regression method, set the appropriate parameters:
3. Specify whether you want to run a Forward or Backward regression calculation. In Forward mode, the calculation begins with no variables and builds a model by entering one variable at a time into the equation. In Backward mode, the calculation begins with all variables included and drops variables one at a time until the calculation is complete. Backward regression calculations can lead to overfitting.
The DEFAULTS button returns all selections to their default values.
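Forward mode, as described above, can be sketched with a partial F-test that decides whether each candidate variable is significant enough to enter. The F-to-enter threshold of 4.0 and the function names are assumptions for the example; the significance test and defaults used by QSAR+ are not stated here.

```python
import numpy as np

def sse(X, y, cols):
    """Residual sum of squares for a model with an intercept and columns `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def forward_stepwise(X, y, f_to_enter=4.0):
    """Start with no variables; add the most significant one at each step."""
    n, p = X.shape
    selected = []
    current = sse(X, y, selected)
    while len(selected) < min(p, n - 2):
        best = None
        for c in range(p):
            if c in selected:
                continue
            trial = sse(X, y, selected + [c])
            df = n - len(selected) - 2          # residual degrees of freedom
            f = (current - trial) / (trial / df) if trial > 0 else float("inf")
            if best is None or f > best[0]:
                best = (f, c, trial)
        if best is None or best[0] < f_to_enter:
            break                               # no remaining variable is significant
        selected.append(best[1])
        current = best[2]
    return selected

# Synthetic data: activity depends on descriptors 1 and 4 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=30)
selected = forward_stepwise(X, y)
print("variables entered, in order:", selected)
```

Backward mode would run the same test in reverse, starting from all variables and dropping the least significant one at each step.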
Presenting QSAR statistical results
This section describes the options on the QSAR Preferences control panel that allow you to display detailed statistical results for generated QSAR equations. It also provides brief descriptions of the various statistics that are generated. For more information about the diagnostic statistics, see Chapter 3, Theory: Statistical Methods.
As a QSAR equation is calculated, values for a variety of statistical measures are also generated to help you evaluate the reliability and predictability of the equation. These diagnostic statistics are assembled in a number of tables, plots, and readouts that you can display at the end of a QSAR calculation or, in some cases, at other times when you want to view the data. Diagnostic statistical data displays include:
To open the QSAR Preferences control panel, select Preferences/General... on the study table menu bar or Preferences/General on the QSAR card.
By default, QSAR+ displays QSAR-related and statistical information for all entries in the study table. You can limit the information that is displayed in tables and plots to selected rows of the study table. For information on selecting rows (observations), see Chapter 11, Working with Variables and Observations.
To indicate that you do or do not want QSAR+ to generate a particular statistical data display, check or uncheck the appropriate checkboxes, which specify whether the table or plot associated with that box is displayed automatically at the end of a QSAR calculation.
You can click the ANOVA Table, Beta Coefficient Table, or QSAR Equations options at any time after the calculation of a QSAR equation is complete to display information in the tables.
Analysis of variance (ANOVA) table
The analysis of variance (ANOVA) table includes data from a standard sum of squares variance analysis for regression. This table is not generated if you use GFA as the method to generate the QSAR equation.
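For orientation, a standard sum of squares variance analysis for regression can be computed as in the sketch below. These are the textbook quantities for a least-squares fit with an intercept; the function name and dictionary keys are assumptions, and the layout is not necessarily that of the QSAR+ table.

```python
def anova(observed, predicted, n_vars):
    """Textbook regression ANOVA quantities for a fit with an intercept."""
    n = len(observed)
    mean = sum(observed) / n
    ss_total = sum((o - mean) ** 2 for o in observed)
    ss_resid = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_reg = ss_total - ss_resid          # holds for least-squares fits
    df_reg, df_resid = n_vars, n - n_vars - 1
    ms_reg, ms_resid = ss_reg / df_reg, ss_resid / df_resid
    return {
        "ss_regression": ss_reg, "df_regression": df_reg, "ms_regression": ms_reg,
        "ss_residual": ss_resid, "df_residual": df_resid, "ms_residual": ms_resid,
        "ss_total": ss_total, "F": ms_reg / ms_resid,
    }

# Five observations and the values predicted by a one-variable equation.
table = anova([1, 2, 3, 4, 5], [1.2, 1.9, 3.0, 4.1, 4.8], n_vars=1)
print(table)
```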
The ANOVA table includes the following columns:
Values are reported for the following parameters:
Equation viewer
When the QSAR Equation checkbox is checked, the Equation Viewer window is displayed at the end of a QSAR calculation. The equation viewer provides detailed information about each QSAR equation. An example of the equation viewer is:
If you select simple linear regression or GFA as your statistical method, QSAR+ generates a set of equations and places the best-scoring equation at the top of the list of equations in the equation viewer. You can use the equation viewer to sort the list of equations by various statistical properties, graph the equations, and so on.
Statistics that are reported in the equation viewer are a summary of those reported elsewhere (that is, in the ANOVA table, beta coefficient table, and Cerius2 text window). The type of statistics that appears depends on the method you have used to generate the QSAR equation.
If you use the GFA method, several unique parameters are reported:
Plots
Plots are displayed if their corresponding checkbox is selected in the QSAR Preferences control panel. Two types of plots are available as default options for all statistical methods when the appropriate box is checked: predicted-vs-observed activity plots and residuals plots. These plots are displayed at the end of a QSAR equation calculation.
Additionally, when genetic analysis is run, the software produces a plot of variable usage versus number of crossovers.
Predicted-vs-observed activity plot
The plot of predicted-vs-observed activity displays the actual activity (from non-QSAR sources) against the activity predicted by a QSAR equation. The data are plotted as a scatter plot, with each point representing one structure in the training set of structures. The QSAR equation is plotted as a regression line labeled Predicted = Observed. A sample plot is:
Residuals plot
The Residuals plot displays the residuals (that is, the differences between predicted and observed activities) for the current QSAR equation and set of structures. This plot is a histogram, plotting residual values against observations, each observation representing the data for a single structure. The observation number corresponds to a row number in the study table. A sample plot is:
Variable usage vs. # of crossovers plot
The plot of variable usage vs. number of crossovers shows the frequency with which each variable (that is, descriptor) is used in all the models in the final equation population. To display this plot, open the QSAR Preferences control panel before you run your QSAR calculation and uncheck the Predicted vs. Observed checkbox. This prevents the variable usage plot from being overwritten by a plot of predicted-vs-observed activity. A sample variable usage plot is:
Validating QSAR equations and data
The overall objective of a QSAR procedure is to derive a model that is optimally predictive. That is, the model should provide a reliable estimate of the activity of new or untested compounds similar to those on which it is based. Even a model that does a good job of predicting the activities of the compounds on which it is based must be tested to see whether any of the data in the test set affect the model excessively. This is done using the QSAR validation procedure.
The default validation procedure uses the dataset from which the model is derived and checks the data for internal consistency. The procedure derives new models using reduced sets of observations (rows). Each time a new equation is generated, one row is excluded from the calculation. Each new equation is used to predict the activity of the molecule that was excluded from it. This is repeated until every compound has been excluded and predicted exactly once.
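The leave-one-out loop described above can be sketched with NumPy as follows. The sketch uses ordinary least squares as the model; the function names are hypothetical, and reporting PRESS together with a cross-validated r² is an assumption about which summary values are of interest.

```python
import numpy as np

def fit(X, y):
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def leave_one_out(X, y):
    """Refit with each row excluded once and predict the excluded row."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # every row except row i
        coef = fit(X[keep], y[keep])
        preds[i] = predict(coef, X[i:i + 1])[0]
    press = float(np.sum((y - preds) ** 2))      # predictive sum of squares
    cv_r2 = 1.0 - press / float(np.sum((y - y.mean()) ** 2))
    return preds, press, cv_r2

# Noise-free example: every excluded row is predicted exactly.
X = np.arange(1.0, 9.0).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
preds, press, cv_r2 = leave_one_out(X, y)
print(press, cv_r2)
```

A row whose excluded-row prediction is far from its observed activity is one that the rest of the data cannot account for, which is how this procedure exposes data points with excessive influence.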
When the validation procedure is complete, five statistics (as defined on page 218 and page 219) are calculated and added to the beta coefficient table:
You can use these diagnostic statistics to judge the quality of your original QSAR equation and, using outlier information, to modify and improve that equation (see Working with outliers on page 227).
Setting the default validation option
Each QSAR equation can be validated automatically after it is generated. If the validation option is selected, QSAR+ displays diagnostic statistics in the Cerius2 text window and in the beta coefficient table after an equation is generated. The statistics that are displayed are described in Beta coefficient table on page 218.
To have equations automatically validated
Check the Auto-Validate QSAR Calculation checkbox in the QSAR Preferences control panel.
If the Auto-Validate QSAR Calculation checkbox is unchecked, you can check it after a QSAR equation is generated to validate that equation and to set the automatic validation default for future equations.
Using other validation procedures
If you choose not to automatically validate QSAR equations, you can still run validation on a current QSAR equation without changing the default validation option. You can also run several other procedures to test the validity of your model.
To run a validation procedure, click the Validate QSAR icon on the study table tool bar.
The Validate control panel appears. It offers three validation choices:
a. Repeatedly scrambling the activity data in the study table.
b. Using the randomized data to generate QSAR equations.
-- Number of the trial.
-- Number of random trials.
-- Trial number.
2. Crossvalidation Test -- If you click this button, crossvalidation is performed. This process leaves out the number of samples you specify in the -Fold entry box for each crossvalidation run. In terms of the calculation, the choices are leave-one-out or leave-n-out.
-- Observed activity.
-- PRESS.
3. Validate QSAR Model -- If you click this button, QSAR+ runs the default validation procedure. When this calculation is complete, QSAR+ prints the following information in the text window and adds it to the beta coefficient table:
-- Information about outliers.
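The scrambling steps listed above (repeatedly scrambling the activity data and generating equations from the randomized data) can be sketched in plain Python. The sketch uses a one-descriptor regression and r² as the quality measure; the function names, the default of 19 trials, and the summary count are assumptions for illustration.

```python
import random

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)

def randomization_test(x, y, n_trials=19, seed=0):
    """Compare r2 of the real equation with r2 from scrambled activities."""
    rng = random.Random(seed)
    real = r_squared(x, y)
    random_r2 = []
    for _ in range(n_trials):
        shuffled = list(y)
        rng.shuffle(shuffled)                  # scramble the activity column
        random_r2.append(r_squared(x, shuffled))
    n_as_good = sum(r >= real for r in random_r2)
    return real, random_r2, n_as_good

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
real, random_r2, n_as_good = randomization_test(x, y)
print(f"real r2 = {real:.3f}, trials at least as good: {n_as_good} of {len(random_r2)}")
```

If many scrambled trials score as well as the real equation, the original correlation may be due to chance.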
Working with outliers
An outlier is a structure whose activity is not predicted well by a QSAR equation. As part of the validation process (described in the previous section), QSAR+ generates information about outliers and also highlights outlier rows in the study table.
You can use the information that is generated about outliers to remove them iteratively from the QSAR equation, then recalculate the equation until you are satisfied with the results.
Before you can work with outliers, you must have a validated QSAR equation. The validation process identifies outliers and generates diagnostic data that help you make decisions about them.
By default, QSAR+ displays outlier information for the entire study table. You can limit the display of outlier information to data in selected rows of the study table. To do this, use the procedures for selecting observations described in Chapter 11, Working with Variables and Observations.
To remove outliers, click the Outlier icon on the study table toolbar.
QSAR+ removes the outlier rows from the set of observations used to calculate the QSAR equation and generates a new equation. The rows themselves are not deleted from the study table.
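The distinction between excluding rows from the calculation and deleting them from the table can be illustrated with a small Python sketch. The rule used to flag outliers here (a residual more than two standard deviations from the mean residual) is purely an assumption for the example; the criterion QSAR+ applies is not stated in this section.

```python
def flag_outliers(residuals, n_sd=2.0):
    """Indices of rows whose residual is more than n_sd standard deviations out."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = (sum((r - mean) ** 2 for r in residuals) / (n - 1)) ** 0.5
    return [i for i, r in enumerate(residuals) if abs(r - mean) > n_sd * sd]

# Hypothetical residuals for eight study-table rows.
residuals = [0.1, -0.2, 0.05, -0.1, 3.0, 0.15, -0.05, 0.1]
outliers = flag_outliers(residuals)

# Rows stay in the table; only the set used for the next fit changes.
active_rows = [i for i in range(len(residuals)) if i not in outliers]
print("outliers:", outliers, "rows used for the next equation:", active_rows)
```

Repeating the fit on `active_rows` and flagging again corresponds to the iterative removal described above.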
Displaying statistical information for exploratory data analysis
QSAR+ provides a variety of statistical data about the training set, the independent variables, and the values that are calculated as part of that set. This information can be used to explore and analyze your data.
By default, QSAR+ displays QSAR-related data and statistical information for all entries in the study table. You can limit the information that is displayed in tables and plots to data in selected rows of the study table. To do this, use the procedures for selecting observations described in Chapter 11, Working with Variables and Observations.
Displaying a correlation matrix
The correlation matrix is a Cerius2 table that shows the correlation of one descriptor with another. Using the data in this table, you can examine relationships among descriptors. You also can use the data to modify and improve your QSAR equations. For example, if a QSAR equation has two descriptors that are strongly correlated (as indicated by values near 1 or -1), only one of the descriptors is needed to define the equation. You can modify the QSAR equation by discarding one of the descriptors and generating a new equation (see Chapter 7, Working with Descriptors). Here is a sample correlation matrix:
To display a correlation matrix, click the Correlation Matrix (Corr) icon on the study table toolbar.
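The way a correlation matrix exposes redundant descriptors can be shown with a short plain-Python sketch. The descriptor names and values are invented for the example, and the 0.9 cutoff for "strongly correlated" is an assumption, not a QSAR+ setting.

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation of descriptor a with b."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

def redundant_pairs(matrix, threshold=0.9):
    """Descriptor pairs whose correlation is near 1 or -1."""
    names = list(matrix)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(matrix[a][b]) >= threshold]

columns = {
    "logP": [1, 2, 3, 4, 5],
    "MR": [2.1, 4.0, 6.2, 7.9, 10.0],
    "dipole": [3, 1, 4, 1, 5],
}
matrix = correlation_matrix(columns)
pairs = redundant_pairs(matrix)
print(pairs)
```

For each pair reported, only one of the two descriptors is needed in the equation.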
Displaying a descriptive statistics table
The descriptive statistics table contains data that summarize specific features of the independent variable data set. For each independent variable and for each activity in the table, QSAR+ lists:
Kurtosis describes the thickness of the tails of a distribution curve; skewness describes how asymmetric the distribution of values is.
The parameters that are included in this table provide insight into the variables that are used in a QSAR equation. Here is a sample descriptive statistics table.
To display a descriptive statistics table, click the Summary Statistics icon on the study table toolbar.
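As an illustration of skewness and kurtosis, the following plain-Python sketch computes them from central moments. The moment-based formulas (with kurtosis reported relative to a normal distribution) are one common convention and are an assumption here; the exact estimators and the full set of statistics that QSAR+ reports may differ.

```python
def describe(values):
    """Mean, standard deviation, skewness, and excess kurtosis of one column."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "std_dev": m2 ** 0.5,
        "skewness": m3 / m2 ** 1.5,       # 0 for a symmetric distribution
        "kurtosis": m4 / m2 ** 2 - 3.0,   # 0 for a normal distribution
    }

symmetric = describe([1, 2, 3, 4, 5])
right_skewed = describe([1, 1, 1, 1, 10])
print(symmetric)
print(right_skewed)
```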
Displaying rune plots
You can generate a rune plot for all variables (columns) in the study table. A rune plot visualizes the distribution of values for the specified variable. Generally, normally distributed data result in more meaningful QSAR equations. Here is a sample rune plot:
To display a rune plot, click the Runes icon on the study table toolbar. Each rune is displayed in a different color and represents one independent variable. The color code is listed in the upper-right corner of the plot window.