| QSAR |

A model's ability to provide insight into the system is as important as its predictive ability. Possibly more valuable than being able to predict an activity or property is to know that it increases when a particular descriptor increases.
Finally, validation methods are needed to establish the predictiveness of a model on unseen data and to help determine the complexity of an equation that your amount of data justifies.
This chapter provides information about the following statistical analysis methods available in QSAR+:

Definition of principal components
Suppose the data are represented by a set of p variables X1, ..., Xp. Principal component analysis transforms this set of variables into a (preferably much smaller) set X´1, ..., X´k of linear combinations of the original variables {Xi}, which accounts for most of the variance of the original set. The new variables {X´j} are referred to as principal components and are usually presented in the order of decreasing contribution to the total variance.
Typically, the variables are known only by sampling their values by measurements performed on a collection of n objects. Let us denote the result of measuring the value of jth variable on ith object by Xij. Thus all measurements for all variables can be written in the form of an n X p matrix, which we'll refer to as the property matrix:
Eq. 1
This matrix is what the study table displays, with its columns corresponding to the variables (for example, molecular descriptors), and its rows corresponding to the models whose descriptor values are being measured or evaluated.
The first principal component, X´1, is defined as the linear combination:
Eq. 2
of the original variables {Xi} which maximizes the variance of X´1 subject to the constraint:
Eq. 3
It is convenient to think of the coefficients w1i as components of a column vector:
Eq. 4
where T denotes the vector (or matrix) transpose.
Subsequent principal components are defined analogously. The jth principal component X´j is the linear combination:
Eq. 5
whose variance is maximal under the constraints:
Eq. 6
It turns out that this constrained maximum problem has a solution that can be expressed in terms of the variance-covariance matrix of the original variables {Xi} as follows. Let
denote the sample mean of the variable Xi:
Eq. 7
and let Xm denote the mean-corrected property matrix:
Eq. 8
so that we can define the variance-covariance matrix S as:
Eq. 9
We denote the matrix elements of S by sij which are the sample covariances of Xi and Xj.
Then the coefficients {w1i} of the linear combination defining the first principal component are given as the components of the eigenvector w1 corresponding to the largest eigenvalue
1 of S. Similarly, the jth principal component is defined by the eigenvector wj corresponding to the jth largest eigenvalue of S.
It can be shown (Willett 1987; Everitt 1992) that the variances of the principal components are equal to the corresponding eigenvalues:
Eq. 10
and therefore the jth principal component accounts for the proportion
j/trace(S) of the total variance of the original variables {Xi}.
A variant of PCA in frequent use calculates the principal components from the variables {Xi} after they have been normalized to unit variance. When this is done, the variance-covariance matrix S becomes the correlation matrix R, whose elements are correlations:
Eq. 11
Thus R has ones on the main diagonal and trace(R) = p.
It is customary (Cerius2 conforms to this convention) to calculate principal components so they have zero means. In other words we define X´j as:
Eq. 12
|
For more information about this method, see Everitt and Dunn (1992); Willett (1987).
The goal of cluster analysis is to partition a dataset (typically representing a set of models in a molecular descriptor property space) into classes or categories consisting of elements of comparable similarity. While "similarity" is usually defined precisely, the notion of "comparable" cannot be defined completely since determination of what constitutes "comparable" is in fact a part of the answer cluster analysis seeks to determine. It is thus advisable to have several clustering analysis algorithms available, examining the data from different points of view.
Cluster analysis
The algorithms included in C2·QSAR+ are described below. They assume models are represented by points in multidimensional property space with Euclidean distance between points representing model dissimilarity. The description of each method begins with a list of user-defined parameters followed by the algorithm.
For more information about this method, see Everitt and Dunn (1992); Willett (1987).
Jarvis-Patrick clustering
Parameters
L -- number of nearest neighbors
N -- number of common neighbors
For each model a list of L nearest neighbors is prepared first. The set of models is then partitioned into clusters according to the following criterion: models X and Y are in the same cluster if all these conditions apply:
Variable-length Jarvis-Patrick clustering
Parameters
D -- distance threshold percentage
P -- common neighbors percentage
Mode flag -- either distance range or distance pairs
An enhancement of the previous method, designed to improve nearest-neighbor lists by including all models within user-specified dissimilarity range D. This parameter is understood to be a percentage of either:
In this way neighbor lists of varying lengths are prepared for all models. The models are subsequently partitioned into clusters according to the following criterion: Models X and Y are in the same cluster if:
N -- number of clusters desired
A fast algorithm with only one controlling parameter. The method first selects N models at random as "cluster centroids." Each model is then assigned to the closest centroid and the first clustering of an iterated sequence of clusterings is thus obtained. The iteration consists of two steps:
1. Calculate centroids of current clusters.
2. Assign each model to the closest centroid.
The iteration stops when the classification into clusters has stabilized.
For more information on this method, see Willett (1987).
Hierarchical cluster analysis (HCA)
Parameters
Number of clusters desired or objective function value/percentage (see below for definitions).
HCA methods are two-step procedures:
1. Prepare a nested family of potential clusterings,
2. Query this family and select a suitable clustering.
Step 1
The first step creates a hierarchical structure of clusterings whose visual representation is a dendrogram displaying levels of the hierarchy according to values of an "objective function." The objective function assigns a value to any given partition of the model set and it is this function that distinguishes the four HCA methods.
The nested family of clusterings begins with the set of models clustered into singletons. Each cluster pair is then examined in turn and the pair whose fusion yields the smallest value of the objective function is merged into a new cluster. This process is repeated until all models are merged into one cluster. The dendrogram displays each cluster merge plotted against the corresponding objective function value.
Objective functions of the four HCA methods for a given clustering are as follows.
Intracluster variances summed over clusters,
Smallest distance between clusters, where distance d(A, B) between clusters A and B is defined as:
Eq. 13
And dij is the distance (dissimilarity) between models i and j.
Smallest distance between clusters, where "distance" is defined as the maximum, over all distances between model pairs, one model from each cluster:
Smallest distance between clusters, where "distance" is defined as the arithmetic mean of all distances between model pairs, one model from each cluster.
Eq. 15
where nA and nB are the number of models in clusters A and B.
Step 2
In the second step the dendrogram is queried to select clusterings corresponding to specific values, or percentages of, the objective function. For example, in the simple dendrogram below (representing the result of an HCA performed on a set of five models), the clustering corresponding to the objective function value 4 consists of two groups: models labeled 0, 1, and models labelled 2, 3, 4 by the model counter:
An important tool in model building is regression analysis. An equation -- sometimes more than one -- is produced that relates activity (or other properties) to descriptors. There are two main goals: prediction and experimental design. It is useful to have a model that is predictive (even if imperfect) because it can be used for screening a large set of molecules or proposed molecules for promising candidates. A regression model might be even more useful if it suggests a previously unrecognized correlation between some property (or combination of properties) and activity. This is especially true if we know how to adjust that property by changing some substituent. This can lead to new experiments designed to increase understanding of the system under study.
Regression methods
There is no single method that works best for all problems and that has the perfect balance of predictiveness, interpretability, and computational efficiency. Some examples of these trade-offs would be:
|
The following methods all fit parameters to data so as to minimize the sum-of-squared residual errors. Some of them have the side effect of minimizing or maximizing other quantities at the same time.
|
Simple linear regression (simple)
A linear one-term equation is produced separately for each independent variable. This is useful for discovering some of the most important descriptors. However, it ignores the interaction of multiple descriptors.
Multiple linear regression (linear)
A single multiple-term linear equation is produced. This method requires at least as many molecules as independent variables. To produce reliable results, you typically need 5 times as many molecules as independent variables.
Stepwise multiple linear regression (stepwise)
A multiple-term linear equation is produced, but not all independent variables are used. Each variable is added to the equation in turn. A new regression is performed. The new term is retained only if the equation passes a test for significance (see Validation methods on page 41).
Principal components regression (PCR)
A multiple-term linear equation is created based on a principal-components analysis transformation of the independent variables (see Principal components analysis (PCA) on page 30). The components are chosen so that they retain the largest amount of variance of the independent variables if some of the components are discarded.
The first component is the direction of greatest variance in the independent variables. The next component is the direction of greatest variance that is orthogonal to all preceding independent variables.
Some of the last components are discarded to reduce the size of the model and avoid over-fitting. Normally the number of components kept is determined by crossvalidation (see Validation methods on page 41). Components are added in order, as long as they improve the crossvalidated r2. Variables that are co-linear are replaced by a single variable.
In effect, this method titrates the size of the model to the amount of data available. However, this method does not work well if some of the variables contain a lot of variance but do not correlate with activity (e.g., fingerprint-like descriptors). These variables are given a high loading in the components, pushing out others that are more relevant to activity. This means that, unless your independent variables are pre-screened for relevance, you should probably consider PLS instead.
Partial least squares (PLS)
This procedure is similar to PCR, but the dependent variables are transformed as well as the independent variables. Axes are chosen that maximize retention of variance and correlate dependent and independent variables. More specifically, the covariance of a transformed independent variable with a transformed dependent variable is maximized.
This represents a compromise in goals between maximizing variance and correlation with activity. More than one dependent variable can be present. See Glen et al. (1989) for more information. Because the transformed variables are very difficult to interpret, you should analyze the "loadings" of the original variables to find which are the most important.
A potential problem arises if the information in the descriptors is very diffuse. In this situation, an important transformed component consists of small contributions from a very large number of original variables. These variables might be overlooked during interpretation or in designing the next experiment even though cumulatively they are very important. This phenomenon is known as "loading spread".
Genetic function approximation (GFA)
In this method, models are collected that have a randomly chosen proper subset of the independent variables, and then the collected models are "evolved". A generation is the set of models resulting from performing multiple linear regression on each model; a selection of the best ones becomes the next generation. Cross-over operations are performed on these, which take some variables from each of two models to produce an offspring. In addition, the best model from the previous generation is retained. Besides linear terms there can also be spline, quadratic, and quadratic spline terms. These are added or deleted by mutation operations.
A major advantage of this approach is that a collection of diverse small models is generated that all have roughly the same high predictability. Each of these might provide a different insight into your system. Loading spread does not occur because, at most, one of a set of co-linear variables is retained in each model. This can make interpretation much easier than with PLS.
A disadvantage is that it takes too long to perform crossvalidation on each generation and, thus, you need to have a reasonable idea of how many terms to keep before you start. Another disadvantage, compared to PLS, is that if the information in your system is highly diffuse, you may need to retain more terms in each model than can be determined by the number of molecules. This happens sometimes with 3D QSAR data. See the section on genetic algorithms in Chapter 13, Genetic Function Approximation for more information.
Genetic partial least squares (G/PLS)
This method combines the best features of GFA and PLS.
Each generation has PLS applied to it instead of multiple linear regression, and so each model can have more terms in it without danger of over-fitting. G/PLS retains the ease of interpretation of GFA by back-transforming the PLS components to the original variables.
Once a regression equation is obtained, it is important to determine its reliability and significance. Several procedures are available to assist in this. These can be used to check that the size of the model is appropriate for the quantity of data available, as well as provide some estimate of how well the model can predict activity for new molecules.
Validation methods
Crossvalidation
The crossvalidation process repeats your regression many times on subsets of your data. Usually each molecule is left out in turn, and the r2 is computed using the predicted values of the missing molecules (the crossvalidated r2). Sometimes more than one molecule is left out at a time, though not all possible combinations of molecules are used, a concession that makes the computation tractable. If molecules are removed N at a time from a total set of M, then N X M regressions are performed.
Crossvalidation is often used to determine how large a model (number of terms) can be used for a given dataset. For instance, the number of components retained in PLS can be selected so as to give the highest crossvalidated r2. It is tempting to use this procedure to estimate expected errors in prediction on new molecules. However, since all the data are used in determining terms of the equation, this is not very reliable. If prediction is the main goal of the regression, then some of the data must be reserved strictly for test purposes until the equation is finalized. The reserved data cannot enter the regression process in any form.
Randomization test
Even with a large number of observations and a small number of terms, an equation can still have very poor predictive power. This can come about if the observations are not sufficiently independent of each other.
One way to test for this is by randomization of the dependent variables. The set of activity values is re-assigned randomly to different molecules, and a new regression is performed. This process is repeated many times. If the random models' activity prediction is comparable to the original equation, your set of observations is not sufficient to support your model.
Evaluating QSAR equations
Regression models can be compared using the total sum of squares about the mean,
. The total sum of the squares is composed of two parts:
Eq. 16
As the difference between observed Yi and predicted
approaches zero, or
, the sum of the squares due to regression approaches the sum of the squares about the mean. Thus, the sum of the squares of the residuals can be considered a measure of goodness of fit.
Use caution and do not rely too heavily on goodness of fit; it must have a large number of degrees of freedom associated with it to be used with confidence.
To compare relative values for the total sum of the squares, you must take into account the degrees of freedom associated with them:
When the total sum of the squares is corrected for degrees of freedom, the new term is called the corrected sum of the squares or the mean squares. This term is an important parameter used to evaluate the significance of regression models. Generally, its value should approach the error of measurement in the X data.
The F statistic is one of several variance-related parameters that can be used to compare two models differing by one or more variables. This statistic is used to determine whether a more complex model is significantly better than a less complex one.
The F statistic is computed using the following equation:
Eq. 17
Where:
1 degrees of freedom.
is the residual sum of the squares with
2 degrees of freedom.
is the level of confidence, usually 0.05, that the result is significant.
If X (independent) and Y (dependent) variables are highly correlated, X and Y contain much redundant information. The degree of correlation is measured by the correlation coefficient. You should also know the degree of correlation between independent variables because, in many calculations, it is assumed that they are uncorrelated.
The correlation coefficient is understood most easily using vector notation. If two vectors of the same length are correlated, the angle between them approaches zero and the cosine approaches one.
|
Using the notation from Figure 1, the normalized correlation coefficient between two variables (or vectors) X and Y is computed as:
Eq. 18
The correlation coefficient has direction. Variables that are positively correlated have 0 > r > 1, and those that are negatively correlated have -1< r < 0.
Correlation coefficients for the variables in a dataset are compiled in a correlation matrix. This matrix is a symmetric matrix in which the diagonal elements are one and the off-diagonal elements are the correlation coefficients for the appropriate variable pairs. Variables that are not correlated are said to be orthogonal, and their correlation coefficients are zero. It is assumed that independent variables are orthogonal. If they are not, an unstable regression model can result.
Square of the correlation coefficient
The most commonly quoted statistic used to describe the goodness of fit of data for a regression model is the square of the correlation coefficient, r2. This statistic is defined as:
Eq. 19
As the residual sum of the squares goes to zero, r2 goes to one, and the model gives good predictions. r2 can be computed using crossvalidation methods (CV r2) or bootstrap methods (BS r2). It is also the fraction of the variance explained by the model.
Internal validation uses the dataset from which the model is derived and checks for internal consistency. The procedure derives a new model using a reduced set of structural data. The new model is used to predict the activities of the molecules that were not included in the new-model set. This is repeated until all compounds have been deleted and predicted once. Internal validation is less rigorous than external validation.
External validation evaluates how well the equation generalizes. The original data are divided into two groups, the training set and the test set. The training set is used to derive a model, and the model is used to predict the activities of the test set members.