QSAR



13       Genetic Function Approximation

The genetic function approximation (GFA) algorithm (Rogers and Hopfinger) offers a new approach to building quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) models. Replacing regression analysis with the GFA algorithm allows the construction of models competitive with or superior to those produced by standard techniques and makes available additional information not provided by other techniques. Unlike most other analysis algorithms, GFA provides you with multiple models: a population of models is created by evolving random initial models using a genetic algorithm. GFA can build models using not only linear polynomials but also higher-order polynomials, splines, and other nonlinear functions.

This chapter describes

Overview of genetic function approximation

The genetic function approximation algorithm was initially conceived by taking inspiration from two seemingly disparate algorithms: Holland's genetic algorithm (1975) and Friedman's (1990) multivariate adaptive regression splines (MARS) algorithm.

Genetic algorithms are derived from an analogy with the spread of mutations in a population. In this analogy, "individuals" are represented as one-dimensional strings of bits. An initial population of individuals is created, usually with random initial bits. A fitness function is used to estimate the "quality" of an individual, so that the "best" individuals receive the best fitness scores. Individuals with the best scores are more likely to propagate their genetic material to offspring through crossover, in which pieces of genetic material are taken from each parent and recombined to create the child. After many such mating steps, the average fitness of the individuals in the population should increase, as good combinations of genes are discovered and spread through the population. Genetic algorithms are especially good at searching problem spaces having a large number of dimensions, since they conduct a very efficient, directed sampling of the large space of possibilities.

Friedman's MARS algorithm is a statistical technique for modeling data. It provides an error measure, called the lack-of-fit (LOF) score, that automatically penalizes models with too many features. It also inspired the use of splines as a powerful tool for nonlinear modeling.
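This chapter does not state the formula for the LOF score. A form commonly quoted in the GFA literature is LOF = LSE / (1 - (c + d*p)/M)^2, where LSE is the least-squares error, c the number of basis functions, d the smoothing parameter, p the number of features in the model, and M the number of samples. The Python sketch below assumes that form. The function and argument names are illustrative and are not part of Cerius2.

```python
def lack_of_fit(lse, n_basis, n_features, n_samples, d):
    """Assumed LOF form: LSE / (1 - (c + d*p) / M)**2."""
    penalty = (n_basis + d * n_features) / n_samples
    if penalty >= 1.0:
        raise ValueError("model is too large for the number of samples")
    return lse / (1.0 - penalty) ** 2


# Same least-squares error, different model sizes (hypothetical numbers):
# the larger model receives the worse (higher) LOF score.
small = lack_of_fit(lse=2.0, n_basis=2, n_features=2, n_samples=40, d=1.0)
large = lack_of_fit(lse=2.0, n_basis=8, n_features=8, n_samples=40, d=1.0)
print(f"small model LOF = {small:.3f}, large model LOF = {large:.3f}")
```

Under this assumed form, two models that fit the training data equally well are ranked by size, which is how the score can penalize models with too many features.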

The GFA algorithm uses a genetic algorithm to perform a search over the space of possible QSAR/QSPR models using the LOF score to estimate the fitness of each model. Such evolution of a population of randomly constructed models leads to the discovery of highly predictive QSARs/QSPRs.

The GFA algorithm approach has a number of important advantages over other techniques:

Using genetic partial least squares

The genetic partial least squares (G/PLS) algorithm is included in the genetic analysis module as an alternative to a GFA calculation. G/PLS is derived from two QSAR calculation methods: GFA and partial least squares (pls).

The G/PLS algorithm uses GFA to select appropriate basis functions to be used in a model of the data and pls regression to weight the basis functions' relative contributions in the final model. Application of G/PLS allows the construction of larger QSAR equations while avoiding overfitting and eliminating most variables.

G/PLS is run in the same way as GFA. To set up a G/PLS calculation, you must specify the G/PLS method and certain other settings in the Configure GFA control panel. For more information, see Setting genetic analysis preferences on page 240.

Starting a genetic analysis

Packaged as a separate Cerius2 module, C2·Genetic Analysis can be used as another of the statistical methods available in QSAR+ for generating QSAR equations.

Before you begin

To start a genetic analysis

To use GFA as the statistical method for a QSAR analysis, choose GFA from the Method popup at the top of the study table, then click the Run button on the QSAR tool bar. Configuring GFA options is described in Genetic function approximation on page 208.

Performing a genetic analysis

When C2·Genetic Analysis is installed, the parameters that control the processing performed by the GFA algorithm are set to do a reasonable job of building predictive equations. This section describes the processing that occurs by default when you start a genetic analysis as described in the previous section. This section also points you to other chapters that explain how you can change the built-in settings and control the genetic analysis process.

Performing a genetic analysis using the built-in settings consists of five basic steps.

1. Starting the analysis

Before you start the genetic analysis:

The easiest way to start the analysis is to select GFA as your statistical method and click the Run icon on the QSAR tool bar. This starts a genetic analysis using the default settings.

2. Building the initial population

The analysis begins by building a population of 100 randomly constructed equations. These random equations are displayed in the equation viewer.

You can change the size of the initial equation population, as well as change both the number and type of terms to be used in each of these initial equations. For more information, see Setting genetic analysis preferences.

3. Evolving the population

The initial population is then evolved for 5000 generations. Evolving the population means that, for each generation, two better-scoring equations are selected as parents. Parts of each parent equation are then used to create a child equation. Optional equation-mutation operations may be performed on the child when it is created. The worst-rated equation is then replaced by the new child equation.
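The generation step described above can be sketched in Python as follows. This is a minimal illustration, not the Cerius2 implementation: equations are modeled as lists of term names, the descriptor names and toy score are hypothetical, and a two-way tournament stands in for the parent-selection rule, which this chapter does not specify. Mutation is omitted here.

```python
import random


def evolve_one_generation(population, score, rng):
    """One step: pick two parents, build a child, replace the worst equation.

    population: list of equations, each a non-empty list of term names.
    score: callable returning an error measure (lower is better), such as LOF.
    """
    def pick_parent():
        # Assumption: a two-way tournament approximates "better-scoring".
        a, b = rng.sample(population, 2)
        return a if score(a) <= score(b) else b

    mother, father = pick_parent(), pick_parent()
    # Crossover: leading terms from one parent, trailing terms from the other.
    head = mother[:rng.randint(0, len(mother))]
    tail = father[rng.randint(0, len(father)):]
    child = head + [term for term in tail if term not in head]
    if not child:                 # keep at least one term in the child
        child = mother[:1]
    worst = max(range(len(population)), key=lambda i: score(population[i]))
    population[worst] = child     # the worst-rated equation is replaced
    return child


rng = random.Random(0)
terms = ["LogP", "MW", "HBD", "HBA", "PSA"]          # hypothetical descriptors
population = [rng.sample(terms, 3) for _ in range(10)]


def toy_score(equation):
    # Hypothetical stand-in for LOF: prefers two-term equations using LogP.
    return abs(len(equation) - 2) + (0 if "LogP" in equation else 1)


for _ in range(50):
    evolve_one_generation(population, toy_score, rng)
```

Because only one equation is replaced per generation, the population size stays constant throughout the run.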

You can increase or decrease the number of generations for which the initial equation population is evolved. For more information, see Working with the current equation population on page 238. You can also specify a variety of equation mutations that can occur, as well as determine the probability that each type of mutation is attempted in a generation. For more information, see Setting genetic analysis preferences on page 240.

By changing the smoothing parameter d, you can control the bias in the scoring factor between equations with different numbers of terms. For more information, see Setting genetic analysis preferences on page 240.

4. Reviewing the evolved equations

Finally, the evolved equation population is displayed in the equation viewer, sorted by LOF (that is, lack of fit) score. You can now use the equation viewer to scroll through the equations, look at the statistics associated with each equation, sort the equations by other error measures (LSE and r², for example), and graph various equations.

Genetic analysis produces a graph that shows how frequently each term (that is, descriptor) is used across all equations in the final population.
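The kind of tally behind such a graph can be sketched in a few lines of Python. The equations and descriptor names below are hypothetical, and counting each term once per equation is an assumption about how usage is measured.

```python
from collections import Counter


def term_usage(population):
    """Count how many equations in the population use each term."""
    counts = Counter()
    for equation in population:
        counts.update(set(equation))   # count a term once per equation
    return counts


final_population = [                   # hypothetical final equations
    ["LogP", "MW"],
    ["LogP", "HBD"],
    ["LogP", "MW", "PSA"],
]
usage = term_usage(final_population)
for term, n in usage.most_common():
    print(f"{term:5s} {'#' * n} {n}")
```

Terms that appear in most of the final equations are likely candidates for the descriptors that matter most to the response.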

5. Using the equations

Once the genetic analysis is complete, you can:


Working with the current equation population

As mentioned above, you can choose GFA or G/PLS from the Method popup on the study table tool bar, then click the Run icon.

If you repeat this step, QSAR+ discards the previous equation population, randomly generates a new equation population, and then evolves that new population.

However, if you want to continue the evolution from the point at which the previous evolution left off, or to refresh the current equation population, you need to work with the current equation population rather than start a new run.

This section describes the following activities related to working with the current equation population:

Continuing the evolution of the current population

Randomizing the current population

The information in this section applies to both the genetic partial least squares algorithm and the GFA algorithm.

Before you begin

For more information about either of these tasks, see Chapter 2, QSAR+ QuickStart.

To work with the current equation population

Select Preferences/Statistical Methods... on the study table menu bar and be sure GFA is selected in the Statistical Method popup.

You use this control panel both to continue evolving and to randomize the current population.

Continuing the evolution of the current population

You can continue the evolution of the current equation population, as well as specify the number of generations for which the current equation population will be evolved. By default, the population of equations is evolved for 5000 generations. However, for larger datasets, as well as for situations where you want to perform a more thorough search through the space of possible equations, 5000 generations may not be enough. Or, after performing a preliminary analysis using the default values or adjusted values (as described in Setting genetic analysis preferences on page 240), you may want to evolve a large number of generations over an extended period of time (overnight, for example).

To continue the evolution

1.   In the Generations entry box on the Statistical Method Preferences control panel, enter the number of generations for which you want to evolve the current population.

This value is now also used whenever you choose GFA from the Method popup on the study table and click the Run icon on the study table tool bar.

2.   Click the Run icon. When you do so, the following occurs:

The current population of equations is evolved for the specified number of generations. This evolution occurs according to the default values used in GFA or according to the preferences that you specify. For detailed information about preferences, see Setting genetic analysis preferences on page 240.

The evolved equation population is displayed in the equation viewer. You can use the equation viewer to examine, sort, and graph various equations. For detailed information about the equation viewer, see Chapter 14, Using the Equation Viewer.

Randomizing the current population

To obtain a new randomized set of equations, click the upper More... button on the Equation Viewer control panel and click Delete QSAR Equation Set on the control panel that appears. Then click the Run button on the study table toolbar.


Setting genetic analysis preferences

When C2·Genetic Analysis is installed, the default values that control the processing performed by the genetic function approximation (GFA) algorithm are set to do a reasonable job of building predictive equations. However, you can change these default values to meet your specific requirements. Doing so enables you to exercise detailed control over each analysis that you perform. For example, you can determine the size of the equation population, specify the types of terms that can be used in each equation, select from a variety of equation mutation operations that can occur as the equation population evolves, and specify the probability that each type of mutation is attempted in a generation.

This section explains how you can exercise control over a genetic analysis:

Selecting equation term types

Specifying mutation probabilities

Specifying other genetic analysis preferences

Opening the Configure GFA control panel

You change the default values for genetic analysis by using the Configure GFA control panel. Select Configure on the GENETIC ANALYSIS (GFA) card or select Configure GFA... from the Statistical Method Preferences control panel (be sure the Statistical Method is set to GFA). The Configure GFA control panel allows you to specify the type of terms to be used in equations, mutation probabilities, and other preferences for running the algorithm.

You use this control panel to perform the activities described in the remainder of this section.

Selecting equation term types

You can specify the types of terms that can be used to construct equations both when GFA (or G/PLS) creates a random equation population and when the algorithm attempts to add new terms randomly to child equations through the Add New Term mutation (as described in Specifying mutation probabilities on page 243). By default, only linear polynomial terms are used. This provides for the construction of standard linear equations.

You can choose from among five different types of equation terms, as follows:

Splines are not always useful. If the variables selected are truly linear in their effect on biological activity, splines do not reveal any more-predictive models and may confuse the model building with chance correlations.

The spline terms used in GFA are truncated power splines and are denoted by angle brackets. For example, <f(x) - a> equals zero if the value of (f(x) - a) is negative; otherwise, it equals (f(x) - a). For example, <LogP - 5.5> is zero when LogP < 5.5; otherwise, it is equal to (LogP - 5.5), as shown in the following graph of the truncated power spline <LogP - 5.5>:

The constant a is called the knot of the spline. When a spline term is created, the knot is set using the value of the given feature from a random data sample.

A spline term partitions the data samples into two classes, depending on the value of its feature. The value of the spline is zero for one of the classes and nonzero for the other class. When a spline term is used in a model, the contribution of members of the first class can be adjusted independent of the members of the second class. Thus, regression with splines allows the incorporation of features that do not have a linear effect over their entire range.

Splines are interpreted as performing either range identification or outlier removal:

Range identification -- If there are many members in the nonzero partition, the spline identifies a range of effect. For example, the interpretation of the term <LogP - 5.5> in a model is that only high values of LogP affect the response.

Outlier removal -- If there are only a few members of the nonzero set, the spline identifies outliers. Regression can use the spline term to fit these members independent of the other terms of the model by, in effect, making them special cases based on the extreme value of a feature.
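The truncated power spline and the partition it creates can be illustrated with a short Python sketch. The LogP values are hypothetical and the function name is illustrative. The knot of 5.5 matches the example in the text.

```python
def truncated_power_spline(value, knot):
    """<value - knot>: zero below the knot, (value - knot) at or above it."""
    return max(0.0, value - knot)


logp_values = [2.1, 4.8, 5.5, 6.0, 7.25]   # hypothetical LogP data
knot = 5.5
spline = [truncated_power_spline(x, knot) for x in logp_values]

# Samples in the nonzero partition are the ones the spline term can fit
# independently of the rest of the data.
nonzero = [x for x, s in zip(logp_values, spline) if s > 0]
print(spline)
print(nonzero)
```

Whether the nonzero partition is read as range identification or outlier removal depends on how many samples fall into it relative to the size of the dataset.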

To select an equation term type

Click the appropriate icon.

The icon is highlighted to indicate that equation terms of that type can be used to construct equations.

To deselect an equation term type

Click a highlighted icon to indicate that terms of that type cannot be used to construct equations.

Specifying mutation probabilities

In GFA, mutation is the process of changing a child equation at "birth" to encourage a more thorough search through the space of possible equations that can be constructed. You can choose from a variety of possible equation-mutation operations by specifying the probability that each selected mutation is attempted in a generation.

Probability refers to the percentage of the time after a child equation is created that a mutation is attempted. If the attempted mutation lowers the fitness score of the child equation, that mutation is not kept. Instead, the original child equation is allowed to proceed. This makes high mutation-probability values relatively safe because, at worst, equation-mutation operations can cause no harm.

You can choose from among the following equation-mutation operations:

To specify a mutation probability

Use the slider or the entry box after the name of each equation-mutation operation to specify the probability that the specified mutation is attempted in a generation. The probability value should be an integer between 0 and 100.

Equation-mutation operations with a probability value of 0 are not performed on a child equation.
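The attempt-and-keep rule described in this section can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the score is treated as an error measure where lower is better (as with LOF), and keeping a mutation that leaves the score unchanged is an assumption.

```python
import random


def maybe_mutate(child, mutate, error, probability, rng):
    """Attempt `mutate` on `child` with the given percent probability (0-100).

    The mutated equation is kept only if its error (lower is better) is no
    worse than the original child's; otherwise the original is returned.
    """
    if not 0 <= probability <= 100:
        raise ValueError("probability must be between 0 and 100")
    if rng.random() * 100 >= probability:
        return child                      # mutation not attempted
    candidate = mutate(child)
    return candidate if error(candidate) <= error(child) else child


def add_term(equation):
    # Hypothetical "add new term" mutation.
    return equation + ["MW"]


rng = random.Random(1)
child = ["LogP"]
helped = maybe_mutate(child, add_term, lambda eq: -len(eq), 100, rng)
harmed = maybe_mutate(child, add_term, lambda eq: len(eq), 100, rng)
skipped = maybe_mutate(child, add_term, lambda eq: -len(eq), 0, rng)
```

Because a harmful mutation is discarded, raising a probability only increases how often a mutation is tried and never forces a worse child into the population.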

Specifying other genetic analysis preferences

You can specify other preferences that control the processing performed by the GFA algorithm:

Establishing the population size

Recall that a genetic analysis begins with building a population of randomly constructed equations. This population is built in both the following situations:

You perform this activity to specify the number of equations that make up this equation population.

To establish the population size

Enter the appropriate number of equations in the Population Size entry box. This value takes effect the next time an equation population is built.

Setting the smoothing parameter d

You can control the bias in the scoring factor between equations with different numbers of terms. Through the smoothing parameter d, you adjust the penalty reflected in the score of equations due to their size. For example, you can:

To set the smoothing parameter d

Enter the appropriate value in the Smoothness (d) entry box.

Setting the number of equation terms

The number of terms in randomly constructed equations (that is, the equation length) should be your best estimate of the appropriate equation length.

To set the number of terms

You can do either of the following:

Setting the length of the equation

You can generate equations of specified or unspecified length. The Fixed Length Equations selection is especially useful for the G/PLS method. The LOF error measure applied by the genetic algorithm is not well suited for automatic selection of equation length when pls is used as the internal regression method.

To specify equations of fixed length

Select Fixed Length Equations from the popup. An entry box is displayed that is used to specify the length of the equation.

Setting the regression method

You can generate equations using least-squares or partial least-squares (pls) as the regression method. The latter method allows models with more variables to be generated and is especially useful for extremely wide datasets in which the useful information is spread over a large number of variables, such as those derived from field analysis (MFA).

For the GFA method, the default is least squares. For G/PLS, the default is pls, with four components and no data scaling.

You can change the default using the Statistical Method popup. If you choose PLS, you also can set the number of components and the type of data scaling to perform.


Using genetic partial least squares

Used for generating QSAR models, genetic partial least squares (G/PLS) is derived from two other methods: genetic function approximation (GFA) and partial least squares (pls). Both GFA and pls are valuable analytical tools for datasets that have more descriptors than samples.

The G/PLS algorithm uses GFA to select appropriate basis functions to be used in a model of the data and pls regression as the fitting technique to weigh the basis functions' relative contributions in the final model. Application of G/PLS allows the construction of larger QSAR equations while avoiding overfitting and eliminating most variables.

This section describes

Running a G/PLS calculation (below)

Setting G/PLS preferences

Running a G/PLS calculation

G/PLS is a variation of GFA in the Cerius2·Genetic Analysis module. It is run in the same way as the GFA algorithm.

Before you begin

You should have added molecular structures, biological activity information, and descriptor values to a study table. Additionally, you should have selected the dependent and independent variables for your analysis.

Setting up and running a G/PLS calculation

1.   Select G/PLS from the Statistical Method popup on the Statistical Method control panel or select G/PLS from the Method popup at the top of the study table.

When you select G/PLS, the Statistical Method Preferences control panel's appearance changes.

2.   If you want to adjust the default G/PLS parameters, click Configure GFA.... The Configure GFA control panel appears. For detailed information on preferences, see Setting genetic analysis preferences on page 240.

To change the configuration for G/PLS

3.   After selecting G/PLS as your method, select Configure on the GENETIC ANALYSIS (GFA) card or Configure GFA... in the Statistical Method Preferences control panel. The Configure GFA control panel appears in G/PLS mode.

4.   Select the appropriate preferences for your work and the G/PLS algorithm. For information on setting preferences, see the next section.

5.   When you have completed setting preferences, begin a G/PLS calculation by clicking Run on the study table toolbar.

Setting G/PLS preferences

The G/PLS calculation is specified by selecting the appropriate preferences from the Configure GFA control panel:

1.   Select a value for Randomized Equation Length by entering a number in the entry box. Values between 5 and 15 are typical for a G/PLS run.

2.   Select Fix Equation Length from the popup, then enter a value for the number of components in the equation. This number must be the same as the number you entered in the Randomized Equation Length box.

3.   Choose PLS from the popup at the bottom left of the Genetic Preferences section of the control panel.

4.   Specify the number of components (latent variables) in the components entry box. The more components selected, the more detail is represented in the QSAR model.

5.   Use the popup in the bottom right of the control panel to select the scaling method to apply to variable variances for the PLS calculation:

Scaled -- Normalize all variables to a variance of 1.0.

Unit Scaling -- Scale all variables of the same unit as a group, then equalize the average variance of each group with the average variance of all other groups. This is the most useful method for analysis of data having meaningful differences of variance between variables of the same type, such as field or probe data.

No Scaling -- Leave data as originally calculated.

Scaling is important because pls tries to preserve the difference in variance between variables. In most QSAR datasets, the differences are due primarily to the choice of unit and are not meaningful. Therefore, some scaling generally should be applied.
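The effect of the Scaled option, and the reason unit choice matters, can be illustrated with a small Python sketch. It is not the Cerius2 implementation: the function name is hypothetical, population variance is assumed, and no centering is applied.

```python
from statistics import pvariance


def scale_to_unit_variance(column):
    """Divide a column by its standard deviation so its variance becomes 1.0."""
    variance = pvariance(column)
    if variance == 0:
        return list(column)        # constant column: nothing to scale
    standard_deviation = variance ** 0.5
    return [x / standard_deviation for x in column]


# The same hypothetical measurements expressed in two units.
mass_g = [1000.0, 2000.0, 3000.0]
mass_kg = [1.0, 2.0, 3.0]
print(pvariance(mass_g), pvariance(mass_kg))   # variances differ by unit only

scaled_g = scale_to_unit_variance(mass_g)
scaled_kg = scale_to_unit_variance(mass_kg)
```

After scaling, both columns carry the same values, so the difference in variance that came only from the choice of unit no longer influences the pls fit.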

At this point, you are ready to run a G/PLS calculation.




Last updated May 18, 2000 at 05:51PM Pacific Daylight Time.
Copyright © 2000, Molecular Simulations Inc. All rights reserved.