
Genetic algorithms are derived from an analogy with the spread of mutations in a population. In this analogy, "individuals" are represented as a one-dimensional string of bits. An initial population of individuals is created, usually with random initial bits. A fitness function is used to estimate the "quality" of an individual, so that the "best" individuals receive the best fitness scores. Individuals with the best scores are more likely to propagate their genetic material to offspring through crossover, in which pieces of genetic material are taken from each parent and recombined to create the child. After many such mating steps, the average fitness of the individuals in the population should increase, as good combinations of genes are discovered and spread through the population. Genetic algorithms are especially good at searching problem spaces having a large number of dimensions, since they conduct a very efficient, directed sampling of the large space of possibilities.
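As an illustration only, the following Python sketch shows a generic bit-string genetic algorithm of the kind described above. It is not the code used by C2·Genetic Analysis. The toy fitness function (counting 1 bits), the two-way tournament used to favor better individuals, and the rule that a child replaces the worst individual only when it is at least as fit are all assumptions made for this example.

```python
import random

def one_max(bits):
    """Toy fitness function (an assumption): the number of 1 bits."""
    return sum(bits)

def crossover(mother, father, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(mother))
    return mother[:point] + father[point:]

def pick_parent(population, fitness, rng):
    """Fitness-biased choice: the better of two randomly drawn individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def run_ga(n_bits=20, pop_size=30, steps=500, seed=0):
    """Return the average fitness before and after the mating steps."""
    rng = random.Random(seed)
    # Initial population: individuals with random initial bits.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    start_avg = sum(map(one_max, population)) / pop_size
    for _ in range(steps):
        child = crossover(pick_parent(population, one_max, rng),
                          pick_parent(population, one_max, rng), rng)
        worst = min(range(pop_size), key=lambda i: one_max(population[i]))
        # Assumption: the child enters only if it is no worse than the worst.
        if one_max(child) >= one_max(population[worst]):
            population[worst] = child
    end_avg = sum(map(one_max, population)) / pop_size
    return start_avg, end_avg
```

Calling run_ga() returns the average fitness of the random starting population and of the population after the mating steps, so the two values can be compared directly.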
Friedman's MARS algorithm is a statistical technique for modeling data. It provides an error measure, called the lack-of-fit (LOF) score, that automatically penalizes models with too many features. It also inspired the use of splines as a powerful tool for nonlinear modeling.
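This guide does not state the LOF formula. The sketch below assumes the form commonly quoted for GFA, LOF = LSE / (1 - (c + d*p) / M)^2, where LSE is the least-squares error, c is the number of basis functions, p is the number of features they contain, M is the number of samples, and d is the smoothing parameter. The function and argument names are hypothetical.

```python
def lof_score(lse, n_basis, n_features, n_samples, d):
    """Lack-of-fit score under the assumed form LSE / (1 - (c + d*p) / M)**2.

    lse        -- least-squares error of the model
    n_basis    -- c, the number of basis functions in the model
    n_features -- p, the number of features used by those basis functions
    n_samples  -- M, the number of samples in the training set
    d          -- smoothing parameter; larger values penalize size more
    """
    shrink = 1.0 - (n_basis + d * n_features) / n_samples
    if shrink <= 0:
        raise ValueError("model is too large for this number of samples")
    return lse / shrink ** 2
```

Under this assumed form, two models with the same LSE receive different scores if one has more terms, and raising d widens that gap. This is the sense in which the score penalizes models with too many features.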
The GFA algorithm uses a genetic algorithm to perform a search over the space of possible QSAR/QSPR models using the LOF score to estimate the fitness of each model. Such evolution of a population of randomly constructed models leads to the discovery of highly predictive QSARs/QSPRs.
The GFA approach has a number of important advantages over other techniques:
The G/PLS algorithm uses GFA to select appropriate basis functions to be used in a model of the data and pls regression to weight the basis functions' relative contributions in the final model. Application of G/PLS allows the construction of larger QSAR equations while avoiding overfitting and eliminating most variables.
G/PLS is run in the same way as GFA. To set up a G/PLS calculation, you must specify the G/PLS method and certain other settings in the Configure GFA control panel. For more information, see Setting genetic analysis preferences on page 240.
Starting a genetic analysis
Packaged as a separate Cerius2 module, C2·Genetic Analysis can be used as another of the statistical methods available in QSAR+ for generating QSAR equations.
To use GFA as the statistical method for a QSAR analysis, choose GFA from the Method popup at the top of the study table, then click the Run button on the QSAR tool bar. Configuring GFA options is described in Genetic function approximation on page 208.
Performing a genetic analysis
When C2·Genetic Analysis is installed, the parameters that control the processing performed by the GFA algorithm are set to do a reasonable job of building predictive equations. This section describes the processing that occurs by default when you start a genetic analysis as described in the previous section. This section also points you to other chapters that explain how you can change the built-in settings and control the genetic analysis process.
Performing a genetic analysis using the built-in settings consists of five basic steps.
1. Starting the analysis
Before you start the genetic analysis:
2. Building the initial population
The analysis begins by building a population of 100 randomly constructed equations. These random equations are displayed in the equation viewer.
You can change the size of the initial equation population, as well as change both the number and type of terms to be used in each of these initial equations. For more information, see Setting genetic analysis preferences.
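Building the initial population can be pictured with the following Python sketch. It is a simplified stand-in, not the product's implementation: each equation is reduced to a tuple of descriptor names, and the default of four terms per equation is an assumption chosen for the example. Only the population size of 100 comes from the text above.

```python
import random

def random_equation(descriptors, n_terms, rng):
    """One random equation: n_terms distinct descriptors, stored sorted."""
    return tuple(sorted(rng.sample(descriptors, n_terms)))

def build_population(descriptors, pop_size=100, n_terms=4, seed=None):
    """Build pop_size randomly constructed equations.

    n_terms=4 is a placeholder value for illustration only.
    """
    rng = random.Random(seed)
    return [random_equation(descriptors, n_terms, rng)
            for _ in range(pop_size)]
```

For example, build_population(["logP", "MW", "HBD", "HBA", "PSA", "Rotatable"], seed=1) returns 100 equations of four terms each; the descriptor names here are hypothetical.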
3. Evolving the population
The initial population is then evolved for 5000 generations. Evolving the population means that, for each generation, two better-scoring equations are selected as parents. Parts of each parent equation are then used to create a child equation. Optional equation-mutation operations may be performed on the child when it is created. The worst-rated equation is then replaced by the new child equation.
You can increase or decrease the number of generations that the initial equation population is evolved. For more information, see Working with the current equation population on page 238. You can also specify a variety of equation mutations that can occur, as well as determine the probability that each type of mutation is attempted in a generation. For more information, see Setting genetic analysis preferences on page 240.
By changing the smoothing parameter d, you can control the bias in the scoring factor between equations with different numbers of terms. For more information, see Setting genetic analysis preferences on page 240.
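The evolution step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the GFA implementation: equations are tuples of term names, score is any caller-supplied error measure where lower is better (as with an LOF-style score), parents are chosen by a three-way tournament, and the optional mutation operations are omitted.

```python
import random

def crossover(mother, father, rng):
    """Child takes a leading part of one parent and a trailing part of the other."""
    cut_m = rng.randrange(1, len(mother) + 1)
    cut_f = rng.randrange(0, len(father))
    child = list(mother[:cut_m])
    for term in father[cut_f:]:
        if term not in child:  # avoid duplicate terms in the child
            child.append(term)
    return tuple(child)

def evolve(population, score, generations=5000, seed=None):
    """Steady-state evolution: one child per generation replaces the worst."""
    rng = random.Random(seed)
    population = list(population)
    for _ in range(generations):
        # Select two better-scoring equations as parents (tournament of 3,
        # an assumption; the selection rule is not specified in this guide).
        mother = min(rng.sample(population, 3), key=score)
        father = min(rng.sample(population, 3), key=score)
        child = crossover(mother, father, rng)
        # The worst-rated equation is replaced by the new child.
        worst = max(range(len(population)),
                    key=lambda i: score(population[i]))
        population[worst] = child
    return population
```

Because only the worst-rated equation is ever replaced, the best score in the population cannot get worse from one generation to the next in this sketch.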
4. Reviewing the evolved equations
Finally, the evolved equation population is displayed in the equation viewer, sorted by LOF (that is, lack of fit) score. You can now use the equation viewer to scroll through the equations, look at the statistics associated with each equation, sort the equations by other error measures (LSE and r2, for example), and graph various equations.
Genetic analysis produces a graph that shows how frequently each term (that is, descriptor) is used across all equations in the final population.
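The data behind such a term-usage graph is a simple tally. The following Python sketch shows one way to compute it, assuming each equation is available as a collection of term names; the function name and the descriptor names in the example are hypothetical.

```python
from collections import Counter

def term_usage(population):
    """Count how many equations in the population use each term."""
    counts = Counter()
    for equation in population:
        # set() ensures a term is counted once per equation.
        counts.update(set(equation))
    return counts

final_population = [("logP", "MW"), ("logP", "HBD"), ("logP",)]
usage = term_usage(final_population)
```

Here usage["logP"] is 3 because that term appears in all three equations, while "MW" and "HBD" each appear in one. Terms that dominate the final population are the ones the search found most useful.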
5. Using the equations
Once the genetic analysis is complete, you can:

If you repeat this step, QSAR+ discards the previous equation population, randomly generates a new equation population, and then evolves that new population.
However, if you want to continue the evolution from the point at which the previous evolution left off, or to refresh the current equation population, work with the current equation population instead.
This section describes the following activities related to working with the current equation population:
To work with the current equation population
Select Preferences/Statistical Methods... on the study table menu bar and be sure GFA is selected in the Statistical Method popup.
You use this control panel both to continue evolving and to randomize the current population.
Continuing the evolution of the current population
You can continue the evolution of the current equation population, as well as specify the number of generations for which the current equation population will be evolved. By default, the population of equations is evolved for 5000 generations. However, for larger datasets, as well as for situations where you want to perform a more thorough search through the space of possible equations, 5000 generations may not be enough. Or, after performing a preliminary analysis using the default values or adjusted values (as described in Setting genetic analysis preferences on page 240), you may want to evolve a large number of generations over an extended period of time (overnight, for example).
2. Click the Run icon. When you do so, the following occurs:
When C2·Genetic Analysis is installed, the default values that control the processing performed by the genetic function approximation (GFA) algorithm are set to do a reasonable job of building predictive equations. However, you can change these default values to meet your specific requirements. Doing so enables you to exercise detailed control over each analysis that you perform. For example, you can determine the size of the equation population, specify the types of terms that can be used in each equation, select from a variety of equation mutation operations that can occur as the equation population evolves, and specify the probability that each type of mutation is attempted in a generation.
Setting genetic analysis preferences
This section explains how you can exercise control over a genetic analysis:
Opening the Configure GFA control panel
You change the default values for genetic analysis by using the Configure GFA control panel. Select Configure on the GENETIC ANALYSIS (GFA) card or select Configure GFA... from the Statistical Method Preferences control panel (be sure the Statistical Method is set to GFA). The Configure GFA control panel allows you to specify the type of terms to be used in equations, mutation probabilities, and other preferences for running the algorithm.
You use this control panel to perform the activities described in the remainder of this section.
Selecting equation term types
You can specify the types of terms that can be used to construct equations both when GFA (or G/PLS) creates a random equation population and when the algorithm attempts to add new terms randomly to child equations through the Add New Term mutation (as described in Specifying mutation probabilities on page 243). By default, only linear polynomial terms are used. This provides for the construction of standard linear equations.
You can choose from among five different types of equation terms, as follows:
The icon is highlighted to indicate that equation terms of that type can be used to construct equations.
To deselect an equation term type
Click a highlighted icon to indicate that terms of that type cannot be used to construct equations.
Specifying mutation probabilities
In GFA, mutation is the process of changing a child equation at "birth" to encourage a more thorough search through the space of possible equations that can be constructed. You can choose from a variety of possible equation-mutation operations by specifying the probability that each selected mutation is attempted in a generation.
Probability refers to the percentage of the time after a child equation is created that a mutation is attempted. If the attempted mutation lowers the fitness score of the child equation, that mutation is not kept. Instead, the original child equation is allowed to proceed. This makes high mutation-probability values relatively safe because, at worst, equation-mutation operations can cause no harm.
You can choose from among the following equation-mutation operations:
Use the slider or the entry box after the name of each equation-mutation operation to specify the probability that the specified mutation is attempted in a generation. The probability value should be an integer between 0 and 100.
Equation-mutation operations with a probability value of 0 are not performed on a child equation.
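The way a mutation probability is applied can be sketched in Python as follows. This illustrates the behavior described in this section and is not the product's code: mutate and fitness are caller-supplied placeholders, and fitness is assumed to be a score where higher is better.

```python
import random

def maybe_mutate(child, mutate, fitness, probability, rng):
    """Attempt one mutation with the given probability (a percentage, 0-100).

    The mutated equation is kept only if it does not lower the fitness
    score; otherwise the original child proceeds unchanged.
    """
    if rng.random() * 100 >= probability:
        return child  # no attempt this time; always the case at probability 0
    mutant = mutate(child, rng)
    return mutant if fitness(mutant) >= fitness(child) else child
```

With a probability of 0 the mutation is never attempted, and with 100 it is attempted for every child. In either case a harmful mutation is discarded, which is why high probability values are relatively safe.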
Specifying other genetic analysis preferences
You can specify other preferences that control the processing performed by the GFA algorithm:
To establish the population size
Enter the appropriate number of equations in the Population Size entry box. This value takes effect the next time an equation population is built.
Setting the smoothing parameter d
You can control the bias in the scoring factor between equations with different numbers of terms. Through the smoothing parameter d, you adjust the penalty reflected in the score of equations due to their size. For example, you can:
Enter the appropriate value in the Smoothness (d) entry box.
Setting the number of equation terms
The number of terms to have in randomly constructed equations (that is, the equation length) should be your best estimate of the appropriate equation length.
You can do either of the following:
To specify equations of fixed length,
Select Fixed Length Equations from the popup. An entry box is displayed that is used to specify the length of the equation.
Setting the regression method
You can generate equations using least-squares or partial least-squares (pls) as the regression method. The latter method allows models with more variables to be generated and is especially useful for extremely wide datasets in which the useful information is spread over a large number of variables, such as those derived from field analysis (MFA).
For the GFA method, the default is least squares. For G/PLS, the default is pls, with four components and no data scaling.
You can change the default using the Statistical Method popup. If you choose PLS, you also can set the number of components and the type of data scaling to perform.
Used for generating QSAR models, genetic partial least squares (G/PLS) is derived from two other methods: genetic function approximation (GFA) and partial least squares (pls). Both GFA and pls are valuable analytical tools for datasets that have more descriptors than samples. 
Using genetic partial least squares
The G/PLS algorithm uses GFA to select appropriate basis functions to be used in a model of the data and pls regression as the fitting technique to weigh the basis functions' relative contributions in the final model. Application of G/PLS allows the construction of larger QSAR equations while avoiding overfitting and eliminating most variables.
You should have added molecular structures, biological activity information, and descriptor values to a study table. Additionally, you should have selected the dependent and independent variables for your analysis.
Setting up and running a G/PLS calculation
2. If you want to adjust the default G/PLS parameters, click Configure GFA.... The Configure GFA control panel appears. For detailed information on preferences, see Setting genetic analysis preferences on page 240.
To change the configuration for G/PLS.
Setting G/PLS preferences
The G/PLS calculation is specified by selecting the appropriate preferences from the Configure GFA control panel: