
Into the Twilight Zone

In addition to working on ways to organize the annotated human genome, Abagyan works actively to contribute to the annotation itself, using homology modeling and docking.

Homology modeling has traditionally been a tool for determining which functional family a gene or a protein belongs to by comparing a sequence from an unknown protein or gene to a database of known entities.

Obviously, no two genes will be exactly alike, but we can try to predict whether the unknown protein will adopt a fold similar to that of a known one.

Occasionally, a conserved active site is enough to identify an unknown gene’s function. In fact, HIV-1 protease was identified shortly after the sequence of the HIV genome was published because its gene contained a known aspartic proteolytic enzyme motif.

But even without an obvious active site, two nearly identical DNA sequences from two different organisms would almost certainly code for proteins of similar function and nearly identical fold. The differences would lie in the conformations of the side chains and of the loops in the peptide backbone.

The problem arises in what is referred to as the "twilight zone." Typically, scientists employ an arbitrary cutoff: any two genes whose sequences are identical to at least a certain degree, say thirty percent, are treated as homologous. Conversely, any two genes whose identity falls below this cutoff are ignored.

But the sequence similarity disappears long before the structural or functional similarities do, and two genes that have only fifteen to thirty percent identity may code for proteins that have the same function, even though they would be missed by a homology search. These false negatives are said to be in the twilight zone.
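The cutoff logic described above can be sketched in a few lines of Python. This is only an illustration: the function names, the gap character "-", and the choice to count identity over ungapped columns are assumptions made here, and the thirty and fifteen percent thresholds are simply the example figures quoted in the text.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns (an assumption; other conventions exist)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def classify(identity: float, cutoff: float = 30.0, twilight_floor: float = 15.0) -> str:
    """Label a pair using the example thresholds from the text."""
    if identity >= cutoff:
        return "homologous"
    if identity >= twilight_floor:
        return "twilight zone"  # would be missed by a simple cutoff search
    return "below twilight zone"


if __name__ == "__main__":
    identity = percent_identity("ACDE-FGH", "ACDEYFGA")
    print(round(identity, 1), classify(identity))
```

A pair that scores, say, twenty percent falls under the cutoff and would be discarded by a simple search, even though the two proteins may still share a fold and a function.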

"One of the goals is to be able to see in the twilight zone," says Abagyan. He works on new procedures to align sequences involving large gaps, dissimilar fragments in the middle of an alignment, and iterative chains of sequence comparisons. He proposed the "multilink recognition" algorithm in 1996 and used it to recognize remote similarities.

These predictions can then be given to biochemists, who can test whether the assumed function of the enzyme is correct. Furthermore, recognized similarities may indicate similar folds and similar crystallization conditions, information that can be given to structural biologists to speed up their work.
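The idea of "iterative chains of sequence comparisons" can be pictured with a toy example: two sequences that are too dissimilar to match directly may still be connected through an intermediate that resembles both. The sketch below is not the published multilink recognition algorithm, whose details are not given here. It is a plain breadth-first search over hypothetical pairwise identity values, and the names `find_chain` and `identity` are invented for illustration.

```python
from collections import deque


def find_chain(query, target, identity, cutoff=30.0):
    """Return a list of sequence names linking query to target, or None.

    `identity` maps (name_a, name_b) pairs to percent identity. Only pairs at
    or above `cutoff` count as links. This is a generic graph search, used
    here purely to illustrate chaining through intermediates.
    """
    neighbors = {}
    for (a, b), value in identity.items():
        if value >= cutoff:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    queue = deque([[query]])
    seen = {query}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    # Hypothetical values: A and C share only 18 percent identity,
    # but both are clearly related to B.
    identity = {("A", "B"): 45.0, ("B", "C"): 38.0, ("A", "C"): 18.0}
    print(find_chain("A", "C", identity))
```

In this made-up case the direct comparison of A and C falls in the twilight zone, yet the chain through B suggests a remote relationship worth testing.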

Model Building and Protein-Ligand Docking

Another tool Abagyan uses to annotate the genome is docking. “You take a protein,” he says, “and you ask, ‘What small molecule binds to it?’ and ‘Can I design a small molecule that will inhibit it?’”

The basis for docking ligands to certain receptors comes from knowledge of the atomic structures of the receptors themselves, which is acquired through biophysical techniques such as X-ray crystallography and nuclear magnetic resonance. Experimentally determined structures may not be necessary for each receptor, though. "Very often the active site is close enough to the homologue that binding studies can be done without having a complete structure," says Abagyan. "It gives you something to work with."

Given the structure, a model by homology, or a presumed binding site model of the target, one tries to insert any number of ligands into the binding site, perhaps scoring them according to how well they fit in that site.

This is all done computationally. The structures of the two molecules are represented in a three-dimensional coordinate system in which all of their parts can interact electrostatically, sterically, hydrophobically, and through hydrogen bond formation. The possible conformations are then searched for the global free energy minimum, the so-called best fit. The best-fitting ligand is the one that makes the most favorable interactions with the binding site.
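To give a rough sense of what such an interaction score looks like, the sketch below sums a steric term (a Lennard-Jones 12-6 potential) and an electrostatic term (Coulomb's law) over every ligand-receptor atom pair. It is a heavily simplified stand-in, not the scoring function used in Abagyan's software: hydrophobic and hydrogen-bond terms are omitted, and the parameter values (`epsilon`, `sigma`, `dielectric`) are illustrative assumptions.

```python
import math


def pair_energy(r, q1, q2, epsilon=0.2, sigma=3.5, coulomb_k=332.0, dielectric=4.0):
    """Toy energy (kcal/mol) for two atoms r angstroms apart with charges q1, q2.

    All parameter defaults are illustrative assumptions. r must be positive.
    """
    # Steric term: repulsive at short range, weakly attractive near contact.
    lj = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    # Electrostatic term: negative (favorable) for opposite charges.
    coulomb = coulomb_k * q1 * q2 / (dielectric * r)
    return lj + coulomb


def interaction_energy(ligand_atoms, receptor_atoms):
    """Sum pair energies over all ligand-receptor atom pairs.

    Each atom is a tuple (x, y, z, charge). Lower totals mean a better fit.
    """
    total = 0.0
    for lx, ly, lz, lq in ligand_atoms:
        for rx, ry, rz, rq in receptor_atoms:
            r = math.dist((lx, ly, lz), (rx, ry, rz))
            total += pair_energy(r, lq, rq)
    return total


if __name__ == "__main__":
    ligand = [(0.0, 0.0, 0.0, 0.5)]
    receptor = [(4.0, 0.0, 0.0, -0.5), (0.0, 4.0, 0.0, -0.5)]
    print(interaction_energy(ligand, receptor))
```

A docking search would evaluate a function of this general kind for many ligand positions and conformations and keep the lowest-energy one.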

Applying this basic technique, Abagyan generally screens target receptors against hundreds of thousands of commercially available compounds. The flexible docking procedure samples hundreds of possible conformations of each ligand in the surface pockets of the receptor and assigns a score to the ligand. The scores are used to rank the entire chemical collection. The end result is several dozen or several hundred virtual inhibitors: lead compounds that can then be taken into the laboratory and used as inhibitors or as scaffolds for creating better inhibitors.
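The ranking step can be sketched as follows, assuming each compound has already been docked and each sampled conformation has a numeric score where lower means better. The compound names, the score values, and the function `rank_library` are hypothetical; the sketch shows only the bookkeeping, not the docking itself.

```python
def rank_library(conformer_scores, top_n=3):
    """Rank compounds by their best (lowest) conformation score.

    `conformer_scores` maps a compound name to the list of scores obtained
    for its sampled conformations. Compounds with no scores are skipped.
    Returns the top_n (name, best_score) pairs, best first.
    """
    best = {
        name: min(scores)
        for name, scores in conformer_scores.items()
        if scores
    }
    ranked = sorted(best.items(), key=lambda item: item[1])
    return ranked[:top_n]


if __name__ == "__main__":
    # Made-up scores for four hypothetical compounds.
    library = {
        "cmpd_A": [-5.1, -7.3, -6.0],
        "cmpd_B": [-2.4, -3.0],
        "cmpd_C": [-9.8, -4.2, -8.1],
        "cmpd_D": [],
    }
    for name, score in rank_library(library, top_n=2):
        print(name, score)
```

With a real library the same bookkeeping would run over hundreds of thousands of compounds, and only the top few dozen or hundred would go on to laboratory testing.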

The most difficult part, says Abagyan, is following up on these computational techniques with experimental ones—structural studies, synthesis of lead candidates, molecular binding assays, cell-based assays, and so forth.

"You can’t do this alone," he says.

“Bioinformatics today has to be centered on the human genome sequence,” says Abagyan.















See also:

National Center for Biotechnology Information

Ruben Abagyan's Laboratory

Human Genome Project Working Draft at University of California, Santa Cruz

The publicly funded Genome International Sequencing Consortium's human genome sequence, as published in Nature

The privately funded Celera Genomics' human genome sequence, as published in Science Magazine