Interpreting the Human Genome Through Computational Biology
He went through a brand new planet. Paris rebuilt. Ten thousand incomprehensible avenues.
It has been called nothing short of the largest basic science undertaking ever, an erudite feat with a guaranteed place at the table of history, no less impressive than building the atomic bomb or landing Neil Armstrong and company on the moon: the sequencing of the human genome. Sequenced to over 93 percent completion and 99 percent accuracy last year by public and private efforts, the genome was published last week by both groups in watershed papers in the two major scientific journals.
Gaps and errors notwithstanding, the sequence shows the correct order and location of more than 3.12 billion base pairs of DNA along the 23 chromosome pairs that comprise the totality of human genetic information.
Yet at the moment the genome remains deeply shrouded.
Sequencing the genome is just the beginning of the story. We must now tackle the no less impressive undertaking of figuring out what the genome means. Where are the genes and what are they doing? How is the expression of these genes regulated? How do our genes compare with similar genes in other organisms? How do different gene products interact with each other? How do slight changes in them lead to heritable traits and diseases? Can we make pharmaceuticals to target the genes or their products, turning them on or off to treat illnesses?
These questions, says Professor Ruben Abagyan of The Scripps Research Institute (TSRI) Department of Molecular Biology, are going to launch a thousand inquiries in a thousand labs across the country. He compares the coming flurry to the 1849 California gold rush.
For this reason, Abagyan and his colleagues at TSRI are among those who are preparing to take on the important task of finding ways to annotate the genome and find hidden treasures in it, i.e. to mine the genome.
Annotation is everything from identifying the genes within the sequence to finding their function, functional families, structures, interacting proteins, and ligands. Annotation is the interpretation of the information, the meaning of the words, the knowledge rather than the data.
Finding new ways of organizing this information is necessary because as scientists discover more and more about all the parts of the genome, the amount of information explodes and fragments.
Scientists submit their data to many separate databases, each with its own specialty: genomic information from different species; genes; proteins and protein families; expression levels and tissue distributions; individual sequence differences (SNPs) and associated phenotypes; small biologically active molecules; and, finally, three-dimensional structures of biological polymers and their complexes.
A Whole Less than the Sum of its Parts
How can computational biologists create the systems to access this information in a meaningful way?
"The first step," says Abagyan, "is creating an environment where you can actually browse all the informationquery, extract, and analyze as you wish."
This is no easy task. Connecting each gene to other databases via ordinary hypertext links, as one might imagine doing at first, would not be a realistic way to organize the information. An annotated gene would point to several different items in several different databases, each with its own arbitrary format, naming convention, and construction.
For instance, a single human Ig domain in a single gene would point to hundreds of other Ig domains in other human genes and thousands in other organisms. The domain may also be linked to similar domains in a domain database and to thousands of protein domains in a structure database. The gene itself may be linked to countless other genes through similarities in sequence, structure, function, family, chromosome, or organism.
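The combinatorics behind this can be sketched in a few lines of code. The domain counts below are illustrative assumptions, not real database figures; the point is only that fully cross-linking every similar domain scales quadratically with the number of family members:

```python
# Hypothetical sketch: how many hypertext links a naively cross-linked
# genome browser would need for one shared domain family.
# All counts below are invented for illustration.

def pairwise_links(n_members: int) -> int:
    """Every domain linked to every other: n*(n-1)/2 undirected links."""
    return n_members * (n_members - 1) // 2

# Suppose an Ig-like domain appears in 500 human genes and in
# 5,000 more genes across other sequenced organisms.
human_only = pairwise_links(500)
all_organisms = pairwise_links(5500)

print(f"links among human Ig domains:       {human_only:,}")
print(f"links once other genomes are added: {all_organisms:,}")
```

Adding one more genome's worth of homologs multiplies the link count rather than merely adding to it, which is why the article's next paragraph dismisses surfing the genome this way.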
Perhaps one could surf the genomic web in this form, but who would want to? Each gene would be endlessly linked to a plurality of self-referential sites, and each new organism would add another order of complexity to the tangled mess. "After three genes, you'll be exhausted," says Abagyan.
One could also simply look within one database or another for genomic information. But this would deny the promise of discovery that is the human genome sequence. "If you just take a piece of the genome in isolation, it's not interesting, basically," says Abagyan. "These [databases] have all the little bits and pieces that we have to put together to make the genome alive. Otherwise, it's just a sequence."
Abagyan and his TSRI colleagues are creating an environment in which all the information can be sorted and the redundancies removed. The individual databases must be able to be consumed, combined, digested, and displayed in a standardized, relational form. "Right now it's a complete mess," he says. "The entire deck is shuffled."
The ultimate goal is to produce a functional map of the human genome in which all the genes are identified and understood: a protein catalog of clustered genes, represented by a hierarchical set of folders with a standardized set of annotations and conventions within, and one that would contain tenfold less information than all its constituent databases.
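A crude illustration of the redundancy-removal step behind such a catalog: greedy single-pass clustering by pairwise identity, in the spirit of standard sequence-clustering tools. The toy identity measure, the sequences, and the 90 percent threshold are all assumptions made for the example:

```python
# Sketch of redundancy removal: assign each sequence to the first
# cluster representative it matches above an identity threshold.
# Sequences and the threshold are invented for illustration.

def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (toy metric)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def cluster(seqs: dict[str, str], threshold: float = 0.9) -> dict[str, list[str]]:
    """Greedy single-pass clustering; keys are cluster representatives."""
    clusters: dict[str, list[str]] = {}
    for name, seq in seqs.items():
        for rep in clusters:
            if identity(seqs[rep], seq) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]   # no match found: new representative
    return clusters

seqs = {
    "geneA":  "MKTAYIAKQR",
    "geneA2": "MKTAYIAKQK",   # near-duplicate of geneA
    "geneB":  "GSHMLEDPVR",
}
print(cluster(seqs))   # geneA2 folds into geneA's cluster
```

Collapsing near-duplicates this way is one concrete route to a catalog holding far less information than its constituent databases.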
Abagyan believes that such a map is not that far off, a few years, perhaps. "It feels like it's within reach," he says.
Into the Twilight Zone
In addition to working on ways to organize the annotated human genome, Abagyan works actively to contribute to the annotation itself, using homology modeling and docking.
Homology modeling has traditionally been a tool for determining which functional family a gene or a protein belongs to by comparing a sequence from an unknown protein or gene to a database of known entities.
Obviously, no two genes will be exactly alike, but one can try to predict whether an unknown protein will adopt a fold similar to that of a known one.
Occasionally, a conserved active site is enough to identify an unknown gene's function. In fact, HIV-1 protease was identified shortly after the sequence of the HIV genome was published because its gene contained a known aspartic proteolytic enzyme motif.
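A minimal sketch of this kind of motif-based annotation, assuming the classic Asp-Thr/Ser-Gly (D[TS]G) signature of aspartic proteases and an invented query sequence (not the real HIV-1 protease):

```python
import re

# Scan a protein sequence for the Asp-Thr/Ser-Gly motif found in the
# active site of aspartic proteases. The query below is a made-up
# translated ORF fragment, used only to illustrate the idea.

ASP_PROTEASE_MOTIF = re.compile(r"D[TS]G")

def find_motif(seq: str) -> list[int]:
    """Return 0-based positions where the motif occurs."""
    return [m.start() for m in ASP_PROTEASE_MOTIF.finditer(seq)]

query = "MSLWQPLVDTGAMRE"   # hypothetical ORF fragment
print(find_motif(query))    # -> [8]: a D-T-G starting at position 8
```

Real annotation pipelines use curated motif and profile libraries rather than a single regular expression, but the principle is the same: a short conserved signature can flag a function in an otherwise unknown sequence.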
But even without an obvious active site, two nearly identical sequences of DNA from two different organisms would almost certainly code for proteins of similar function and near-identical fold; the differences would lie in the conformation of the side chains and the loops of the peptide backbone.
The problem arises in a region referred to as the "twilight zone." Typically, scientists employ an arbitrary cutoff: any two genes that are similar to a certain degree, say thirty percent identity, will be treated as homologous. Conversely, any two genes whose similarity falls below this cutoff will be ignored.
But the sequence similarity disappears long before the structural or functional similarities do, and two genes that have only fifteen to thirty percent identity may code for proteins that have the same function, even though they would be missed by a homology search. These false negatives are said to be in the twilight zone.
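The cutoff logic can be made concrete with a toy example. The sketch below assumes the two sequences are already aligned and uses a bare percent-identity count; real searches use scoring matrices and statistical significance, both omitted here:

```python
# Toy percent-identity calculation over a pre-aligned sequence pair,
# with the arbitrary 30% threshold described in the text.
# The aligned sequences are invented for illustration.

def percent_identity(aln_a: str, aln_b: str) -> float:
    """Identity over aligned columns, skipping gap-gap columns."""
    cols = [(a, b) for a, b in zip(aln_a, aln_b) if not (a == "-" and b == "-")]
    matches = sum(a == b and a != "-" for a, b in cols)
    return 100.0 * matches / len(cols)

CUTOFF = 30.0   # the arbitrary threshold from the text

aln_a = "MKVDLITEAWSG"
aln_b = "GRPQLLTGHGTC"
pid = percent_identity(aln_a, aln_b)
print(f"{pid:.1f}% identity ->", "homolog" if pid >= CUTOFF else "twilight zone")
```

At roughly 17 percent identity this pair falls below the cutoff and would be discarded, even though, as the text notes, such a pair may still share fold and function.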
"One of the goals is to be able to see in the twilight zone," says Abagyan. He works on new procedures to align sequences involving large gaps, dissimilar fragments in the middle of an alignment, and iterative chains of sequence comparisons. He proposed the "multilink recognition" algorithm in 1996 and used it to recognize remote similarities.
These indicators can then be given to biochemists, who can determine experimentally whether the assumed function is correct. Furthermore, recognized similarities may indicate similar folds and similar crystallization conditions, information that can be given to structural biologists to speed up their work.
Model Building and Protein-Ligand Docking
Another tool Abagyan uses to annotate the genome is docking. You take a protein, he says, and you ask, "What small molecule binds to it?" and "Can I design a small molecule that will inhibit it?"
The basis for docking ligands to certain receptors comes from knowledge of the atomic structures of the receptors themselves, which is acquired through biophysical techniques, such as x-ray crystallography and nuclear magnetic resonance. Experimentally determined structures may not be necessary for each receptor, though. "Very often the active site is close enough to the homologue that binding studies can be done without having a complete structure," says Abagyan. "It gives you something to work with."
Given the structure, a model by homology, or a presumed binding site model of the target, one tries to insert any number of ligands to the binding site, perhaps scoring them according to how well they fit in that site.
This is all done computationally. The molecular structures are represented in a three-dimensional coordinate system in which all parts of the two molecules can interact electrostatically, sterically, hydrophobically, and through hydrogen bond formation, and possible conformations are searched to find the global free energy minimum, the so-called best fit. The best-fitting ligand is the one that makes the most favorable interactions with the binding site.
Applying this basic technique, Abagyan typically screens target receptors against hundreds of thousands of commercially available compounds. The flexible docking procedure samples hundreds of possible conformations of each ligand in the surface pockets of the receptor and assigns the ligand a score, which is used to rank the entire chemical collection. The end result is several dozen to several hundred virtual inhibitors: lead compounds that can then be taken into the laboratory and used as inhibitors, or as scaffolds to create better inhibitors.
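In outline, the ranking step looks something like the sketch below. The compound names and the scoring function are invented placeholders; in a real screen the score would come from sampling three-dimensional conformations, not from precomputed fields:

```python
# Toy virtual-screening sketch: score each candidate ligand against a
# target and rank the library, keeping the top hits as candidate leads.
# Names, fields, and weights are made up for illustration.

def dock_score(ligand: dict) -> float:
    """Lower is better: a stand-in for a docking energy, combining
    pretend precomputed interaction terms."""
    return ligand["steric_clash"] - ligand["h_bonds"] * 1.5 - ligand["hydrophobic"]

library = [
    {"name": "cmpd-001", "h_bonds": 3, "hydrophobic": 2.0, "steric_clash": 1.0},
    {"name": "cmpd-002", "h_bonds": 1, "hydrophobic": 0.5, "steric_clash": 4.0},
    {"name": "cmpd-003", "h_bonds": 4, "hydrophobic": 3.5, "steric_clash": 0.5},
]

ranked = sorted(library, key=dock_score)      # best (lowest) score first
leads = [lig["name"] for lig in ranked[:2]]   # keep the top candidates
print(leads)                                  # -> ['cmpd-003', 'cmpd-001']
```

Scaled up from three compounds to hundreds of thousands, the same sort-and-truncate step is what turns a chemical collection into the short list of virtual inhibitors described above.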
The most difficult part, says Abagyan, is following up on these computational techniques with experimental ones: structural studies, synthesis of lead candidates, molecular binding assays, cell-based assays, and so forth.
"You can't do this alone," he says.