Science Talk:
After the Genome


"The Genome, We Are Sure, Is Packed with Subtleties"

Paul Schimmel, Professor, Department of Molecular Biology

It's very exciting. None of us know what treasures lie beneath the sequence.

There has been a huge capital investment, not only by the government through the National Institutes of Health, the National Cancer Institute, and the National Science Foundation, but also through private foundations, like the American Cancer Society, the Howard Hughes Medical Institute, the Wellcome Trust, [and] private industry, particularly on the entrepreneurial side.

All of [these organizations] have large programs trying to understand, at the beginning, the function of the proteins that are encoded by the human genome. Rarely have we seen capital investment coming from so many corners focused on one problem.

Over the next 10 years, I believe probably 90 percent of the proteins will have an assigned function. Maybe that's too optimistic, but it's certainly within reach.

What is much harder to come to grips with is how [to] put it all together to make an organism. How does this all fit together? There are approaches being used by diverse groups trying to knock out genes and relate them to phenotypes, particularly related to embryonic development and differentiation.

The genome, we are sure, is packed with subtleties—the expression on your face, body language, intuitive faculties, gestures, the things that we do that we don't even think about—these are things that we don't understand at all in a detailed sense as they relate to the genome, but more and more we're getting the feeling that [these subtleties] are genetically encoded. They are part of this array that we just don't understand.

That's where the advances need to be made. What are these genes? Even if you know the proteins, how do they work to generate a highly sophisticated organism?

I think that we will have all the pieces to the jigsaw puzzle figured out ("this must be part of a lake and this must be part of a forest over here, and this must be part of a house over here"). Putting them together to get the whole picture is very difficult.

How long that will take is harder [to say]. Will it be in the next 100 years? That's a good question. I do believe the end result will be that humans will have a sense of how you go from a puffer fish to a mouse to a human, organisms with similar numbers of genes and many of the same genes, but obviously [leading to] very different outcomes.

Functional Analysis and Genetic Diversity in Yeast and Malaria

Elizabeth Winzeler, Assistant Professor, Department of Cell Biology

One of the big areas of investigation in the post-genome era will be assigning function to the genes that are predicted in the genome project. One of the techniques that I am most familiar with is expression profiling. As genome sequences become available, it's easy to create arrays [of various nucleotide sequences] that can then interrogate every gene in the genome. Then, by hybridizing the RNA from different tissues or disease states or different stages of an organism's life cycle, you can start determining when a gene is probably transcriptionally active, and that actually gives you quite a bit of information about the potential functional role for that gene.

This can really go a long way towards narrowing down the list of potentially interesting targets that you might want to concentrate on if you are involved in the drug discovery process.

I started working on post-genome functional analysis in [the budding yeast] Saccharomyces in 1996, right after the genome sequence was released, and I became involved in a number of different projects—developing tools for expression profiling as well as creating knockout strains for every gene in the yeast genome. I'm still doing a little bit of yeast research. For example, we [also] recently used oligonucleotide arrays to map all of the chromosomal origins of DNA replication—there are about 400 in yeast—by isolating DNA fractions that were enriched for origin activity and then hybridizing the fractions to high density oligonucleotide arrays.

We have also used oligonucleotide arrays to study genetic diversity in yeast. Usually, only one strain or individual representative from a particular organism is sequenced. By comparing the patterns which result when genomic DNA is hybridized to arrays, we can find out how closely related different strains are. I've looked at 10 or 11 different yeast isolates. I think this technology is going to be very interesting to population geneticists in the future. You can get a much more descriptive look at the genome, and you can find regions of the genome that are evolving at faster rates.
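The strain comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis actually used in the lab: the probe signals, the isolate names, and the rule that a probe counts as divergent when its signal falls below half of the reference signal are all invented for the example.

```python
# Hypothetical normalized hybridization signals, five probes per strain.
# All names and values are illustrative, not real data.
reference = [1.00, 0.95, 1.10, 1.05, 0.90]
isolates = {
    "isolate_A": [0.98, 0.97, 1.08, 1.00, 0.92],
    "isolate_B": [0.30, 0.96, 0.25, 1.02, 0.20],
}


def fraction_divergent(ref, sample, min_ratio=0.5):
    """Fraction of probes whose signal falls below min_ratio of the reference.

    A large drop in signal is treated here as a sign of sequence variation
    under the probe (an assumed, simplified rule).
    """
    if len(ref) != len(sample):
        raise ValueError("probe lists must have the same length")
    dropped = sum(1 for r, s in zip(ref, sample) if s / r < min_ratio)
    return dropped / len(ref)


# Score every isolate against the sequenced reference strain.
divergence = {name: fraction_divergent(reference, signal)
              for name, signal in isolates.items()}

# List isolates from most to least similar to the reference.
for name, value in sorted(divergence.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.0%} of probes divergent from reference")
```

With real arrays the same idea would run over hundreds of thousands of probes, and stretches of the genome where divergent probes cluster would be candidates for faster-evolving regions.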

In the past couple of years, I've been working on applying this type of technology to organisms that are more difficult to work with and are more relevant to human health. The malaria parasite has a genome that is about twice as large as that of yeast. The sequence has been done for about six months, and the annotations should become available [soon]. The parasite also has both haploid and diploid phases, like Saccharomyces, but has a complex life cycle involving both humans and mosquitoes, is difficult to maintain in culture, and has gene function that cannot be studied using classical forward genetics.

Malaria is a major health problem worldwide. There are 300 million cases a year, and there has been a resurgence in the number of cases because of drug resistance. Many inexpensive anti-malarials are no longer effective.

While genetic studies are difficult, it's relatively easy to get RNA from all the different stages of the parasite's life cycle, and this offers us new ways to study gene function in the parasite. In the past year, I've designed an oligonucleotide array that contains about 500,000 probes to two different Plasmodium genomes [a mouse strain and the human strain]. The array we designed at TSRI arrived a month or two ago, and what we are doing now is collecting RNA samples from many different conditions. We're exposing parasites to drugs to identify new genes involved in [resistance] pathways. We're hybridizing genomic DNA in order to characterize genetic diversity in different field isolates and find out how similar or different the isolates are. Eventually, we'd like to take these tools into the field and map the spread of drug resistance.

If you start doing longitudinal studies after you introduce a new drug, you might be able to identify the drug targets or the mechanisms of resistance, because we predict we will see pockets of variability developing within the genome over time that are associated with the drug's target. This may lead to new knowledge about the mechanisms of drug resistance. If you can start finding the mutations that are associated with drug resistance, then that tells you how to treat patients in the field.

"The Main Reason to Sequence The Genome Was to Facilitate Positional Cloning"

Bruce Beutler, Professor, Department of Immunology

It will take a very long time to close the phenotype gap. The fact is, there are about 34,000 genes, give or take a few thousand. If you add up all the phenotypes known from mutations in humans and from knockouts in mice, you come up with about 5,000. So something like six out of seven genes don't have an essential function attached to them yet.
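The "six out of seven" figure follows directly from the two numbers quoted above. A quick check in Python, using only those round estimates:

```python
total_genes = 34_000       # gene count estimate quoted above
known_phenotypes = 5_000   # human mutations plus mouse knockouts, as quoted

fraction_known = known_phenotypes / total_genes
fraction_unknown = 1 - fraction_known

print(f"genes with a known phenotype:    {fraction_known:.1%}")
print(f"genes without a known phenotype: {fraction_unknown:.1%}")
print(f"six out of seven, for comparison: {6 / 7:.1%}")
```

The two results agree to within about half a percentage point, which is well inside the "give or take a few thousand" uncertainty in the gene count.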

The way that people go about identifying phenotypes now is to mutate every gene in the genome and keep certain phenotypes of interest to them under surveillance. In this way, in principle, one can find every gene that is required for a particular function. Once you have a phenotype, then comes the problem of finding the particular mutation that caused it. That's done by positional cloning. That's where sequencing the genome has been particularly useful.

In fact, the main reason to sequence the genome was to facilitate positional cloning. I think a lot of people don't realize that. It's a rapid way to find the function of genes.

In the old days, when you positionally cloned something, you first had to map the mutation. By following meiosis, you would confine the mutation to a point between two markers on the chromosome—hopefully a very small area, less than a million base pairs long. Second, you would have to clone all the DNA from end-to-end across that area. Third, you would have to find all the genes that were candidates in that area. And finally, you would have to find the mutation.

The sequencing of the genome has made it so that you don't have to do steps two and three anymore. You no longer have to clone all the DNA across the area, because the sequence is known. And you no longer have to look for genes because, in principle, they've all been found and annotated. Now the limiting factor in finding mutations is doing the genetic mapping, and that might take about a year. Then finding the gene, in theory, should be trivial. It used to be that cloning the critical region and identifying candidates would, by themselves, take several years. So things have gotten a lot easier.
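To illustrate why steps two and three collapse once the genome is annotated, here is a minimal Python sketch. Once mapping has confined the mutation between two markers, listing the candidate genes becomes a lookup rather than years of cloning. The gene names, chromosome labels, and coordinates are hypothetical, and a real annotation would come from a genome database rather than a hard-coded list.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # first base of the gene
    end: int    # last base of the gene


# Hypothetical annotation records; names and coordinates are invented.
ANNOTATION = [
    Gene("geneA", "chr4", 1_200_000, 1_250_000),
    Gene("geneB", "chr4", 1_900_000, 1_960_000),
    Gene("geneC", "chr4", 2_400_000, 2_450_000),
    Gene("geneD", "chr7", 1_950_000, 2_000_000),
]


def candidates_in_interval(annotation, chrom, left_marker, right_marker):
    """Return annotated genes that overlap the region between two markers."""
    lo, hi = sorted((left_marker, right_marker))
    return [g for g in annotation
            if g.chrom == chrom and g.start <= hi and g.end >= lo]


# Genetic mapping (step one) has confined the mutation to this interval.
critical_region = candidates_in_interval(ANNOTATION, "chr4", 1_800_000, 2_420_000)
print([g.name for g in critical_region])
```

What remains after the lookup is the final step: sequencing the candidates in the mutant to find the actual change.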

"You Can't Get Too Hung Up On Any One Protein"

Ian Wilson, Professor, Department of Molecular Biology

The overall plan for the Joint Center for Structural Genomics is to try to produce as many new structures as possible. By "new" we mean ones for which you can't predict the fold from the sequence. However, a lot of these will turn out to be similar structures to others. For example, we have recently worked on a protein that is less than 15 percent identical to anything in the Protein Data Bank, and we found out its structure is [almost] identical to another protein.

To start off, we've been concentrating on one organism, Thermotoga maritima, to see how much of it we can clone, express, purify, crystallize, collect synchrotron data, determine the structure, and deposit in the databank. In collaboration with Scott Lesley of GNF, we're trying to see how many proteins from that one organism we can pass through the various steps of the pipeline that are required [for] high-throughput structural genomics.

The other organism that we're currently working on is C. elegans. These are likely to be much more difficult proteins to express. They're more complex, but they're more representative of eukaryotic organisms, such as mouse and human [the specific organism]. Here, we are concentrating on proteins that are likely to have novel folds or at least have folds that cannot be predicted at present.

For proteins that we are really interested in, we can also look for homologues and orthologues in other organisms. But in structural genomics, you can't get too hung up on any one protein, because it's a numbers game. The goal, which the NIH suggests that we should be able to achieve, is, in year four [of the project], to produce 100 to 200 structures per year. That comes down to nearly one every working day. And within four to six weeks from the time we have finished refining a structure, we have to deposit it into the Protein Data Bank.
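The "nearly one every working day" estimate can be checked with a short calculation. The figure of 250 working days per year (50 weeks of 5 days) is an assumption made for this sketch; it is not stated in the text.

```python
WORKING_DAYS_PER_YEAR = 250  # assumption: 50 weeks x 5 days

# Yearly targets quoted above, converted to structures per working day.
rates = {target: target / WORKING_DAYS_PER_YEAR for target in (100, 200)}

for target, per_day in rates.items():
    print(f"{target} structures/year -> {per_day:.1f} per working day")
```

Under that assumption the upper target works out to 0.8 structures per working day, which matches "nearly one," while the lower target is closer to one every two or three days.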

That's what we're working towards and that's what we're trying to achieve. And since everything is deposited in the public domain, that information is accessible to everybody. Thus, the structures produced by structural genomics should enable the work of biologists, molecular biologists, and cell biologists worldwide.