Andrew Su Sets Out to Scale Mountains of Data

By Mark Schrope

Over the last century, scientists have often followed a simple path. A researcher studied a narrow, defined field, published papers on findings and kept up with all the other key discoveries within this area. But in today’s data-rich fields such as genomics, there is simply too much to follow. Researchers require ever more sophisticated tools to find what they need. Creating these tools and applying them effectively are among the most important modern scientific challenges.

Andrew Su, who earned a PhD at The Scripps Research Institute (TSRI) in 2002 and returned in 2011 as an associate professor in the Department of Molecular and Experimental Medicine, dedicates his time to these goals. He and his colleagues develop tools for researchers around the world and devise ways to glean information from existing datasets for transformative insights about basic biology and disease.

Raised in San Jose, CA, and drawn to computers, Su was a natural candidate for a life focused on technology. His father was an electrical engineer, so he had plenty of computer access. But recognizing the potential in himself for some form of screen-based addiction, he consciously avoided computer gaming, opting instead for outdoor pursuits such as tennis, rock climbing and backpacking.

It wasn’t until he went to Northwestern University as an undergraduate that Su got serious about computers and programming. From the start, he was most attracted to the idea of harnessing technology to accomplish specific goals.

By the time Su was ready for graduate studies in 1998, his hometown was the epicenter of the dot-com boom, and associated technological leaps were spilling into science. Biologists were generating mountains of data, and it didn’t take long before people started wondering how they could process it all effectively.

Su was drawn to genomics, and nowhere were the mountains of data higher. “I walked into it at the right time,” he said. “There had been a quantum leap in terms of the amount of data scientists were generating, and they needed new techniques to process all that.”

Bioinformatics for Research Scientists

In the graduate program at TSRI, working under chemist Peter Schultz, Su began splitting his time between hands-on biology and computational analysis. He did some experimental work characterizing the mechanisms of potential drug candidates, but also became involved with setting up a “gene atlas” to delineate the locations in the body tissues of mice and humans where specific genes were activated.

Within a few years, it became clear to both Su and Schultz that his talents could best be applied on the computing side, developing the techniques for pulling information from large datasets to enable discoveries that might otherwise be missed. “It just became apparent that it was more productive for me to work with people who had generated really high-quality data rather than spend the time to generate it myself,” said Su.

Nonetheless, those initial experimental years would provide critical context. “Andrew’s training as an experimentalist gave him a real appreciation for the tools that are needed,” said Schultz. “He is rather unusual in this regard, and this is why his tools are so widely appreciated and used. Andrew is a research scientist’s bioinformatician.”

In addition to performing analyses of the gene atlas data, Su set out to create a searchable online database. The project’s leader was pushing the idea of making the atlas open to other research groups, and it was Su’s efforts that would make that possible. The motivation was simple, if a bit unorthodox at the time. “This data set was large enough and interesting enough that no matter what we did to it, those efforts were going to be dwarfed by the opportunities other people had to mine it,” said Su.

In other words, the most benefit from their work would come by setting the data free so the scientific community could work on it collectively. The experiment worked, and to date, the dataset has been cited by other researchers more than 2,000 times.

Tapping into Collective Wisdom

After completing his PhD at TSRI in 2002, Su joined the Genomics Institute of the Novartis Research Foundation and there, among other efforts, began to build on the gene atlas success to create BioGPS. BioGPS is a publicly accessible web tool: researchers enter one or more genes they are interested in and get back a wealth of information. Its core library combines data from many existing databases.

BioGPS is far from a simple catalog. Hundreds of computer tools are available for searching information about genes, but these tend to be limited to specific types of information. The motivation for BioGPS was to bring all the information together in a single portal. Users can choose from hundreds of optional plugins that display different slices of gene information to meet specific needs in fields such as molecular biology or immunology.

Like most of Su’s work, BioGPS taps the power of crowdsourcing—using the collective wisdom of a community to improve a product. In this case, the tool is set up so that individual users can easily add new plugins to accomplish specific goals that others might also find beneficial. And BioGPS collects such information as which plugins are most popular among certain types of users, so this information can guide those new to the web tool.

Today, tens of thousands of researchers are using BioGPS, leading to about 2 million website hits per year. A recent request for testimonials in support of a new round of funding for BioGPS drew high praise. One user called it “the best portal on the Web,” another, “an essential tool in my scientific toolbox.”

While assembling the BioGPS database, Su realized his team could put much of the information into an even more easily accessible form, a project he refers to as the Gene Wiki. Researchers in the Su lab programmed a system that exported the BioGPS data on an individual gene into an article for Wikipedia, the massive free crowdsourced online encyclopedia. Wikipedia already has a well-developed system for incorporating user contributions, so biologists can and do easily add relevant new details to the articles, expanding and enhancing the resource. Today, there are more than 10,000 articles on human genes in Wikipedia, and collectively they are viewed over 4 million times per month.

Playing Games

But even with the success of BioGPS and Gene Wiki, Su continued looking for ways to tap biologists’ collective wisdom. Recently, Su’s team began looking toward gaming. When Su avoided computer games in his younger years, it wasn’t because he didn’t like them; he was wary of their powerful pull. Globally, people spend an estimated 150 billion hours playing games each year. “If we can harness even a sliver of that gaming time,” said Su, “we would have a huge resource to do something productive.”

As part of a growing movement known as “gamification”—using games to accomplish concrete tasks—Su’s team has created two games.

One prototype game, called Dizeez, can establish connections between specific genes and diseases (see http://genegames.org). Designed for people with a background in the field, the game names a disease and offers five gene choices. Players choose the best association and they score points for correct answers as judged according to information in the BioGPS database. In this forum, players effectively confirm or discount suspected associations. The Su lab has also mined the game playing logs to flag novel associations that are not already established, or at least not yet recognized, in biomedical databases.
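
The scoring and log-mining steps described above can be illustrated with a minimal Python sketch. This is not the Su lab's code: the disease and gene names, the `KNOWN` reference set, the `score_answer` and `flag_candidates` functions, and the `min_votes` threshold are all hypothetical placeholders, chosen only to show how answers might be scored against a reference database and how repeated "wrong" picks might be flagged as candidate associations.

```python
from collections import Counter

# Hypothetical stand-in for a reference database of known gene-disease
# associations. These pairs are placeholders, not real BioGPS data.
KNOWN = {("disease_A", "GENE1"), ("disease_B", "GENE2")}


def score_answer(disease, gene, known=KNOWN):
    """Return 1 point if the chosen pair is in the reference set, else 0."""
    return 1 if (disease, gene) in known else 0


def flag_candidates(play_log, known=KNOWN, min_votes=2):
    """Count picks that are absent from the reference set and return
    those chosen at least min_votes times (an assumed threshold)."""
    counts = Counter(pair for pair in play_log if pair not in known)
    return {pair: n for pair, n in counts.items() if n >= min_votes}


# A toy play log: each entry is one (disease, chosen gene) answer.
play_log = [
    ("disease_A", "GENE1"),
    ("disease_A", "GENE3"),
    ("disease_A", "GENE3"),
    ("disease_B", "GENE4"),
]

total = sum(score_answer(d, g) for d, g in play_log)
candidates = flag_candidates(play_log)
print(total)
print(candidates)
```

In this toy log, only the first answer matches the reference set, while the pair picked twice is the only one that reaches the assumed threshold and would be flagged for a closer look.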

Hundreds of people have already played Dizeez more than 1,000 times. Like others, Su has found that getting people to play a game is much easier than convincing them to fill out a boring survey. His team has experimented with prizes. But, said Su, “Those tangible things in real life pale in comparison to the real-time high of succeeding in the game environment. That’s what makes it so powerful.”

A second game, called The Cure, came about in response to a competition launched by a group called Sage Bionetworks to see who could come up with the best system to predict which genes determine the ultimate prognoses of actual breast cancer patients. In the game that the Su team created, players chose from a list of possible gene types to create what they felt was the best hand of genes to predict the severity of the patient’s disease. The players’ collective votes, in the form of the hands they chose, led to correct assessments 70 percent of the time. The winning computer algorithm was only 2 percent better.

Big Opportunities

As much collective wisdom as there is to unleash in biology, there’s a great deal that remains undiscovered—in some cases only because existing data has been too voluminous to probe fully. So another key focus in Su’s lab is on mining promising datasets for new discoveries. This typically involves tweaking and expanding other tools to accomplish specific goals.

Since arriving at TSRI, Su has begun collaborating with a variety of research groups to analyze their genetic data in order to better understand basic biology and diseases such as cystic fibrosis and osteoarthritis—work that could uncover starting points for new treatments.

“I enjoy the diversity of biological problems we get to tackle,” said Su. “To work on cancer one day and osteoarthritis the next day and protein folding the day after keeps it all very exciting.”

Despite the diversity of projects and the variety of approaches his lab has taken, Su sees a common theme and many years of work ahead: “There are a lot of big opportunities in harnessing all the knowledge of the scientific community in a way that everybody benefits.”

Send comments to: press[at]scripps.edu
