BIG DATA
Unpacking Protein Function
unravel life’s most important molecules
Proteins make our bodies work — they are the movers and shakers of the organism, breathing, digesting, building, and synthesizing energy. But when proteins go wrong, they can turn from cogs in the wheel of life to triggers of disease.
Although the human genome has been sequenced, the three-dimensional structures of only a small fraction of the billions of proteins found in all living organisms have been determined, mostly by x-ray crystallography. Fortunately, this rapidly expanding corpus of protein data constitutes a test-bed for scientists to develop the computational technology and predictive tools that will help us learn about the proteins whose structures remain unknown.
“The largest unsolved problem in the computational biology of proteins is how to take a protein sequence and accurately predict the three-dimensional structure it folds into,” said Nick Grishin, professor of biophysics at The University of Texas Southwestern Medical Center and a Howard Hughes Medical Institute investigator. “These structures are functional as three-dimensional bodies — they bind some molecules, they catalyze reactions — and knowing that structure is extremely important for the functional understanding of proteins.”
Grishin is pursuing this “unsolved problem” of protein structure prediction using Ranger, one of the world's most powerful supercomputers, at the Texas Advanced Computing Center (TACC). His computations have led to new insights on protein activity, potential disease cures, and a deeper understanding of how humans evolved into such complex creatures.
Competing Predictions
The structure of the largest target protein (T0487, Argonaute protein) in the recent CASP competition (Critical Assessment of Techniques for Protein Structure Prediction, round 8) [1] was successfully predicted from sequence alone. For one of the domains of this structure (domain 4, orange), the quality of the team's prediction was far beyond that of the 200 other competitors worldwide [2]. [see below for references]
Grishin’s research efforts on Ranger got off to a promising start shortly after the system debuted in February 2008. Together with David Baker, professor of biochemistry at the University of Washington, Grishin entered the eighth biennial CASP (Critical Assessment of Techniques for Protein Structure Prediction) competition to predict the structures of a number of unknown proteins. The contest pits algorithm against algorithm, as researchers worldwide battle to predict a protein’s tertiary structure from its primary sequence alone. Each prediction is then verified against an experimental structure determined in the laboratory.
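To give a feel for what "verified against an experimental structure" can mean in practice, here is a minimal sketch of one simple comparison, the root-mean-square deviation (RMSD) between matching atoms. This is an illustration only: CASP assessors use more elaborate measures, and the function name, the toy coordinates, and the assumption that both structures are already superimposed are not from the article.

```python
import math


def rmsd(predicted, experimental):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumption: both structures list the same atoms in the same order
    and have already been superimposed on each other.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("structures must be non-empty and the same length")
    total = 0.0
    for (px, py, pz), (ex, ey, ez) in zip(predicted, experimental):
        # Squared distance between the predicted and the measured atom.
        total += (px - ex) ** 2 + (py - ey) ** 2 + (pz - ez) ** 2
    return math.sqrt(total / len(predicted))


# Toy coordinates in angstroms (hypothetical, for illustration only).
experimental = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (1.5, 1.0, 0.0), (3.0, 0.0, 0.0)]

print(f"RMSD: {rmsd(predicted, experimental):.3f} angstroms")
```

A lower value means the predicted atoms sit closer to their measured positions; a perfect prediction would score zero.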
“David Baker has the best program in the world to refine three-dimensional structures in space [known as Rosetta], and we think we have a good program to align sequences with existing structures to determine how to model them,” Grishin said. “We collaborated, and for a whole summer, ran those predictions.”
Baker and Grishin’s protein structure prediction took top honors in the ‘most difficult to predict’ category, demonstrating the strength of their method and the power of petascale supercomputers. “Ranger is partly responsible for our victory,” Grishin said, “because the more computations you run, the better your predictions are.”
Assisting Physicians
The algorithms that won the CASP competition are also applicable in the biomedical realm. As a faculty member at UT Southwestern Medical Center, Grishin is often approached by physicians and scientists who have discovered protein-based health problems in their clinical work. He applies the same proven advanced computational methods to these medical mysteries to find novel answers.
One notable success involved a collaboration between Grishin and Nobel laureates Michael Brown and Joseph Goldstein (both at UT Southwestern) to help answer an important question regarding obesity. Brown and Goldstein were studying a hormone called ghrelin, secreted by the stomach, that stimulates appetite. For ghrelin to work, however, the “hunger hormone” needs to be modified with a lipid residue attached by another enzyme, which no one had ever identified.
“It’s very difficult to find a new enzyme. You know that the activity is taking place, but you don’t know what molecule does it,” Grishin explained. “In a case like this, you can do a lot of genetic experiments to find out which protein is responsible, or you can suggest a hypothesis and try to look for it computationally. That’s what we decided to do.”
Goldstein hypothesized that the unknown enzyme belonged to a family of proteins whose members displayed some similar behaviors. Using that evolutionary insight, Grishin searched the human genome for homologs (proteins with similar structures due to ancestry) and narrowed the list of possible proteins from 30,000 down to 15. These proteins were tested in the lab, and, after 14 negative results, the enzyme responsible for priming the appetite was revealed.
“That was an important discovery because it has implications for how we treat obesity,” Grishin said. “We can now develop inhibitors to that particular protein so it won’t stimulate appetite. That was one of our collaborations where the computational approach really paid off.”
Though the conditions he studies may differ (heart disease, liver ailments, obesity), the goal is always the same. “The end product of this research is to cure disease. Most diseases occur either because a protein is defective, or it does something too well for its role in the organism,” Grishin said. “To cure this mechanism, we need to understand what is wrong with the protein, and for that, we need to know its three-dimensional structure. Then, we can generate hypotheses and suggest what to do next: either make a different protein, or add some inhibitor to the organism, like a drug, to prevent this disease from happening.”
Developing Methods for Widespread Assessment
One of Grishin’s key insights was the realization that sequence analysis alone doesn’t provide enough information for identifying protein similarities. A number of other factors, including structure, function, and evolution, need to be considered to make the search more robust.
Adding factors leads to better solutions, but it also turns the problem into a numerical nightmare. With almost 500 interacting parameters, Grishin’s robust method had an optimization problem that only a supercomputer could solve.
The world of protein structures at a glance. Representative proteins (dots in C) are mapped onto a 2D plane by the geometric similarity of their structures. The resulting distribution of structures (A, B) shows both discreteness and continuity: the protein space is organized in distinct, highly populated "mountains", with sparsely populated "valleys" in between. The mountains correspond to major prototypes of thermodynamically stable protein structures (classes and folds), with a few connections between them [3]. [see below for references]
Turning to Ranger, Grishin parameterized the multi-faceted search framework, determining the weight of each factor being considered. He then thoroughly tested the methodology and workflow and streamlined the algorithm enough that non-HPC users, as well as those with access to larger systems, could run it.
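The idea of weighting several factors can be illustrated with a minimal sketch. It is not Grishin's method: the four factor names, the weights, and the candidate scores below are all hypothetical, and the real framework involves almost 500 interacting parameters tuned on Ranger rather than four hand-picked numbers.

```python
def combined_score(factors, weights):
    """Weighted sum of per-factor similarity scores, each between 0 and 1."""
    return sum(weights[name] * factors[name] for name in weights)


# Hypothetical weights; in the article these are what the optimization tunes.
weights = {"sequence": 0.4, "structure": 0.3, "function": 0.2, "evolution": 0.1}

# Hypothetical similarity scores of two candidates against a query protein.
candidates = {
    "protein_a": {"sequence": 0.2, "structure": 0.9, "function": 0.8, "evolution": 0.7},
    "protein_b": {"sequence": 0.6, "structure": 0.3, "function": 0.2, "evolution": 0.1},
}

# Rank candidates from most to least similar under the combined score.
ranked = sorted(
    candidates,
    key=lambda name: combined_score(candidates[name], weights),
    reverse=True,
)

for name in ranked:
    print(name, round(combined_score(candidates[name], weights), 2))
```

In this toy case protein_b has the higher sequence similarity, yet protein_a ranks first once structure, function, and evolution are taken into account, which is the kind of relationship a sequence-only search would miss.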
By doing the compute-intensive work in advance on Ranger (using more than 250,000 computing hours on this project, and several million more for other experiments), Grishin is able to multiply the impact of his research a thousand-fold by offering his framework freely to researchers anywhere in the world. When Grishin’s 30-node COMPASS server goes online at UT Southwestern in July, scientists and physicians will be able to send their proteins for remote computational analysis and structure prediction. Even those with limited computational experience will be able to derive insights from Grishin’s methods, and the pool of protein knowledge will grow.
Looking Backward and Forward
Grishin’s pragmatic structure prediction research also raises questions of a more existential nature: How did we evolve the incredibly diverse and complex array of proteins that exists today? And is it possible to engineer proteins that have never occurred in nature?
“We now have billions of different proteins in all organisms, but if we think back, there should be a smaller number of prototypes — maybe 500 to 1000 ancestral proteins that were there when life began,” Grishin said.
Following the lineage of proteins back to their origins reveals the incredible transformations that occur when a mutation alters the structure and function of a molecule. “We see that protein structures can change in quite dramatic ways. They can rearrange, for example, from an all beta-strand structure to all alpha-helical structure with only a few mutations,” said Grishin. “That’s revolutionary for the field: the idea that protein structure can change. We found that even small mutations in a protein, like gradual substitutions, at some point cause a catastrophe, and the structure completely rearranges itself. We try to follow that backwards and see the mutation that actually causes the big structural change.”
Meanwhile, in another piece of cutting-edge research, Grishin is engineering a protein unlike any ever seen before — a challenging project that could have major ramifications for medicine. “Artificial proteins that are not similar to anything in nature might be very useful for practical applications,” Grishin explained. “The proteins could offer very good templates and scaffolds to provide functions in the organism and deliver different agents to parts of the body, or serve there in place of proteins that are defective in a particular organism.”
Having completed computations on Ranger to design several original protein structures, his team is currently trying to clone and express the proteins that they’ve designed to prove that computational prediction can be useful even for proteins that do not yet exist.
With so many diverse research projects all happening at once, Grishin appears to have a finger in every pie. But there’s a simple reason for this, he says.
“Systems biology, where you put different lines of evidence together, really results in transformative outcomes,” Grishin said. “You can’t figure these problems out with just one little project. You need to have many different types of information put together to analyze proteins and attack on all possible fronts.”
With Ranger at his side, Grishin is leading that assault.
******************************************************************************************************************************
For more information, see the July 2009 volume of Proteins: Structure, Function, and Bioinformatics, which will be devoted to the CASP competition, and the Web Server issue of Nucleic Acids Research in July 2009.
References:
1. S. Raman, R. Vernon, J. Thompson, M. Tyka, R.I. Sadreyev, J. Pei, D. Kim, E. Kellogg, F. DiMaio, O. Lange, L. Kinch, W. Sheffler, B. Kim, R. Das, N.V. Grishin, and D. Baker. Structure prediction for CASP8 with all-atom refinement using Rosetta3. Proteins (2009), in press.
2. Shuoyong Shi, Jimin Pei, Ruslan I. Sadreyev, Lisa N. Kinch, Indraneel Majumdar, Jing Tong, Hua Cheng, Bong-Hyun Kim, and Nick V. Grishin. Analysis of CASP8 targets, predictions and assessment methods. Database: The Journal of Biological Databases and Curation (2009), in press.
3. R.I. Sadreyev, B. Kim, and N.V. Grishin. Discrete-continuous duality of protein structure space. Curr. Opin. Struct. Biol. (2009), in press.
Aaron Dubrow, Texas Advanced Computing Center
Science and Technology Writer