SC04 Challenge: Do Real Science with Cycle Scavenging

The Cornell Theory Center (CTC) is sponsoring a contest at SC04 to see who can calculate the most protein structure alignments using a cycle-scavenging application provided on our Web site at http://www.tc.cornell.edu/sc2004/biocontest/ or at our booth on the show floor. Identified by email address, the user who performs the most work will earn a Toshiba tablet PC. Cluster users are welcome to participate, although they are not eligible to win. “The scope of the whole problem [described below] would require significant computational resources,” said Senior Research Associate Jarek Pillardy. “Wrapping the original code in C# to use Web services, we have taken advantage of the problem’s highly parallel data decomposition to gain access to available desktop resources.” The problem requires the binary comparison of each subchain in 27,000 protein structures. On average, a single comparison takes less than five minutes and is the task unit for cycle-scavenging. A Windows Forms application is used to retrieve tasks and protein structures from a Web service running in Ithaca, New York, on CTC’s servers. That Web service uses ADO.NET to contact a SQL Server database that manages tasks and results. This well-known Web services architecture allows rapid transformation of a traditional code and an enormous problem into a simple cycle-scavenging solution. To enter the contest, visit http://www.tc.cornell.edu/sc2004/biocontest/ or stop at CTC in booth #2235. Directions for the contest are available on the Web. To compete, contestants can choose to complete one job at a time or process continual work, all with the click of their mouse. A second click will end the process. The Science: Protein Structural Alignment in the Post-Genomic Era One of the computational challenges in the post-genomic age is determining the shape of proteins, which are the products of genes. In recent years a vast body of genomic data has been accumulated, yet understanding their function is mostly unknown. The three-dimensional structure of a protein is one of the clues that enables scientists to decipher its function. Unfortunately predicting protein structure from the sequence is not an easy task. One of the common techniques used is comparison of a proposed shape to a database of known shapes. This technique is called structural alignment, which is a valuable tool for comparison of proteins in the so-called “twilight zone,” where the relationship between proteins cannot be detected through sequence alignment. These proteins have gone through many mutations, insertions and deletions during their evolution and are no longer recognizable through their sequence identity, but often their structure is fairly well preserved. A good example is a comparison of myoglobin (1mba) and leghemoglobin (1bin:A): they share only 14.3% of sequence identity, while their structures are remarkably similar (average deviation of 2.9Å, see Pict. 1). Protein structure comparison plays a key role in protein structure prediction, fold family classification, motif finding, and phylogenetic tree reconstruction and more. The biggest database of protein structures is the Protein Data Bank (PDB); it contains approximately 27,000 structures (as of September 2004). For many applications, it is essential to know the similarity of these structures, i.e. to cluster the database by structural alignment. However, in order to do so, one needs to compute the structural comparison for all different pairs of proteins from the database, and there are approximately 3.6×108 of such pairs. Even at a speed of 1 second per pair calculation it requires 100,000 hours (or 13 months) of calculations. The average speed of pair comparison at present is about 30 seconds, which leads to an estimated 3 ½ years of calculations. A structure comparison procedure superimposes fragments of two proteins’ atomic coordinates so that their root mean square deviation (RMSD) is minimal. Choosing the right pieces is a combinatorial problem. One of the best algorithms known is Combinatorial Extension (CE) developed by Shyndyalov and Bourne (http://cl.sdsc.edu/ce.html). The algorithm involves a combinatorial extension of alignment path defined by a sequence of aligned fragment pairs. The fragments are a fixed length chosen according to local geometry. A combination of fragments with gaps represents a possible continuous path. A fragment path is selected or discarded according to several distant measure filters. The longest alignment path is next evaluated for statistical significant. This is done by evaluating the probability of finding the alignment path of the same length with at most the same number of gaps by chance. Finally an optimization procedure is performed for an alignment with the score above a threshold, which results in up to 2Å improvement in the RMSD measure. The optimization is done by relocating gaps up and down a fragment.