Researchers develop new way to decode large amounts of biological data

Finding could radically speed up understanding of genomic information

In recent years, the amount of genomic data available to scientists has exploded. With faster and cheaper sequencing techniques increasingly available, hundreds of plants, animals and microbes have been sequenced. However, this ever-expanding trove of genetic information has created a problem: how can scientists quickly analyze all of this data, which could hold the key to better understanding many diseases and to solving other health and environmental problems?

Now, two researchers have developed an innovative computing technique that, on very large amounts of data, is both faster and more accurate than current methods. To spur research, a program using this technique is being offered for free to the biomedical research community.

"This is a whole new approach, with multiple opportunities for further development," said Andrew F. Neuwald, PhD, Professor of Biochemistry & Molecular Biology at the Institute for Genome Sciences (IGS) at the University of Maryland School of Medicine.

A description of the new method was published today in PLOS Computational Biology. Dr. Neuwald collaborated on the work with Stephen F. Altschul, PhD, a senior investigator at the National Center for Biotechnology Information at the National Institutes of Health.

Genomic sequence data encodes information regarding the structure and function of proteins, which comprise the basic cellular machinery and thus determine the structure and function of all microbes, plants and animals.

The new program is called GISMO, an acronym for "Gibbs Sampler for Multi-Alignment Optimization". Gibbs sampling, a statistical technique for solving highly complex problems, is a central feature of the approach. In this case, sampling is used to find biological signals - relevant patterns that can help scientists better understand how organisms work. Neuwald says the approach improves upon conventional sequence alignment programs, which, unlike GISMO, can easily mistake random patterns in the data for biologically valid signals.
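To give a feel for how Gibbs sampling can pull a shared pattern out of a set of sequences, here is a minimal sketch of a classic gapless motif site sampler. It is not GISMO's algorithm, which the article does not spell out. The function name `gibbs_site_sampler`, its parameters, and the simplification of treating all background letters as equally likely are assumptions made for illustration only.

```python
import random


def gibbs_site_sampler(seqs, width, sweeps=200, seed=0, pseudocount=1.0):
    """Toy Gibbs sampler: guess one motif start position per sequence.

    Illustrative only. Assumes every sequence is at least `width` long and
    a uniform background, so candidates are weighted by model probability alone.
    """
    rng = random.Random(seed)
    alphabet = sorted(set("".join(seqs)))
    # Start from a random motif position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for i, seq in enumerate(seqs):
            # Build per-column letter counts from all sequences except the current one.
            counts = [{a: pseudocount for a in alphabet} for _ in range(width)]
            for j, other in enumerate(seqs):
                if j != i:
                    for col in range(width):
                        counts[col][other[starts[j] + col]] += 1
            totals = [sum(c.values()) for c in counts]
            # Score every candidate position in the held-out sequence against the model.
            weights = []
            for pos in range(len(seq) - width + 1):
                w = 1.0
                for col in range(width):
                    w *= counts[col][seq[pos + col]] / totals[col]
                weights.append(w)
            # Sample (rather than greedily pick) a new position in proportion to its score.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


if __name__ == "__main__":
    toy = ["TTGACGTAAC", "CCACGTAGGT", "GGTTACGTAC", "ACGTATTTGG"]
    print(gibbs_site_sampler(toy, width=5, sweeps=50))
```

Sampling a position in proportion to its score, instead of always taking the best one, is what lets this kind of sampler escape a poor early guess.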

Current widely used methods typically compare each sequence to every other sequence; this takes a prohibitively long time to compute for sets of a hundred thousand or more related protein sequences, which are now available for analysis. Neuwald describes these methods as "bottom up." He and Dr. Altschul developed a technique that is "top down": instead of comparing sequences to each other, it compares each sequence to an evolving statistical model. This approach is not only faster, but also better at finding biologically relevant signals, which can, for example, help researchers unravel the mechanisms underlying cancer and inherited diseases. Its speed advantage over other methods grows as the data set gets larger.
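The scaling argument can be made concrete with a little arithmetic. The sketch below only counts comparisons: all pairs for a "bottom up" method, and one sequence-to-model comparison per sequence per pass for a "top down" method. The function names and the figure of 100 passes are hypothetical. The article does not state GISMO's actual running time, and a single comparison need not cost the same in both approaches.

```python
def pairwise_comparisons(n: int) -> int:
    """Comparisons needed when every sequence is compared to every other one."""
    return n * (n - 1) // 2


def model_comparisons(n: int, sweeps: int) -> int:
    """Comparisons needed when each sequence is scored against one shared model per pass."""
    return n * sweeps


if __name__ == "__main__":
    sweeps = 100  # hypothetical number of passes over the data
    for n in (1_000, 100_000):
        print(n, pairwise_comparisons(n), model_comparisons(n, sweeps))
```

Under these assumptions the pairwise count grows with the square of the number of sequences while the model-based count grows linearly, so the gap widens as the data set grows.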

Dr. Neuwald has a varied background in molecular biology, computer science and Bayesian statistics, and has been working on this technique for years. Dr. Altschul, whose formal training is in mathematics, was the first author on two landmark publications describing the popular sequence database search programs BLAST and PSI-BLAST. The two researchers confirmed GISMO's superior performance on large, diverse sequence sets by testing it against five widely used conventional methods. Dr. Neuwald is excited about GISMO's potential: "Because researchers have been finding ways to speed up and improve conventional methods for decades and because GISMO takes such a new and different approach, I am confident that we can make GISMO even faster and more accurate going forward."
