Microsoft Research Delivers Tools to Help Accelerate Scientific Discovery

Workflow technology for scientists connects raw data to computing systems, facilitating research in oceanography, astronomy, environmental science and other disciplines.

Addressing an audience of prominent academic researchers today at the 10th annual Microsoft Research Faculty Summit, Microsoft External Research Corporate Vice President Tony Hey announced that Microsoft Corp. has developed new software tools with the potential to transform the way much scientific research is done. Project Trident: A Scientific Workflow Workbench allows scientists to easily work with large volumes of data, and the specialized new programs Dryad and DryadLINQ facilitate the use of high-performance computing.

Created as part of the company's ongoing efforts to advance the state of the art in science and help address world-scale challenges, the new tools are designed to make it easier for scientists to ingest and make sense of data, get answers to questions at a rate not previously possible, and ultimately accelerate the pace of achieving critical breakthrough discoveries. Scientists in data-intensive fields such as oceanography, astronomy, environmental science and medical research can now use these tools to manage, integrate and visualize volumes of information. The tools are available as no-cost downloads to academic researchers and scientists at

"Today, scientists can collect more data than ever before from the Internet, satellites, sensors and other resources," Hey said. "That deluge of information brings amazing research opportunities, but at the same time, our ability to process that data and make it meaningful has not kept pace. These tools help simplify the data-intensive end of research, so scientists can focus on analyzing results and making new discoveries."

Transforming a Discipline

Project Trident allows oceanographic researchers to manage the massive amounts of scientific data coming in from sensors, instruments, moorings, robots and cameras attached to fiber-optic cables on the ocean floor. The data will be used to better understand sediment flows, changes in temperature and salinity, earthquakes, undersea volcanoes and extreme life forms associated with seafloor hydrothermal vents, and to determine what data is needed to predict tsunamis.

Project Trident is currently being used by oceanographers at the University of Washington to support the Ocean Observatories Initiative (OOI), a seafloor-based research network sponsored by the National Science Foundation with thousands of sensors in the oceans of the Western Hemisphere. The amount of data coming in from these sensors is roughly equal to two simultaneous high-definition TV broadcasts going around the clock.

Project Trident is also being used by oceanographers at the Monterey Bay Aquarium Research Institute to support a data portal for a program funded by the Office of Naval Research designed to better understand typhoon intensification.

"In the ocean sciences we routinely work with complex multidisciplinary data sets, and the investigator often spends more time on the mechanics of finding and manipulating data than on the process of understanding what the data means," said James G. Bellingham, chief technologist, Monterey Bay Aquarium Research Institute. "Trident's workflow framework provides a graphical environment that hides much of the complexity from the user, letting scientists focus their intellectual energy on the data rather than the software."

In addition, astronomers at Johns Hopkins University are using Project Trident to support the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) project, which helps detect objects in the solar system that might pose a threat to Earth. The Pan-STARRS project uses an array of very powerful digital cameras to observe the entire night sky several times each month. Each of the cameras captures 1.4 gigapixels -- 200 times the resolution of a 7-megapixel consumer camera.

"This is an amount of raw data so large it's difficult to comprehend, much less work with," said Alex Szalay, Alumni Centennial Professor at Johns Hopkins University. "With Project Trident, we can essentially digest that tremendous data source directly into our supercomputers customized for data-intensive science, process it interactively and create complex statistical analyses to help us better understand what's going on in the universe."

Harnessing Technology for Science

Project Trident was developed by Microsoft Research's External Research Division specifically to support the scientific community. It is implemented on top of Microsoft's Windows Workflow Foundation, using the existing functionality of a commercial workflow engine based on Microsoft SQL Server and Windows HPC Server cluster technologies.

DryadLINQ is a combination of the Dryad infrastructure for running parallel systems, developed in the Microsoft Research Silicon Valley lab, and the Language-Integrated Query (LINQ) extensions to the C# programming language. Dryad was designed to simplify the task of implementing distributed applications on clusters of Windows-based computers. DryadLINQ is an abstraction layer that simplifies the process of implementing Dryad-based applications.

The DryadLINQ system automatically and transparently translates LINQ queries and executes them on large compute clusters using the Dryad execution engine. Because a DryadLINQ program can be written and debugged using standard .NET development tools, distributed computing on large clusters becomes simple for most programmers.
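The general idea of running one declarative query across many machines can be illustrated with a minimal sketch. It is written in Python rather than C#, and it is not the DryadLINQ API: the function names (partition, local_query, run_query), the sample sensor readings and the use of a thread pool as a stand-in for cluster nodes are all assumptions made for illustration. The sketch splits the data into partitions, runs the same filter-and-count query on each partition independently, and merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts slices (hypothetical stand-in for data spread over cluster nodes)."""
    return [records[i::n_parts] for i in range(n_parts)]


def local_query(part):
    """Per-partition stage: keep readings above 4.0 degrees C, then count them per sensor."""
    return Counter(r["sensor"] for r in part if r["temp_c"] > 4.0)


def run_query(records, n_parts=3):
    """Run local_query on every partition in parallel, then merge the partial counts."""
    parts = partition(records, n_parts)
    # A thread pool stands in for the cluster; a real system would ship work to other machines.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(local_query, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Invented sample data, for illustration only.
readings = [
    {"sensor": "A", "temp_c": 3.9},
    {"sensor": "A", "temp_c": 4.2},
    {"sensor": "B", "temp_c": 5.1},
    {"sensor": "B", "temp_c": 4.8},
    {"sensor": "C", "temp_c": 2.0},
    {"sensor": "A", "temp_c": 6.3},
]

result = run_query(readings)
```

The caller writes the query once, in local_query, and the result does not depend on how many partitions are used. That separation between what is computed and where it runs is the property the paragraph above attributes to DryadLINQ.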

Reducing Research Overhead

Project Trident combines gaming graphics with workflow technologies to create a powerful visualization tool that makes large-scale, complex scientific data not only easy to review and analyze, but also easy to manage, reproduce and share. It enables researchers to build experiments that formerly required heavy involvement from computer scientists. To give the solution enough "horsepower" to process very large data sets, Dryad and DryadLINQ allow Project Trident to be run on distributed systems or large compute clusters.

"With the addition of DryadLINQ, our ability to interpret data has finally caught up with our ability to collect it," said Roger Barga, a Microsoft researcher and principal architect for the new tools. "While it is not necessary to couple Project Trident with Dryad, the combination provides a powerful system for processing very large volumes of data."

The marriage of visualization and workflow technologies allows data analysis experiments to be developed visually as "workflows," similar to process workflows used in the business world. Whereas building such a system has traditionally required custom coding and weeks or months of development time, with Project Trident, senior researchers can do much of that upfront programming themselves in just hours or days.
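As a rough illustration of what treating an experiment as a workflow means, the following Python sketch chains named steps and records which steps ran, so the run can be reproduced and shared. It does not use Project Trident or Windows Workflow Foundation; the step names, the run_workflow helper and the sample values are hypothetical.

```python
def run_workflow(steps, data):
    """Run (name, function) steps in order, feeding each output to the next step.

    Returns the final result and the list of step names that were executed.
    """
    log = []
    for name, step in steps:
        data = step(data)
        log.append(name)
    return data, log


def drop_missing(values):
    """Remove readings that were not recorded."""
    return [v for v in values if v is not None]


def subtract_baseline(values, baseline=10.0):
    """Convert raw readings to anomalies relative to an assumed baseline."""
    return [v - baseline for v in values]


def average(values):
    """Reduce the series to its mean."""
    return sum(values) / len(values)


# The workflow is plain data: an ordered list of named steps.
workflow = [
    ("drop_missing", drop_missing),
    ("subtract_baseline", subtract_baseline),
    ("average", average),
]

# Invented sample readings, for illustration only.
raw = [10.0, None, 12.0, 14.0]
mean_anomaly, executed = run_workflow(workflow, raw)
```

Because the workflow is an ordered list of named steps rather than custom glue code, a step can be reordered, replaced or rerun without rewriting the others. A visual workflow editor exposes the same kind of structure graphically.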