Project to Improve Researchers' Access to Water Data

A collaboration among Microsoft, the U.S. Department of Energy’s Lawrence Berkeley National Laboratory and UC Berkeley is underway to develop a scientific data server for amassing and organizing water data from diverse sources, a system that will accelerate research in the increasingly important areas of water supply and climate change.

Called Microsoft e-Science, the project is part of the Berkeley Water Center’s effort to marshal expertise from public institutions and the private sector and support projects that enable researchers to easily access and work with water data. The 1-year-old center is the brainchild of Berkeley Lab’s Computational Research Division (CRD), UC Berkeley’s College of Engineering and UC Berkeley’s College of Natural Resources.

Local, state and federal governments have long collected detailed information about water supplies, such as river flows and the water content of the Sierra Nevada’s winter snows. They use the data to make allocation decisions for farms, businesses and residential consumers. However, these agencies use different methods to collect and archive the information, posing a challenge for scientists who need to retrieve and integrate all those data sets in order to carry out comprehensive analyses. The e-Science project strives to ease that headache for scientists.

The project team already has developed a prototype data server, which runs on Microsoft SQL Server 2005. The team is now testing the system by loading data about Northern California’s Russian River watershed from myriad agencies.

“Because of the differences in the data, the loading of each data file presents a new challenge, and matching data across different data sets is difficult,” said Deb Agarwal, head of CRD’s Distributed Systems Department and the Berkeley Water Center’s IT Advisor. “There is a perception that once the data is in an archive, science is enabled on a grand scale. But data availability is only the first step in the process.”

Agarwal and other project researchers already have demonstrated the prototype server to the scientific community. For example, at the FLUXNET Synthesis Workshop in Italy this month, project team members Matt Rodriguez, a CRD scientist, and Catharine van Ingen from Microsoft ran the data server to show how scientists could find and plot cross-network data in minutes, rather than days. The workshop data set comprised 400 site-years of data, most of which had not been used before in cross-site analysis. For the first time, workshop participants could spend more time exploring the data than collating it.

At the European Geosciences Union General Assembly 2007 in April, Agarwal, van Ingen and Dennis Baldocchi from UC Berkeley will discuss the server and their support of its users in a paper titled “A Next Generation Flux Network Data Server.”

Microsoft’s support is critical for the project because approximately 90 percent of the researchers accessing these data archives are working on Windows-based desktop computers. Van Ingen brings expertise from her work as an engineering professor and software expert, as well as an insider’s knowledge of where to turn for help within Microsoft.

Developing the prototype server was an important milestone for the project. To build it, the project team started with the data archive of the AmeriFlux network of 149 research towers located around the Americas. Using arrays of sensors, the towers provide continuous observations of ecosystem-level exchanges of CO2, water and energy, essentially recording how the ecosystem “breathes.” The AmeriFlux archive currently contains 192 million data points stored as hundreds of flat files. Researchers analyzing the data currently download their own copies for local analysis. Because the data is continually being updated and corrected, each researcher typically ends up with a different version.

Agarwal and two members of her staff, Rodriguez and Monte Goode, worked with van Ingen and another Microsoft expert, Stuart Ozer, to design the server so that the AmeriFlux data would be easier to use. The approach incorporated a database and a “data cube,” a type of database structure optimized for data mining.

“Although the long-term plan is to develop the server for use in gathering and understanding watershed data, using the AmeriFlux archive data gave us an excellent environment to use as we designed the server,” Agarwal said.

At the annual AmeriFlux meeting last October, Gretchen Miller, a UC Berkeley graduate student working with the Berkeley Water Center, demonstrated an AmeriFlux data cube, and Agarwal reported on the data server project. The presentation was followed by a discussion of the project’s next steps and a request for the e-Science team to support the FLUXNET meeting.

While developing the server is a major part of the project, the long-term goal is to develop a portable system that can be maintained by the researchers themselves.

“Right now we’re at the edge of computer science and research, where we are developing tools that we hope will make this data server a natural research tool, a kind of ‘collaborative data server in a box,’ for science,” Agarwal said.

Learn more about the Berkeley Water Center at its Web site.

The Distributed Systems Department is part of the Computational Research Division at LBNL. Berkeley Lab is a U.S. Department of Energy national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California. Learn more at its website.