BIG DATA
Managing the Data Deluge
- A new era of data-driven scientific discovery is on the horizon. To enable these investigations, TACC has added a powerful data-collection and applications system, nicknamed Corral, to house and make accessible massive digital collections relating to science, engineering, and the humanities.
- The University of Alaska’s Museum of the North uses Corral to archive and share plant specimens from its Herbarium's vast collection. These specimens, which shed light on the effects of climate change, are now much more widely available than ever before.
- With over 1.2 petabytes of storage, data-transfer speeds up to 6 gigabytes a second, and direct connectivity to TACC’s other computational resources, Corral puts TACC on the cutting edge of data-driven science.
In the 2008 article “The End of Theory,” Wired editor-in-chief Chris Anderson predicted that in the near future, computers would cull immense databases of observational measurements to produce new insights about the natural world. This data deluge, he argued, would make the scientific method obsolete.
“Instead of coming up with theorems and running the simulations to see if they actually fit the observations, you take very large amounts of data collected from the real world and perform analyses on that data to derive the mathematics behind it, or apply statistical methods to attempt to understand various phenomena,” explained Chris Jordan, senior operating systems specialist, responsible for data infrastructure at the Texas Advanced Computing Center (TACC).
This kind of discovery requires a central repository for massive data collections — satellite scans of Earth, digital negatives of plants — that can be accessed and manipulated, compared and reviewed, easily, and from any location.
TACC’s newest resource, Corral, which debuted on March 31, 2009, is just that.
With 16 Dell server nodes and 1.2 petabytes of DataDirect Networks storage — twice the space required to hold the entire Netflix DVD catalogue, and four times larger than any current data-collection resource on the TeraGrid — Corral will effortlessly handle the challenges and opportunities of data-driven science.
“We’re ahead of the curve in terms of providing this kind of dedicated data collection and application resource,” Jordan said. “A lot of other sites are doing data collections, but very few sites are providing this kind of universally accessible, unified resource.”
Besides the drive for new, data-driven scientific methods, there were several other reasons to build Corral, says Jordan. First, larger and more sophisticated computing and visualization systems, like Ranger, Lonestar and Stallion, create tremendous amounts of data that need to be properly stored and managed. Second, traditional data-collection repositories, such as museums and physical archives, are being revamped and re-imagined for the 21st century. Corral addresses the need for the digital preservation of important documents and specimens, while allowing archives to share their materials more broadly than ever before.
With data-transfer capabilities approaching the blazing rate of six gigabytes per second, Corral adds a dynamic new element to TACC’s high-performance computing arsenal. Corral's built-in services and software tools allow researchers to derive the greatest insights from their data, and the system’s direct connectivity to TACC’s other computational resources, including Ranger (one of the most powerful supercomputers in the world) and Stallion (the world’s largest tiled visualization display), limits time wasted in data transfer.
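To put those figures in perspective, the short Python sketch below estimates how long it would take to move a collection at Corral's quoted peak rate. It is a back-of-the-envelope illustration only: it assumes decimal units (one gigabyte is 10^9 bytes) and a transfer sustained at the full six gigabytes per second, which real workloads may not reach. The function name and constants are hypothetical and are not part of any TACC tool.

```python
# Back-of-the-envelope transfer-time estimate.
# Assumptions (not from TACC): decimal units, and a transfer that
# sustains the quoted peak rate for its whole duration.

TERABYTE = 10**12
GIGABYTE = 10**9

PEAK_RATE_BYTES_PER_SECOND = 6 * GIGABYTE  # "six gigabytes per second"


def transfer_time_seconds(size_bytes, rate_bytes_per_second):
    """Return the ideal time in seconds to move size_bytes at a constant rate."""
    if rate_bytes_per_second <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_second


# Corral's full 1.2 petabytes, written as 1,200 terabytes to stay in integers.
full_system = transfer_time_seconds(1200 * TERABYTE, PEAK_RATE_BYTES_PER_SECOND)

# A five-terabyte collection, the approximate size of the Alaska datasets.
small_collection = transfer_time_seconds(5 * TERABYTE, PEAK_RATE_BYTES_PER_SECOND)

print(f"Full system: {full_system / 3600:.1f} hours")
print(f"Five-terabyte collection: {small_collection / 60:.1f} minutes")
```

Under these assumptions, even the peak rate implies more than two days to move the full 1.2 petabytes, which is one way to see why keeping data next to the computing systems matters.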
Since Corral went online in March, six diverse projects, from archeology to engineering, have begun using its capabilities to extend, and even generate, new scientific applications.
Alaskan Archives — Plants, Whale Songs and Dinosaur Fossils
One of those projects, the Alaska Herbarium (ALA), developed by the University of Alaska’s Museum of the North, preserves and circulates plant specimens from the region. Through a grant from the National Science Foundation, the museum is in the process of digitizing all of its specimens.
“As you may be aware, Alaska is fairly difficult to get to,” said Steffi Ickert-Bond, assistant professor of Botany and curator of the University of Alaska Museum of the North Herbarium. “Even within Alaska it’s quite hard for rural communities to come to Fairbanks. In our effort to digitize the specimens and make them available over the Internet, we hope to engage researchers from all over the world to see what we have in the herbarium.”
The herbarium's 230,000 specimens of Arctic biodiversity have been diligently collected over the decades. However, until recently, the collection was confined to the museum’s home archive.
“Data storage had been a real problem for us,” Ickert-Bond admitted. “When we started this project, an open web repository of biological images called MorphBank was going to support all of our images at the Florida State University in Tallahassee. But they weren’t prepared for the masses of images that we were producing. It became apparent very quickly that they couldn’t handle our storage and web-accessibility requirements.”
The specimen files — high-resolution photo negatives required for fine-grained comparison — were large, and the museum needed not just a site to store the data, but a way to make the images available to the scientific community.
Enter Corral, the only system currently capable of doing the job. “We came up with a solution through the TeraGrid, of which TACC is a partner, and they were just incredible,” said Ickert-Bond. “We’re now able to serve all of our images online, both in digital negative format and as JPEGs, with tape backup storage at the San Diego Supercomputer Center. Within seconds of taking an image, we’re able to store the image and make it available for viewing by the public. It’s been a fruitful collaboration.”
As a unique plant record spanning decades, the herbarium’s collection serves as an important benchmark for global climate change, drawn upon by researchers around the world. “We’re already experiencing some of these changes,” Ickert-Bond said. “There is an up-migration of shrubbery that is affecting native flora; invasive plants are spreading. Because of climate change, we have had an influx of certain pests that are affecting our native flora. It’s happening all over, and we’re in danger of not even knowing what we’ve lost.”
Only six months into the collaboration, all of the herbarium’s digitized materials are archived on Corral and available to researchers via the Arctos portal, which pulls the files across TACC’s high-bandwidth network.
Oxytropis vassilczenkoi subsp. substepposa Jurtzev, Western Beringian endemic legume (Fabaceae) collected in Russia, Western Chukotka, Mount Nagleynyn, SW shore of Chaun Bay, Vetrechnyi Creek (4-5 km south of the mountain), 06 Jul 1968.
In addition to its extensive Alaskan plant collection, the Herbarium has the largest collection of Russian specimens outside of Russia.
Getting its large archives to a secure and stable location is an important first step for the herbarium. Ultimately, using database software currently available on Corral, researchers will be able to do complex data-driven analysis based on the archival files, for instance, comparing specimens from different places and time periods to map how climate change is affecting the Arctic.
Building on the successful relationship between TACC and the University of Alaska, Corral is also hosting a large collection of killer whale songs, and TACC is developing a partnership with the University of Alaska Museum of the North to digitize and host specimens from a database of Arctic dinosaur fossils.
Bigger Deluges of Data
At approximately five terabytes, the University of Alaska’s digital datasets are some of the smaller collections on Corral. The big players, whose funding commitments made Corral possible, are The Center for Predictive Engineering and Computational Sciences (PECOS) project from the Institute for Computational Engineering and Sciences (ICES) at The University of Texas at Austin, and research at the Center for Space Research, also at The University of Texas at Austin [see descriptions below].
These projects anticipate using more than 100 terabytes of storage each and are greatly helped by the fact that Corral is directly linked to TACC’s large computational and visualization systems.
“You can bring your data into Corral and do some computation using Ranger, perform some further analysis using Spur, and then visualize the data at the Vislab, without ever moving any of the data around,” Jordan said.
TACC expects that within two years, Corral, the largest system of its kind at a supercomputing center, will be completely full. For that reason, the center designed the system so its capacity could be doubled, tripled or made ten times larger.
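The arithmetic behind that expectation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TACC's planning model: it treats 1.2 petabytes as 1,200 decimal terabytes, reads "two years" as 24 months, and assumes storage fills at a constant rate. The function names are hypothetical.

```python
# Rough capacity-planning arithmetic for the figures quoted above.
# Assumptions (not from TACC): decimal terabytes, "two years" taken as
# 24 months, and storage that fills at a constant rate.

CAPACITY_TB = 1200      # 1.2 petabytes expressed in terabytes
MONTHS_TO_FULL = 24     # "within two years"


def average_fill_rate_tb_per_month(capacity_tb, months):
    """Average monthly growth needed to fill capacity_tb in the given months."""
    if months <= 0:
        raise ValueError("months must be positive")
    return capacity_tb / months


def expanded_capacity_tb(capacity_tb, factor):
    """Capacity after scaling the system by an integer factor."""
    return capacity_tb * factor


rate = average_fill_rate_tb_per_month(CAPACITY_TB, MONTHS_TO_FULL)
print(f"Implied average growth: {rate:.0f} TB per month")

# The three expansion options named in the article.
for factor in (2, 3, 10):
    total_pb = expanded_capacity_tb(CAPACITY_TB, factor) / 1000
    print(f"{factor}x expansion: {total_pb:.1f} PB")
```

Under these assumptions, filling the system in two years implies average growth of about 50 terabytes a month, and the tenfold expansion would bring total capacity to 12 petabytes.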
With its suite of high-performance and data-intensive systems in place, Jordan believes TACC is well positioned to be at the forefront of advances in data-driven applications and analysis.
“The advantage of having Corral is that we have the ability to offer services based on new methodologies, and to support them in a very flexible way,” said Jordan. “This gives us the opportunity to learn what some of the best practices are and share that information with a wide variety of projects.”
• PECOS Engineering Simulation Project, The University of Texas at Austin – The Center for Predictive Engineering and Computational Sciences (PECOS) is a new Department of Energy-funded Center of Excellence within the Institute for Computational Engineering and Sciences at The University of Texas at Austin. The PECOS project will develop the next generation of advanced computational methods for predictive simulation of multiscale, multiphysics phenomena, and apply these methods to the problem of reentry of vehicles into the atmosphere. PECOS hopes to advance the science and modeling of atmospheric reentry and the science of predictive simulation. Corral will be used to process, manage and store the images and other data generated by the project, and will provide high-speed access to this data for researchers and members of the public anywhere in the world. More information available at: http://www.ices.utexas.edu/centers/pecos/
• Herbarium Digitization, The University of Alaska Museum of the North – One of the world's premier collections of arctic and boreal plants. With support from the National Science Foundation, the Herbarium is taking high-resolution digital photographs of 230,000 pressed plants to capture data about the collection and to make these specimens more accessible for research and education. The images are archived as digital negatives, the most data-intensive file format, preserving all of the data captured by the camera. Making these images publicly available requires four terabytes of rapidly accessible Web storage. Corral will be used to process, manage and store the digital images and other data generated by the project, and will provide high-speed access to this data for researchers and members of the public anywhere in the world. More information available at: http://arctos.database.museum/uam_herb
• Center for Space Research (CSR), The University of Texas at Austin – CSR will use Corral for two important space-based projects: imagery data and geospatial data for emergency response operations, and high-precision gravity data processing. As part of CESAR (Cyberinfrastructure for Emergency Situation Assessment and Response), Corral will be used to rapidly access the ‘framework’ geospatial data needed for emergency response operations during natural and man-made disasters. Framework data are the most recent, high-resolution aerial and orbital imagery and elevation data sets. CSR will also use Corral to store the data sets collected during a major event, such as Hurricane Ike, for distribution to state and federal agencies, and universities performing disaster research.
The Gravity Recovery and Climate Experiment (GRACE) is providing a continuous, multi-year record of the spatial and temporal variations in the Earth’s mass through measurements of its gravity field, and has provided new insights into the evolution of the Earth's climate system. The group expects to collect a few terabytes of original data and 20 to 40 terabytes of analysis results. Corral will house the data online for rapid mission reprocessing and scientific analysis. In addition, Corral will host the output products online for analysis of multi-year data sets. More information available at: http://www.csr.utexas.edu/grace/
• Institute for Classical Archaeology (ICA), Liberal Arts, The University of Texas at Austin – ICA will use Corral to preserve, protect and disseminate two dynamic datasets to the wider academic community and the public. The first dataset contains information gathered during an intensive field survey of ancient sites in the territory of Metaponto in South Italy, where data were documented using GPS and incorporated with remote-sensing imagery into a geographic information system. The second dataset involves excavations in an area of the Greek, Roman and Byzantine city of Chersonesos in Crimea (Ukraine). These spatial and contextual datasets also contain extensive data produced in the course of specialist research into forensic anthropology and ancient agriculture and technology. More information available at: http://www.utexas.edu/research/ica/
• The Hobby-Eberly Telescope Dark Energy Experiment (HETDEX), The University of Texas at Austin – The HETDEX project at McDonald Observatory is the first major experiment to probe dark energy, the mysterious force causing the expansion of the universe to speed up over time. Over three years, HETDEX will collect data on at least one million galaxies that are nine billion to 11 billion light-years away, yielding the largest map of the universe ever produced. The map will allow astronomers to measure how fast the universe was expanding at different times in history. The project will generate several tens of terabytes of data in a realm previously unexplored by astronomers, of which the project itself will use a small fraction. TACC will archive the dataset for use by the wider astronomical community, and provide a public Web portal. More information available at: http://hetdex.org
Some of these data collections are as small as five terabytes, while some are as large as 100 terabytes.
Steffi Ickert-Bond's work is supported by grants from the National Science Foundation Division of Biological Infrastructure.
Aaron Dubrow and Faith Singer-Villalobos
Texas Advanced Computing Center
Science and Technology Writer