Webcast: From Data Deluge to Useful Knowledge: Data Sharing and Preservation with iRODS for Multi-Agency NITRD Committee
As a promising approach to "taming the digital deluge," this unusual demo reached a high-level multi-agency audience interested in the practical iRODS implementation of state-of-the-art preservation and sharing (data grid) technologies for digital data collections reaching hundreds of millions of files and petabytes of data, distributed among collaborating projects across the nation and around the world.
With the size of digital data collections expected to double in just five years, how will it be possible to organize, share, and extract useful knowledge from this deluge of data? How will it be possible to preserve this digital information for future generations, when it can disappear through the crash of a hard drive, the obsolescence of software applications, or the proliferation of proprietary formats?
Interest in meeting these challenges is high, and a large crowd recently gathered at the National Science Foundation for a “Technical Demonstration of an Integrated Preservation Infrastructure Prototype,” at the invitation of NITRD, the multi-agency National Science and Technology Council subcommittee on Networking and Information Technology Research and Development.
The webcast and slides can be viewed at http://irods.org/index.php/iRODS_Videos.
The demonstration showed how it is possible to build, share, and preserve large digital collections using iRODS, the innovative Integrated Rule-Oriented Data System. It was presented by Dr. Reagan Moore of the University of North Carolina at Chapel Hill (UNC), along with the team of collaborating UK researcher Dr. Paul Watry of the University of Liverpool.
The open source iRODS Data System, developed by the Data Intensive Cyber Environments Center (DICE Center) at UNC, led by Moore, integrates a number of key capabilities that let users build sharable “virtual” data collections and preserve them long term, even if the data collections are distributed across different projects, locations, hardware and software, or even different countries.
In the demonstration, the researchers showed 11 live implementations of iRODS capabilities following digital data through its complete life cycle, from birth in research projects to valuable reference collections used by wider communities to long-term preservation environments that can keep today’s information available for tomorrow’s society to harness in as yet unimagined ways.
The steps of the data life cycle include:
- Assembling a Digital Collection from data distributed across the hall or around the world
- Using the innovative iRODS Rule Engine to apply Policy-Based Management, providing the automation needed to feasibly build and use today’s massive digital collections
- Sharing data across widespread projects in a secure Data Grid or “intelligent cloud”
- Publishing digital reference collections in a Digital Library, and
- Preserving reference collections for future generations in a Trustworthy Preservation Environment.
To keep up with today’s mushrooming digital collections, the iRODS system can manage tens to hundreds of millions of files in collections containing hundreds of terabytes to petabytes of data (a petabyte is about 1,000 terabytes, equivalent to about two billion books or some 10 million trees).
The well-attended demonstration attracted representatives from multiple federal agencies. With 13 Federal agencies as members, the NITRD Program is the primary mechanism by which the Federal government coordinates unclassified networking and information technology research and development investments. These agencies work together to develop a broad spectrum of advanced information technology capabilities to power Federal missions; U.S. science, engineering, and technology leadership; and U.S. economic competitiveness, leveraging strengths, avoiding duplication, and increasing interoperability.
Because the development history of iRODS reflects more than a decade of user-driven applications across Federal agencies, including NSF, NARA, DOE, NASA, NIH, DOD, NHPRC, IMLS and others, the technology is practical and “real-world,” and it is being adopted in projects at multiple US agencies as well as in international projects around the world.
The researchers demonstrated iRODS, which is funded by the NSF and the National Archives and Records Administration (NARA), running on remote systems distributed across three states and the UK. The capabilities are being developed for use in projects ranging from the NARA Transcontinental Persistent Archive Prototype to the NSF Ocean Observatories Initiative, the NSF Science of Learning Centers, and the National Virtual Observatory.
A key part of both sharing and preserving data is being able to connect or “interoperate” with different repositories, projects, file formats, and data management systems. For example, as part of the demonstration, Watry’s researchers from the EU Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN) project demonstrated integration of the Cheshire3 text analysis and Multivalent preservation technology with iRODS. The integrated system provides the ability to “read” and index most older text file formats, keeping information such as obsolete WordStar word processing files alive and within reach. This opens the door to searching, mining, and using yesterday’s legacy information to extract new knowledge for tomorrow.
Additional examples of integration with iRODS include the use of EnginFrame as a portal for accessing a preservation environment, the use of the Davis WebDAV interface developed by the Australian Research Collaboration Service to display a digital library, and the use of the Monitoring System developed by the French Institut National de Physique Nucléaire et de Physique des Particules to track storage space and usage in the data management system.
The ability to connect and interoperate with other systems leverages the development investment in iRODS, extending the benefits more broadly and helping weave an integrated “fabric of knowledge” across society.
The DICE Center at UNC is affiliated with the Renaissance Computing Institute (RENCI) and the UNC School of Information and Library Science (SILS), with core iRODS development at the Institute for Neural Computation (INC) at UC San Diego.
The video stream and accompanying presentation slides for this demonstration are available on the web through the iRODS site of the DICE Center at UNC. Video: https://www.irods.org/index.php/iRODS_Videos and slides: https://www.irods.org/pubs/iRODS-demo-090804.pdf.