NCSA plays key role in digital archiving project

At the University of Illinois' Spurlock Museum, visitors can see tens of thousands of artifacts from six continents that span 1 million years of human culture and history. The collection includes Mesopotamian cuneiform tablets, Amazonian bark cloth, and Merovingian bronzes, all carefully preserved. A wealth of information is available for each of the more than 43,000 pieces: description, location of origin, date, dimensions, and so on. Items are displayed for the public and are accessible for research and education.

Now imagine trying to preserve digital information with the same care. How can the diversity of digital files—words, pictures, music, webpages, government documents, and scientific data in multitudinous formats—be captured? Where and how can it be stored? How can you ensure that it continues to be accessible as technology evolves? How can you make sure the information about the information—so-called "meta-data"—remains intact and communicates clearly? And how do you keep pace with the tremendous sea of digital data being generated?

These are among the many challenging questions being tackled by digital preservation experts, including those at the University of Illinois. The University is one of the lead institutions for the National Digital Information Infrastructure and Preservation Program (NDIIPP), a massive Library of Congress effort to save at-risk digital materials nationwide. The University's projects are led by John Unsworth, dean of the Graduate School of Library and Information Science, and Beth Sandore, associate university librarian for information technology planning and policy. Campus collaborators include WILL-AM, -FM and -TV; the Division of Management Information; and NCSA.

At NCSA, the Digital Library Technologies group focuses on the semantic archiving aspect of NDIIPP, collaborating with a team led by Dave Dubin, a research associate professor of library and information science. The goal of this effort is to build a proof-of-concept semantic archive demonstrating how semantic inference could help next-generation archives head off long-term preservation risks.

Current repository systems preserve the structure of information, not its meaning or semantics. When content is moved from one system to another, this structure may be subtly or not so subtly transformed. To meaningfully preserve digital content over time, it is necessary to infer meaning from structures that change over time. Given the ever-increasing volume of digital data being created, automated tools for this task are essential.

Dubin has developed software, called BECHAMEL, that flags possible points of information loss or confusion, reducing long-term preservation risks. BECHAMEL emerged from a joint effort by researchers at the University of Illinois, the University of Bergen, and the World Wide Web Consortium to develop a research platform for interpreting structured digital documents. NCSA is working with Dubin's team to scale BECHAMEL from the research lab to a production environment.

NCSA's Joe Futrelle says there are many ways in which meta-data can be lost, incomplete, or misleading. The way archive frameworks are built and the instructions for preserving meta-data can themselves introduce preservation risks. "It's possible for each step to be correct but to yield an incorrect result because of built-in ambiguity in the specifications," he says. For example, the meta-data for a digital file—a photo or map or document—might include a field called "creator." Putting a name like "John Smith" in this field might seem sufficient, but does that really identify the creator of the information? In 50 years, will a future researcher be able to pinpoint which of the world's many John Smiths created the information?
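The contrast is easy to sketch in RDF, one of the semantic Web languages discussed below. This Python example uses the rdflib library; the archive URI, the photo, and the person identifier are all invented for illustration, and nothing here is BECHAMEL's own code.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC, FOAF

g = Graph()
g.bind("dc", DC)
g.bind("foaf", FOAF)

photo = URIRef("http://example.org/archive/photo42")

# Ambiguous: a bare name in the creator field. Which of the
# world's many John Smiths does this string refer to?
g.add((photo, DC.creator, Literal("John Smith")))

# Less ambiguous: a URI that names one specific person, to which
# further identifying facts (full name, dates, affiliation) can attach.
creator = URIRef("http://example.org/people/john-smith-1958")
g.add((photo, DC.creator, creator))
g.add((creator, FOAF.name, Literal("John Smith")))

print(g.serialize(format="turtle"))
```

The bare string and the URI look interchangeable today, but only the URI gives future software something unambiguous to resolve.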
BECHAMEL flags risks like that one, along with others such as numerical values that aren't accompanied by error ranges.

NCSA's work includes integrating BECHAMEL with semantic Web languages (like RDF and OWL) that are designed to enable a variety of software to exchange unambiguous descriptions of real-world entities. This means BECHAMEL can communicate the preservation risks it pinpoints in a portable, standard form. NCSA's Tupelo software acts as a bridge, enabling these risks to be published to semantic Web databases.

"Even over time within a single institution, data and meta-data have to move from one environment to another, and if we can help integrate tools like BECHAMEL into the process, we can catch and in some cases prevent preservation risks before information is lost," says Futrelle.
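To make that portable, standard form concrete, here is a minimal sketch of what a machine-readable risk statement could look like, again using Python's rdflib. The "risk" vocabulary and its class and property names are placeholders invented for this example; they are not BECHAMEL's or Tupelo's actual terms.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

# An invented vocabulary for describing preservation risks; the
# article does not specify the actual terms BECHAMEL or Tupelo use.
RISK = Namespace("http://example.org/ns/preservation-risk#")

g = Graph()
g.bind("risk", RISK)

dataset = URIRef("http://example.org/archive/survey-data-1987")
finding = URIRef("http://example.org/archive/findings/001")

# One flagged risk, stated as machine-readable triples: a numerical
# field in the dataset has no accompanying error range.
g.add((finding, RDF.type, RISK.PreservationRisk))
g.add((finding, RISK.concerns, dataset))
g.add((finding, RDFS.comment,
       Literal("Numerical values are not accompanied by error ranges")))

# Serialized this way, the finding can be loaded into any semantic Web
# database, the bridging role the article assigns to Tupelo.
print(g.serialize(format="turtle"))
```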