A Window on the Archives of the Future

Category: SCIENCE; Published: February 7, 2011, 4:42 pm

TACC partners with the National Archives to find solutions to the federal government’s digital records challenge

How does an archivist discover relationships or find information in a sea of millions of digital records? With the proliferation of digital records, the task of an archivist has grown exponentially more complex. This problem is especially acute for the National Archives and Records Administration (NARA), the federal government agency responsible for managing, preserving and ensuring transparency in access to federal government digital records documenting our nation’s history, our democratic processes, and the rights of American citizens.

To give a sense of the scale of the problem, at the end of the administration of President George W. Bush, NARA received roughly thirty-five times as much data as it had received from the administration of President Clinton, which itself was many times that of the previous administration. With the federal government increasingly using advanced technologies, including social media and cloud computing, to contribute to digital democracy and open government, this trend is not expected to decline, and NARA has a crucial role in discovering answers to today’s archival challenges.

“The National Archives is a unique national institution and public trust responsive to requirements for preservation, access, and the continued use of government records,” explained Robert Chadduck, Acting Director for the National Archives Center for Advanced Systems and Technologies (NCAST), NARA’s lead technology organization for advanced and applied research in the fields of computer science, engineering, and archival science.

According to Chadduck, over the past decade, high performance and data intensive computing have emerged as crucial tools to address digital records challenges. To find innovative and scalable solutions for very large and heterogeneous electronic records collections, NARA joined with the National Science Foundation’s Office of Cyberinfrastructure in turning to the Texas Advanced Computing Center (TACC), drawing on the expertise of the Center’s digital archivists and technologists, including visualization and data experts.

A preservation view of the US Geological Survey Record Group, which includes multiple file formats organized in diverse arrangements, shows, in coded colors, the different preservation risk levels of the files.

“For the government and our nation to effectively respond to all of the requirements that are associated with very large digital record collections, innovative approaches and tools are needed, including those embodied in the class of cyberinfrastructure that is currently under development at TACC,” Chadduck said.

Teasing Out Meaning

In collaborating with NARA, members of TACC’s Data and Information Analysis group developed a multi-pronged approach to address technical challenges. The overall goal of their research is to investigate different data analysis methods within a visualization framework. The visualization interface is the bridge between the archivist and the analysis results, which are rendered visually onscreen as the archivists make selections and interact with the data. The results are presented as forms, colors and ranges of color to assist in synthesis and to facilitate an understanding of large-scale electronic records collections.

“Archival analysis is a multi-layered process and it is unique to each collection that is being assessed,” explained Maria Esteva, a digital archivist and data management and collections researcher at TACC. “We are conducting research to map analysis processes used by archivists onto a visualization that combines data driven analysis tools. In this way, the archivist can integrate his or her experience into the workflow.”

The first step in the project was to represent a large and heterogeneous archival collection.

“We are all familiar with desktop icons, representing folders and files,” Esteva said. “But imagine a screen clogged with millions of such icons, with little clue as to what is inside. It takes a visual representation to show millions of files at a time.”

Using test collections obtained from NARA’s Cyberinfrastructure for a Billion Electronic Records (“CI-BER”) research collaboration, which targets next-generation technologies responsive to digital records requirements, TACC created a visualization based on “treemap” and relational database systems to represent the collection’s arrangement and to show its properties at different levels of aggregation, classification and abstraction. Treemaps use nested rectangles to display hierarchically structured data, and use sizes and colors to show properties such as file format, file size, number of files and preservation risk. Specifically, the TACC treemap presents a view of a 3-million-file NARA test collection that allows one to identify the differences between groups of records through visual cues, and to distinguish patterns and outliers that would be difficult to spot otherwise. Moreover, it allows archivists to learn by visually comparing and contrasting groups of records.
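To make the nested-rectangle idea concrete, here is a minimal sketch of one common treemap layout, usually called slice-and-dice. It is an illustration only, not TACC's implementation: the `Node` class, the `layout` function, and the sample directory names and file counts are all hypothetical, and the mapping of properties to colors is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A directory (has children) or a group of files (has a file count)."""
    name: str
    file_count: int = 0
    children: list[Node] = field(default_factory=list)

    def weight(self) -> int:
        # A directory's weight is the total number of files beneath it.
        if self.children:
            return sum(child.weight() for child in self.children)
        return self.file_count


def layout(node, x, y, width, height, depth=0, out=None):
    """Slice-and-dice layout: divide the rectangle among the children in
    proportion to their weight, alternating the split direction per level."""
    if out is None:
        out = []
    out.append((node.name, depth, x, y, width, height))
    total = node.weight()
    if not node.children or total == 0:
        return out
    offset = 0.0
    for child in node.children:
        share = child.weight() / total
        if depth % 2 == 0:  # even depth: split along the x axis
            layout(child, x + offset * width, y,
                   share * width, height, depth + 1, out)
        else:               # odd depth: split along the y axis
            layout(child, x, y + offset * height,
                   width, share * height, depth + 1, out)
        offset += share
    return out


# Hypothetical collection: directory names and file counts are made up.
collection = Node("collection", children=[
    Node("maps", children=[Node("tiff", 600), Node("pdf", 200)]),
    Node("reports", children=[Node("doc", 200)]),
])

rects = layout(collection, 0.0, 0.0, 100.0, 100.0)
for name, depth, x, y, w, h in rects:
    print(f"{'  ' * depth}{name}: x={x:.0f} y={y:.0f} w={w:.0f} h={h:.0f}")
```

Each rectangle's area is proportional to the number of files it contains, so a directory holding most of the collection dominates the view. A second property, such as file size or preservation risk, would then be mapped to the fill color.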

Presentation of the entire testbed collection represented as a treemap in which the archivist can assess correspondence between number of files (size of the directory) and size of files (ranges of yellow) and their distribution in directories.

Because large amounts of data cannot be comprehended at once, well-designed visualizations must provide a path that goes from an overview to a detailed perspective in order to facilitate a clear understanding of massive collections.

“A fundamental aspect of our research involves determining if the representation and the data abstractions are meaningful to archivists conducting analysis, if they allow them to have a clear and thorough understanding of the collection,” said Esteva. “For the archivist, it takes getting used to doing analysis based on abstractions and not on discrete data.”

Archivists spend significant amounts of time determining the ways in which a collection is organized so that they can describe it for access purposes. One of the data-driven analysis tools incorporated in the visualization framework uses alignment algorithms and natural language processing methods to discover whether a group of records is organized by date, by place, or sequentially.

Results represented by colors and shown across diverse groups of records make it possible to identify organizational trends in the collection. This alignment method, developed by TACC researcher and group member Weijia Xu, was based on his experience creating tools for biology applications. The researchers presented their findings at the 6th International Digital Curation Conference, the 2010 Joint Conference on Digital Libraries, and the 2010 E-Records Forum, jointly sponsored by the NARA-Southwest Region, the Texas State Library and Archives Commission, and the University of Texas at Austin School of Information.
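The question of whether a group of records is arranged by date or sequentially can be illustrated with a much simpler stand-in than the alignment method described above. The sketch below is a plain pattern-matching heuristic over file names, not the TACC technique; the function names, the regular expressions, the 0.8 threshold, and the sample file names are all assumptions made for illustration, and detection of arrangement by place is not attempted.

```python
import re
from datetime import datetime

# Assumed naming conventions: dates such as 2009-01-15, 2009_01_15 or 20090115.
DATE_PATTERN = re.compile(r"(\d{4})[-_]?(\d{2})[-_]?(\d{2})")
NUMBER_PATTERN = re.compile(r"(\d+)")


def find_date(name):
    """Return the first valid calendar date embedded in a name, or None."""
    match = DATE_PATTERN.search(name)
    if not match:
        return None
    try:
        return datetime(*map(int, match.groups()))
    except ValueError:  # e.g. month 13 or day 42
        return None


def guess_arrangement(names, threshold=0.8):
    """Guess whether a group of file names is arranged by 'date', as a
    numbered 'sequence', or in some 'unknown' way."""
    if not names:
        return "unknown"

    dated = [d for d in map(find_date, names) if d is not None]
    if len(dated) / len(names) >= threshold:
        return "date"

    numbers = []
    for name in names:
        match = NUMBER_PATTERN.search(name)
        if match:
            numbers.append(int(match.group(1)))
    if len(numbers) / len(names) >= threshold:
        ordered = sorted(numbers)
        steps = [b - a for a, b in zip(ordered, ordered[1:])]
        # A sequence here means consecutive numbers with no gaps.
        if steps and all(step == 1 for step in steps):
            return "sequence"

    return "unknown"


# Hypothetical file names for three groups of records.
by_date = ["report_2009-01-15.pdf", "report_2009-02-15.pdf", "report_2009-03-15.pdf"]
by_number = ["scan_001.tif", "scan_002.tif", "scan_003.tif"]
mixed = ["notes.txt", "draft.doc"]

print(guess_arrangement(by_date))
print(guess_arrangement(by_number))
print(guess_arrangement(mixed))
```

In a visualization like the one described, the label returned for each directory could be mapped to a color, so that organizational trends across many groups of records show up at a glance.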

Throughout the research process, the team has sought feedback from archivists and information specialists on The University of Texas at Austin campus, and in the Austin community, in the form of user experience studies and focus group discussions.

“This research addresses many of the problems associated with comprehending the preservation complexities of large and varied digital collections,” said Jennifer Lee, head librarian for preservation and digitization services at The University of Texas at Austin. “The ability to assess varied characteristics and to compare selected file attributes across a vast collection is a breakthrough.”

Evolving Solutions

In October 2010, TACC’s researchers learned that they had received approximately $475,000 from NARA, as part of research jointly supported with the NSF Office of Cyberinfrastructure, for the second year of the collaboration. The underlying technology challenges targeted in this collaboration are highlighted among the President’s priorities for federal networking and information technology coordinated research in 2011.

“TACC’s nationally recognized expertise, cyberinfrastructure, and technical capabilities constitute significant national investments,” said Chadduck. “The understanding of how such cyberinfrastructure and capabilities may effectively address digital records challenges is at the core of our collaboration with TACC.”

The data deluge afflicting digital collections needs to be addressed with data-driven tools and methods that take advantage of, and learn from, the size and diversity of the information. At the same time, this data needs to be presented in ways that facilitate understanding and enrich human analysis. Neither task is simple.

TACC’s experts are currently building a multi-touch enabled tiled display system to improve interactivity and to enhance the collaborative aspects of visual analysis for multiple users. The new system will amplify the benefit of the visual analytics process by enabling multiple users to explore data concurrently while having discussions. The group is also experimenting with cloud-computing methods, using the distributed storage of TACC’s Longhorn cluster with the open-source "Hadoop" package to scale data analysis methods.

“Hadoop is an implementation of the MapReduce programming model that Google used to process and index billions of web pages,” said Weijia Xu. “Using Hadoop is a necessary step to process data effectively at the petabyte scale.”
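The MapReduce model Xu refers to can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases, not Hadoop code and not TACC's analysis: the function names and the sample file paths are hypothetical, and a real Hadoop job would distribute the same three phases across many machines.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def map_phase(path):
    """Map: emit one (key, value) pair per record, here (file extension, 1)."""
    extension = PurePosixPath(path).suffix.lower() or "(none)"
    yield extension, 1


def shuffle(pairs):
    """Shuffle: group together all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


# Hypothetical file paths standing in for an archival collection.
paths = [
    "usgs/maps/quad_001.TIF",
    "usgs/maps/quad_002.tif",
    "usgs/reports/summary.pdf",
    "usgs/reports/README",
]

pairs = (pair for path in paths for pair in map_phase(path))
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both phases can run in parallel on separate pieces of the data, which is what makes the model attractive for very large collections.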

The collaboration between NARA and TACC is leading to the development of tools that combine the power of advanced computing with the experience and skills of archivists and data curators.

“Technology research led by TACC today is yielding results that will be eventually integrated into the cyberinfrastructure of our country. At that point these technologies researched today will become commonplace,” said Chadduck. “In that way, TACC is providing what I believe is a window on the archives of the future.”
