CTC Aids Social Science Researchers

The Cornell Theory Center (CTC) is supporting a two-year cyberinfrastructure project (2006-2008), funded by the National Science Foundation (NSF), entitled "Very Large Semi-Structured Datasets for Social Science Research." CTC is an interdisciplinary research center at Cornell University focused on providing cyberinfrastructure resources for research and education. The project is also part of a three-year Cornell-funded initiative entitled "Getting Connected: Social Science in the Age of Networks." A team at CTC is developing software tools to obtain and examine information from large internet data collections. The explosion of electronic and internet communication networks has created vast amounts of data whose enormous potential for basic and applied investigations in the social sciences has thus far been largely untapped.
In phase one, CTC is building software tools to copy and reorganize the Internet Archive's collection of Web "snapshots," taken every two months since 1996. The archive will be reconfigured as a database that makes very large online network data accessible to social science researchers at Cornell and elsewhere.
Cornell researchers involved in the endeavor include computer science (CS) professors William Arms, Jon Kleinberg, Daniel Huttenlocher, and Johannes Gehrke, Associate Director of CTC. Professor Michael Macy, department chair of sociology, is also on the project team, along with David Strang, also in sociology, and Geri Gay, chair of the communication department and professor of information science in computing and information science (CIS).
The Internet Archive is a non-profit organization started by Brewster Kahle that is preserving a record of the internet by capturing snapshots of 55 billion Web pages. CTC will transfer these pages from the Internet Archive servers to a computer server at CTC.
The plan is to have 30% of that data (about 200 terabytes) transferred by 2008. As the data streams from the Internet Archive to the server, it passes through a parsing pipeline developed by the research team and the Theory Center. The pipeline separates out the URL information, the content, and the links. Using cyberinfrastructure tools developed by the research group and CTC, social science researchers will then be able to examine and manipulate each piece of this data to study networks on the internet. This will allow them to validate theoretical models and help identify new trends.
As Gehrke explains, "Previously if social scientists wanted to study a village they needed to go and live with the villagers. Today they can study human behavior and interactions in a new world of data via the internet." The data CTC is transferring from the Internet Archive will allow social science researchers to study the Web as a social phenomenon. What role does the Web play in the diffusion of ideas and innovations, in the spread of urban legends, or as a source of information about contemporary social events? Previously, such studies could be based only on small, hand-coded samples. The transferred Web data allows researchers to conduct large-scale studies and creates a highly convenient virtual laboratory, or "WebLab," for the research.
These tasks would not have been feasible without cybertools, because the internet data exists only as an archive of individual Web pages, with no way to parse the data into meaningful parts and structures. Ultimately, CTC's assistance with this first phase of the investigation will allow researchers to gain insight into social networks and help them develop more advanced tools for linking computational social scientists across disciplines and organizations.
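The parsing step described above, which separates each archived page into URL information, content, and links, can be illustrated with a minimal Python sketch. This is not the project's actual pipeline, whose implementation is not described here; the names `ParsedPage` and `parse_page` are hypothetical, and the sketch assumes each record arrives as a URL plus a string of HTML.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _PageParser(HTMLParser):
    """Collects visible text and <a href> targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.hrefs = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


@dataclass
class ParsedPage:
    """Hypothetical record type: one page split into its three parts."""
    url: str
    host: str
    path: str
    content: str
    links: list = field(default_factory=list)


def parse_page(url, html):
    """Split one archived page into URL information, content, and links."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    parts = urlsplit(url)
    # Resolve relative links against the page's own URL.
    links = [urljoin(url, href) for href in parser.hrefs]
    return ParsedPage(
        url=url,
        host=parts.netloc,
        path=parts.path,
        content=" ".join(parser.text_parts),
        links=links,
    )


if __name__ == "__main__":
    sample = '<p>Hello <a href="b.html">world</a></p>'
    page = parse_page("http://example.org/a/index.html", sample)
    print(page.host, page.links, page.content)
```

Storing the host, text, and outgoing links as separate fields is what would let a researcher query the link structure between sites without rereading the page text, which is the kind of access the database is meant to provide.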
In addition to data from the Internet Archive, the Web Lab team is using Web crawls, data collected from the Wayback Machine, and data from NetScan, a Usenet analysis tool developed at Microsoft Research, to build smaller, more focused datasets. These datasets can be used to study specific networks, such as adolescent peer networks, and the relationship between individual attributes (e.g., personality, beliefs) and network position.