SDSC Announces New IBM DataStar Supercomputer

The San Diego Supercomputer Center (SDSC) at the University of California, San Diego, the leading-edge site of the National Science Foundation (NSF) National Partnership for Advanced Computational Infrastructure (NPACI), has announced that it will deploy a major new data-oriented computing resource. The new machine, SDSC DataStar, will be a 7 teraflop/s (seven trillion floating-point operations per second) IBM Regatta system that leverages SDSC's international leadership in data and knowledge systems to address the growing importance of large-scale data in scientific computing. The new system is designed to flexibly handle both data-intensive and traditional compute-intensive applications, and will be linked to the national information cyberinfrastructure. DataStar is scheduled for installation in the summer of 2003.

The new system will offer many innovations for users. Today, data collections for astronomy, physics, and other disciplines have reached terabyte size (trillions of bytes) and will grow to petabytes (1,000 times larger) within just the next few years. High-end computers are typically not configured to let users easily move large data sets into and out of the machine and compute with them, a significant impediment for scientists extending their simulations and analyses to the largest scales.

To help data-intensive users, DataStar will be specifically designed to host high-end, data-oriented computations. It will be integrated with SDSC's Storage Area Network (SAN), which will provide 500 terabytes of online disk, and with the six-petabyte High Performance Storage System (HPSS) for archival storage. DataStar will also be linked through the national information infrastructure grid to a wide spectrum of other resources.

To ensure the optimum use and usefulness of DataStar, SDSC is planning two national data meetings. The events will bring together scientific users of data-intensive computing resources with vendors and other experts in computing and data technologies to develop an optimal machine configuration based on application needs and requirements, and to help develop innovative administration and allocation approaches. The initial meeting of experts, to be held in May 2003, will focus on machine configuration -- balancing processors and node configuration with memory, disk, and I/O, along with user-friendly administration and allocation -- to make the machine highly usable for data-oriented, grid, and compute users. The May meeting will be followed in August by an intensive National Data-Intensive Computing Workshop for new and current users who want to get their applications running on DataStar, increase the size of their research problems, address new capabilities, and more.

"DataStar is being designed to be the best resource for data-oriented computing on the planet," said Fran Berman, director of SDSC and NPACI. "The configuration and administration of the machine will be user-driven to make it maximally effective for the community, and we expect important new science to happen on it. DataStar will be linked to the TeraGrid/ETF (Extensible Terascale Facility), the NPACI Grid, and the emerging national information cyberinfrastructure to enable a broad user community to create key advances for science and society."
The new system will support SDSC's role as the data-intensive site in the NSF-funded Extensible Terascale Facility (ETF), which includes SDSC, the National Center for Supercomputing Applications (NCSA) in Illinois, and the Pittsburgh Supercomputing Center (PSC), as well as Caltech and Argonne National Laboratory. The ETF is extending the TeraGrid, a multi-year effort to build and deploy the world's first large-scale production grid infrastructure for open scientific research. The TeraGrid/ETF sites are linked by the world's fastest networks, and the facility is being configured to function as a unified grid system.

The diversity of high-performance computing users is growing, with new data-intensive applications and new forms of access such as Web-based portals and on-demand computing services. These developments are driving different modes of allocation based on data, portals, and other services, and new allocation models are being considered for DataStar.

"By seeking user input on applications, configurations, and allocations for DataStar, we are initiating a process that will enable both current and new user communities to more effectively use this integrated computing and data resource to advance science," said Richard Moore, Executive Director of NPACI.

"The unique flexibility of DataStar will come in part from a mix of large and small nodes that will make it possible for users to do crucial pre- and post-processing on the same file system, without having to move jobs to separate machines," said Phil Andrews, Program Director for High-End Computing (HEC) at SDSC. "Combined with the 64-bit processor architecture that allows much larger data sets in memory, enhanced I/O, and half a petabyte of attached disk, the new resource will offer unique capabilities for data-intensive computing."

Chaitan Baru, co-director of SDSC's Data and Knowledge Systems (DAKS) program, notes that "for data-intensive applications, you're not just thinking of processors alone, but the combination of processors, memory, and attached disk that will make that application work best."

Early applications of the data machine will include the National Virtual Observatory (NVO) project, which is enabling new science by giving astronomers the ability to rapidly search and analyze discipline-wide image collections of the entire sky in multiple wavelengths.

"The data-oriented focus of this new resource will have a large impact on the growing number of communities managing and manipulating large collections," said Reagan Moore, co-director of SDSC's DAKS program. "In addition to traditional high-performance computing capabilities for data analysis and data mining, we're using data grid technology to integrate end-to-end capabilities, extending from online publication of data sets for sustained community access to preservation and archiving of crucial data."

Fran Berman sums it up: "DataStar will lead the way for science-driven, data-oriented computing at the highest level and will foster new paradigms and scientific advances, taking science to the next step in an immensely exciting way."