An Interview with ISC2003 Keynote Presenter Dr. Jim Gray, Microsoft

Dr. Gray is a specialist in database and transaction-processing computer systems. At the Microsoft Bay Area Research Center in San Francisco, he focuses on scalable computing: building super-servers and work-group systems from commodity software and hardware. In this interview he discusses the role databases play in the commercial and the scientific worlds, as well as the use of the cost-effective Microsoft SQL Server in academic institutions.

Q: In your biography I read that you are a specialist in database and transaction-processing computer systems. Do you come from the commercial field, and what were your tasks there?

A: I have worked primarily in the areas of databases and transaction processing, focusing on the plumbing issues like defining what a transaction is, implementing concurrency control and recovery, making systems fault-tolerant, and making systems scalable. This has certainly been in the context of money-making companies like IBM, Tandem, DEC, and now Microsoft. So yes, it has had significant commercial impact.

Q: Now you are a Microsoft employee, working on scalable computing: building super-servers and work-group systems from commodity software and hardware. Microsoft is not well known in the high-end market and is only now starting to support servers with its 64-bit operating system. Can you describe in which direction your research at Microsoft is going? Are you evaluating systems that will later become available in the Microsoft portfolio?

A: Few people seem to have noticed that Microsoft SQL Server leads Oracle and DB2 in the performance rankings now. It has the best performance and best price/performance of any mainstream database system (see http://www.tpc.org/tpcc/results/tpcc_perf_results.asp and http://www.tpc.org/tpcc/results/tpcc_perf_results.asp?resulttype=noncluster&version=5). I have been building multi-terabyte databases (Terraserver.net, SkyServer.SDSS.org) as ways to explore the scalability issues of large and high-traffic web servers and database servers. There are lots of papers about this on my website. Yes, many of the ideas we come up with find their way into Microsoft products, and I spend considerable time working with the Microsoft developers on new products and product ideas.

Q: Your keynote talk at ISC 2003 focuses on the computing challenges in analyzing and mining petabyte-scale scientific data. Do you use methods similar to those known in the commercial world, e.g. for analyzing customer behavior, or do you use other algorithms? What are the differences and similarities between analyzing and mining data in the scientific and the commercial world?

A: Many people use a multi-petabyte Google database every day. Many use the multi-hundred-terabyte Hotmail and Yahoo! databases every day. Scientific computing is a small niche of the information technology industry, and most of the innovation is outside the scientific community. Indeed, I am trying to apply the techniques developed in the commercial world for large databases, parallelism, web services, XML, data mining, and visualization to science problems. Scientists are largely operating with files and some form of super-grep rather than with data structured in a database with non-procedural query languages. The typical Wal-Mart purchasing agent works with a 300-terabyte database and has great data mining and visualization tools to explore that data. It is paradoxical that these folks have better tools than the scientists.
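To make the contrast concrete, here is a minimal Python sketch of the two working styles Dr. Gray describes: a procedural scan over flat files (the "super-grep" approach) versus a declarative, non-procedural query handed to a database engine. The table and column names (photo_obj, ra, dec, r_mag) are hypothetical stand-ins for a SkyServer-style catalog, and sqlite3 is used only so the sketch is self-contained; the systems discussed in the interview run on Microsoft SQL Server.

# Sketch only: hypothetical catalog schema, sqlite3 as a stand-in engine.
import csv
import sqlite3

# --- Style 1: procedural scan over flat files (the "super-grep" approach) ---
def bright_objects_from_csv(path, limit_mag=18.0):
    """Scan a CSV catalog row by row and keep objects brighter than limit_mag."""
    hits = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if float(row["r_mag"]) < limit_mag:
                hits.append((float(row["ra"]), float(row["dec"]), float(row["r_mag"])))
    return hits

# --- Style 2: declarative query; the engine plans the scan, indexes, parallelism ---
def bright_objects_from_db(conn, limit_mag=18.0):
    """State *what* we want; the database decides *how* to find it."""
    return conn.execute(
        "SELECT ra, dec, r_mag FROM photo_obj WHERE r_mag < ? ORDER BY r_mag",
        (limit_mag,),
    ).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE photo_obj (ra REAL, dec REAL, r_mag REAL)")
    conn.executemany(
        "INSERT INTO photo_obj VALUES (?, ?, ?)",
        [(185.0, 15.8, 17.2), (185.1, 15.9, 21.4), (185.2, 16.0, 16.5)],
    )
    print(bright_objects_from_db(conn))  # prints the two objects with r_mag < 18

The point of the second style is that the engine, not the scientist, decides how to use indexes and parallelism, which is exactly the advantage the commercial tools exploit.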
Q: Can you summarize the most important topics in the structure of the database, in the queries, and in the visualization tools?

A: Not really; it is a very broad topic. For the physical sciences the issue is primarily scalability: dealing with large datasets and developing linear-time algorithms that are amenable to parallel execution. But in bio-informatics there is relatively less data (mere terabytes), and so the issues are much more algorithmic. In some sciences there is a real desire to share and exchange data, so common schemas and web services are the central issues. There is not a single issue that covers all the disciplines. In working with the astronomy community I am starting to focus on this latter issue of data interchange via XML schemas and web services.

Q: Are there specific hardware and software requirements, or can you use commodity software like Oracle? What are the requirements of the data management system? Do you use a specific data structure to improve the performance of the analysis?

A: The TPC benchmarks show Oracle prices are the highest in the industry (and Microsoft's are the lowest). But, yes, we are all using high-volume, low-priced hardware and software so that we can afford more of it. As you might guess, I am working primarily on Windows/Intel equipment, but everything I do must interoperate with Macintosh and GNU/Linux. Web services make that a LOT easier. The main requirement for the data management system is that it be easy to install, operate, and use; and it must be rock solid. If it cannot pass those tests, then nothing else matters. Once you get past that, it must be able to scan through large volumes of data quickly by using parallelism and sophisticated algorithms. The database folks have made huge progress in the last decade. There is still a long way to go, but the current products are all quite functional and useful.

Q: Which type of machines are you using in your analysis of astronomical data?

A: We have a few Intel Itanium machines, but mostly we work with IA32 machines with about 4 GB of memory and typically 10 disks per CPU. Most of the machines are dual processors. The astronomers are not rich, so they typically buy machines with the best price-performance. The fact that Microsoft makes server and tool software freely available to the academic community via MSDN Academic allows the scientists to use Windows.

Q: Which data types do you use, and what are the advantages of XML?

A: In both the TerraServer and the astronomy data, more than 90% of the bytes are pixel arrays. These are represented as binary large objects in the database, either raw or JPEG. After that there is a mix of numeric data (floating point and integer), text data (Unicode), and temporal data (UTC timestamps). These primitive types are exported either as comma-separated lists for traditional tools, or as XML-schematized documents using the XSD base types and the XSD constructors for lists and structures. There is an evolving stack of higher-level standards for MIME documents (DIME) and for data sets (collections of collections of records). The advantage of XML is that it is a common language that allows us to exchange data between heterogeneous systems.
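As an aside, the two export paths mentioned above can be sketched in a few lines of Python: the same rows written once as a comma-separated list for traditional tools, and once as a small XML document for exchange between heterogeneous systems. The element names (catalog, object, ra, dec, r_mag) are hypothetical, not the actual SDSS or VOTable schema, and a real service would validate the XML against a published XSD.

# Sketch only: hypothetical element names, standard-library modules.
import csv
import io
import xml.etree.ElementTree as ET

rows = [
    {"ra": 185.0, "dec": 15.8, "r_mag": 17.2},
    {"ra": 185.2, "dec": 16.0, "r_mag": 16.5},
]

# Comma-separated export: lowest common denominator, readable by any tool.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["ra", "dec", "r_mag"])
writer.writeheader()
writer.writerows(rows)
print(csv_buf.getvalue())

# XML export: self-describing, typed by an agreed schema, suited to web services.
catalog = ET.Element("catalog")
for r in rows:
    obj = ET.SubElement(catalog, "object")
    for name, value in r.items():
        ET.SubElement(obj, name).text = str(value)
print(ET.tostring(catalog, encoding="unicode"))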
Q: Can you please give a summary of computational science, which is my personal background?

A: There are now two branches of computational science: comp-x and x-info (for x in physics, chem, bio, eco, astro, art, music, ...). Computers have traditionally been used to model dynamic systems that are so complex that they have no closed-form solution. But science is now getting instruments (including simulators) that generate vast quantities of data, and the data is being integrated by the Internet. So a new branch is emerging that works on extracting information from this data. That is the database/data mining/statistics/visualization branch: the x-info branch.

Thank you very much, Dr. Gray, for this interview. Now I am eager to hear your keynote presentation on June 25, 2003 at the International Supercomputer Conference in Heidelberg. http://www.isc2003.org/

Uwe Harms, Harms-Supercomputing-Consulting