IDC TECHNOLOGY ASSESSMENT: A New High Bandwidth Supercomputer: The Cray X1

In this article, IDC delivers dependable insight and research on Cray’s challenges and opportunities.

By Earl Joseph II, Christopher G. Willard, Ph.D., Nicholas J. Kaufmann, and Debra S. Goldfarb

IDC OPINION

WHAT IS THE FIT FOR THE CRAY X1 SYSTEM IN THE HPC MARKET?

The immediate future for the Cray X1 looks good, given the announced advance orders and five early installations. A number of supercomputer buyers have expressed interest in the Cray X1, which means that if Cray can execute, it can regain some of the high-end market that its predecessor firm, Cray Research, lost over the past decade.

WHAT DESIGN TRADE-OFFS AND BUSINESS MODELS ARE REQUIRED TO DEVELOP MODERN HIGH BANDWIDTH SUPERCOMPUTERS?

Mainstream HPC vendors have followed a leveraged product strategy to successfully address the broad HPC market, leaving an unfulfilled segment at the high end of the market. Addressing this strategic segment requires a business model that combines government R&D funding and close customer partnerships with a targeted high performance system design philosophy.

CRAY ANNOUNCES A NEW SUPERCOMPUTER SYSTEM

On Nov. 14, 2002, Cray Inc. announced the Cray X1 supercomputer system. Cray previously announced that it had shipped five early-production Cray X1 systems. Early customers include the US Army, Spain’s National Institute of Meteorology, and additional undisclosed customers that we believe include DOD and NSA sites. The system was previously code named the Cray SV2, and represents a new vector-MPP architecture and overall design strategy for the company. Over the last decade Cray’s major product lines have been based on vector processors (the C90, T90 and SV1) and MPP architectures (the T3D and T3E). Cray elected to combine these two product lines into a single unified architecture for economic efficiency, and to differentiate itself from the mainstream market for lower priced (and lower efficiency) standard HPC computers.
THE CRAY X1 DESIGN

The overall design goal of the X1 is to provide both the historic high bandwidth of vector supercomputers and the efficient scaling of MPPs. This translates into several specific design elements, including:

* New custom processor architecture – The system is designed around custom multi-piped vector processors, with 12.8 Gflops peak performance per processor (25.6 Gflops for 32-bit computations).
** Processors use a new Instruction Set Architecture (ISA) that is partially based on the MIPS ISA, plus many additional instructions to support vector processing, special instructions and other enhancements (e.g. fixed instruction size, more registers, masked vector operations, large integer vector support, 32-bit data, and cache control).
** It carries over the multi-streaming concept introduced in the SV1. In addition, the processor design incorporates superscalar processing, integrated vector caches, and a decoupled microarchitecture that allows the processors to better tolerate memory latencies.
** Processors are configured with 8 vector pipes.
* Balanced high bandwidth memory systems – The system is organized into four-processor nodes, each of which contains 128 Rambus memory channels, for a local bandwidth of 200 GB/s. The nodes are connected by 16 parallel networks, providing 25 GB/s of point-to-point bandwidth. In maximum configurations, the network scales to over 4 TB/s of global bandwidth.
* Scalability – The Cray X1 is designed to provide high performance at both small and larger processor counts. The system scales to thousands of processors. The scalable address translation mechanisms and communication protocols were carried forward from the Cray T3E design.

CRAY X1 HIGHLIGHTS

The highlights of the new Cray X1 include:

* Scaling from 4 to over 4,000 processors
* Each processor is rated at 12.8 peak GFLOPS. The processors are constructed of four sets of scalar/vector units to create an MSP (Multi-Streaming Processor).
The processor chips run at 800 MHz for the vector units and 400 MHz for the scalar units.
** Providing 3.2 scalar GOPS and 12.8 vector GFLOPS per MSP processor (25.6 GFLOPS in 32-bit mode)
* High bandwidth, low latency memory system design:
** Processor bandwidth to cache is 76 GB/s (50 GB/s for loads and 26 GB/s for stores)
** Peak bandwidth to local main memory is 51 GB/s per processor (38 GB/s sustained). Global interconnect main memory bandwidth is 102 GB/s per four-processor/memory node board.
** I/O bandwidth is 4.8 GB/s per 4-processor node board and up to 75 GB/s per cabinet. Up to one I/O channel per processor. Each I/O channel is 1.2 GB/s full duplex, and is globally accessible by all processors in the machine.
** The latency to global memory is in the microsecond range in the largest configurations. Typical latency across a 512-processor system (128 nodes) is around one microsecond.
* U.S. list pricing begins at $2.5 million

SOFTWARE HIGHLIGHTS

The operating system is called UNICOS/mp. It is based on System V Unix with all of the POSIX commands, utilities, and interfaces (IEEE Std 1003.1 and IEEE Std 1003.2). It is an updated version of the Cray T3E UNICOS/mk microkernel operating system concept. Software features include support for both shared memory and distributed memory programming models and environments, i.e. C, C++, Fortran, SHMEM, Co-array Fortran, UPC, Parallel C, and then OpenMP in 2003.

Additional software features:

* Checkpoint/restart
* Political scheduling mechanism
* Global resource manager
* Accounting and resource limits
* OpenSSH and OpenSSL
* PBS Pro batch processing
* Partitioning
* TotalView debugger

CRAY X1 PRIMARY TARGET MARKETS

* Military and defense customers, including national laboratories and ASCI
* Automotive and aerospace customers
* Academic researchers competing with other institutions
* Researchers in new areas of science, e.g.
bio/life sciences
* Researchers in traditional scientific areas that are very computationally demanding, e.g. weather and climate modeling

IDC ANALYSIS

SUPERCOMPUTERS VS. SUPERCOMPUTING

Technical computing is often described with two similar but somewhat different terms: supercomputing and supercomputer. Supercomputing is the activity of running computer simulation and analysis in support of scientific research and engineering design activities. Supercomputers (in the purest sense of the term) are systems specifically designed to support supercomputing activities. The difference between these terms is important because, over the last decade, improvements in the absolute performance of merchant or commodity processors have allowed users to move a growing number of applications from supercomputers to systems designed to address the needs of broad commercial and technical markets. Put another way, it is not always necessary to buy a supercomputer in order to do supercomputing.

The current generation of broad-market technical servers is built on a strategy that allows vendors to leverage advances in high volume component technology, and to leverage their R&D investments across both commercial and technical markets. These leveraged products generally provide a low purchase price and are easy to use. However, they are not designed to provide the highest levels of performance for next generation technical problems. The effect of the movement of many user applications to leveraged products is that all large HPC vendors have decided not to pursue development of lower volume specialty supercomputers.

DOES ONE SIZE FIT ALL?

Leveraged product strategies have proved effective for the broader technical server market based on a combination of lower cost, strong price/performance, and high scalability computer designs.
That said, there is a set of customers that require computers designed to address very demanding next generation problems and are not willing or able to wait for Moore’s Law to provide solutions based on leveraged products. IDC surveys of users over the last two years have shown a growing dissatisfaction with the performance capabilities of the currently available capability systems. However, users have also reported strong economic pressure to purchase lower cost computers.

In Japan, the Earth Simulator has brought attention to this unfulfilled need. In March 2002, the Japanese Earth Simulator supercomputer system was accepted and began full operation. Its sustained performance is 5 to 10 times that of the largest installed supercomputers in the world. It has kindled interest within the U.S. in fielding systems in the same class as this impressive Japanese achievement.

HIGH-END APPLICATIONS

Many large technical computing problems pose a great challenge for leveraged systems because they require both fast processors and the ability to manage large data sets with frequent updating. Systems that can effectively address this class of problems must provide high memory and communications bandwidths (tens of GB/s per processor) with low latency (e.g. only a few microseconds across larger configurations) across a sizable number of processors. Total system bandwidth is important to such applications, which require strong memory performance from all processors to main memory. Thus high single processor bandwidth and large processor caches can help but are not adequate to address many large problems.

Table 1 presents applications in different industry sectors that can make use of next generation (or of the next several generations of) computing power.
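The bandwidth requirement described above can be related to the Cray X1 figures quoted in the highlights. The following is a minimal Python sketch, not an IDC or Cray calculation: it divides the quoted per-processor bandwidth by the quoted peak rate to obtain a bytes-per-flop ratio. The function name `bytes_per_flop`, and the choice of this ratio as a measure of balance, are assumptions made for illustration.

```python
def bytes_per_flop(bandwidth_gb_s: float, peak_gflops: float) -> float:
    """Return bandwidth available per floating-point operation (bytes/flop)."""
    if peak_gflops <= 0:
        raise ValueError("peak_gflops must be positive")
    # GB/s divided by Gflop/s leaves bytes per flop.
    return bandwidth_gb_s / peak_gflops


# Figures quoted in this article for one Cray X1 processor (MSP).
PEAK_GFLOPS = 12.8           # 64-bit peak
LOCAL_PEAK_GB_S = 51.0       # peak bandwidth to local main memory
LOCAL_SUSTAINED_GB_S = 38.0  # sustained bandwidth to local main memory
CACHE_GB_S = 76.0            # processor bandwidth to cache

peak_balance = bytes_per_flop(LOCAL_PEAK_GB_S, PEAK_GFLOPS)
sustained_balance = bytes_per_flop(LOCAL_SUSTAINED_GB_S, PEAK_GFLOPS)
cache_balance = bytes_per_flop(CACHE_GB_S, PEAK_GFLOPS)

print(f"local memory (peak):      {peak_balance:.2f} bytes/flop")
print(f"local memory (sustained): {sustained_balance:.2f} bytes/flop")
print(f"cache:                    {cache_balance:.2f} bytes/flop")
```

With the quoted figures this comes to roughly 4 bytes per flop at peak local memory bandwidth and roughly 3 at the sustained rate.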
TABLE 1: HPC APPLICATION AREAS WITH NEXT GENERATION REQUIREMENTS

Sector | Applications
Automotive | Full fidelity crash; Integration of design, engineering, manufacturing and test; Full durability (150,000 miles, no failures)
Health Sciences | Genome and physiological modeling; Vascular simulation; Virtual surgical planning; Cell membrane modeling
Chemicals | Developing new biological compounds; Process modeling
Petroleum | Seismic and reservoir visualization; Improved substrate modeling; Forward analysis of oil-bearing structures
Defense and national security applications | National laboratories; ASCI
Weather forecasting and climate modeling | Simulation of larger areas with integration of earth, air and water models; Climate modeling

Source: IDC, 2002

BUSINESS MODELS FOR SUPERCOMPUTERS: DESIGN TRADE-OFFS AND APPROACHES TO DELIVER MULTIPLE TFLOPS

The technology for technical servers has evolved dramatically over the last decade. There are now four types of capability computer architectures installed at customer sites that deliver more than a Tflops of performance:

* Scalable SMPs from HP, IBM and SGI
* MPPs from HP, IBM, Intel and Cray Inc. (formerly Cray Research)
* Vector supercomputers from NEC
* PC clusters from a number of manufacturers (HP, IBM, Linux Networx, etc.)

CRAY PRODUCT DESIGN PHILOSOPHY

In designing supercomputers one has to make a number of trade-offs. All of the larger HPC vendors have settled on a common design approach using high volume microprocessors and leveraging commercial computer design investments. In the past, designs often flowed from the leading edge scientific computers to the high volume commercial computers. Today the larger vendors are taking the opposite approach of applying commercial computer technologies to the scientific market.
This provides the benefit of leveraging development costs across a much larger volume of sales and produces computers that have a lower production cost. It is a leveraged business model that works.

But there is a portion of the HPC market that desires computers designed to deliver the highest performance: performance that is significantly higher than that available in higher volume computers. Achieving a higher performance level requires expensive components in the memory sub-system, in the I/O, in the software, as well as in the processors. It also requires the ability to scale to larger processor counts and still have high speed, high performance interconnections between all of the processors, all of the memory sections and all of the I/O devices.

Computer designers have to make many trade-offs when designing a computer. An old saying is that one can select any two of the following attributes, but never all three:

1. Higher performance
2. Lower cost
3. Ease-of-use

Vector supercomputers were first designed with high performance as their primary design goal. As software improved and became more standard in the 1980s, vector computers were designed to have both performance and ease-of-use, but they were never low cost. MPPs were designed around two attributes: a) higher performance; and b) lower cost (by using standard microprocessors), but were never easy to use. The current generation of broad market technical servers was designed to be low cost and broadly applicable, but can be difficult to use in larger configurations and does not provide the highest levels of performance on certain larger applications.

The Cray X1 is designed to provide higher performance at both small and larger processor counts. This means that its price will be higher than that of standard computers and its ease-of-use will be lower, e.g., confined to a smaller set of third party applications. Its price/delivered performance on larger problems should be strong.
In order for the Cray X1 to provide both the historic high bandwidth of vector supercomputers and the efficient scaling of MPPs, Cray had to focus the design on a single primary attribute: higher performance.

BUSINESS MODEL FOR CUSTOM SUPERCOMPUTERS

The custom supercomputer business model requires support from a number of close partnerships with customers: customers that require more than what is available in the COTS world and who are willing to pay for the additional capabilities to gain a competitive advantage. The Japanese government has followed a similar business model with NEC and the Earth Simulator. Given the relatively small size of this market segment (compared to the entire market for technical and commercial systems), the business model no longer makes sense for the larger HPC manufacturers. Cray, however, with partial government R&D funding support, may be able to make the business model work in this segment.

The Cray X1 is targeted at larger problems that need to scale beyond the size of a typical single SMP node. Cray needs to convince both its traditional customers and new prospects that the higher performance is worth the cost of converting their software programs to the X1. Early indications are that Cray is achieving some success, spurred in part by the desire for a high-bandwidth U.S.-built system with the potential to match or exceed the performance of Japan’s Earth Simulator.

CAN SUPERCOMPUTERS CREATE A RENAISSANCE?

At a fundamental level, a number of questions arise about the ability of supercomputers to compete with Moore’s Law in the broader market. The Moore’s Law side of the argument is: 1) the historic ability to double peak processor performance every 18 months, combined with 2) the historic trend for technical applications to move “down market” as the absolute power of RISC processors has increased.
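The doubling argument can be made concrete with a small calculation. The sketch below is illustrative only: it assumes performance doubles exactly every 18 months, as the argument above states, and the function name `required_speedup` is hypothetical. It shows how far commodity performance would move ahead over a given number of months, which is also the speedup a specialized design would need to deliver to hold its position after a schedule slip of that length.

```python
DOUBLING_PERIOD_MONTHS = 18.0  # doubling interval quoted in the text


def required_speedup(months: float,
                     doubling_months: float = DOUBLING_PERIOD_MONTHS) -> float:
    """Growth factor of commodity peak performance over `months`.

    Assumes idealized exponential growth with a fixed doubling period.
    """
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)


for months in (6, 12, 18, 36):
    factor = required_speedup(months)
    print(f"{months:2d} months: {factor:.2f}x")
```

Under this assumption an 18-month interval corresponds to a factor of exactly two, and a 36-month interval to a factor of four.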
That said, there are several factors that would support a re-emergence of specialized supercomputers into the market:

* Increasing dependence on parallelism – Despite the growth in single processor performance provided by Moore’s Law, technical users have continued to work to parallelize their codes and implement ever larger system configurations. This investment in parallelism implies that many technical applications demand additional power more quickly than it can be provided by the expected advances in single processor technology. Parallel computing strategies have been limited by applications fit, programming costs, and system interconnect performance. For applications that can only use low levels of parallelism (i.e. 4 to 16 way), the technology can be viewed as a performance kicker rather than an ultimate performance solution.
* Requirements for memory interconnect speed – Although the performance of both industry standard processors and interconnects increases non-linearly, processor performance is on a faster growth path, with interconnect speed falling behind. Over time this will force either greater levels of parallelization or greater emphasis on specialized interconnect technology.
* Market forces – The investment in processor technology that has driven Moore’s Law has in turn been driven by opportunities to sell each new generation of products into seemingly inexhaustible volume server and desktop markets. However, with each turn of the Moore’s Law crank, a larger segment of end-user applications cannot keep up – many applications simply cannot make full use of a 1 GHz processor, let alone the current 2 GHz, or future 4 GHz and 8 GHz products. Thus Moore’s Law may be slowed not so much by technological limits as by market forces.
Currently, Moore’s-Law-based product strategies have met the requirements of the majority of technical users, with the movement to clusters indicating that a significant part of the market is placing even greater reliance on commodity processor road maps. However, the factors mentioned above could, over time, lead a portion of technical computing application requirements to outpace Moore’s Law and thus lead to requirements for systems specifically designed to address scientific and engineering environments.

CRAY CHALLENGES AND OPPORTUNITIES

The key issues for Cray to address include:

* Positioning the X1 as providing special capabilities beyond what is available in the COTS world – The Cray X1 is designed to provide higher performance at both small and larger processor counts. This means that the system price will be higher than that of standard computers and its ease-of-use will be lower, e.g., confined to a smaller set of third party applications. Thus the company must position the system as providing capabilities not generally available in the rest of the market.
* Meeting competitive market timing requirements – Time-to-market is everything in defining a competitive technical computer. Cray needs to bring the X1 to market as quickly as possible and provide major enhancements and speed-ups within 1 to 2 years.
** IDC’s experience shows that a competitive HPC product design that hits the market 6 months ahead of schedule will be a major financial success; if it is 1 year later than planned it will have limited financial viability; if it is 18 months late to market it will be irrelevant.
** Moore’s Law implies that if a design is 18 months late to market it needs to be twice as fast as the original plan to be competitive.
* Capturing the traditional capability market – Making the business model work requires both healthy sales and continued government R&D funding support.
Acceptance by customers in aerospace, automotive, weather, academia, and national research labs around the world will be key to generating the required sales volumes. A handful of very large wins (ASCI class sales of $50 to $100 million each), however, could also be enough to make the business model work.
* Expanding the X1 applications library – Ease-of-use and third party software will determine how many additional customers will decide to acquire custom Cray computer designs. With limited ISV applications, the X1 will often be acquired as a specialty computer for one or a few mission critical jobs. Additional software will be required to make the X1 a general-purpose supercomputer. Cray says it plans to do this over time.
* Delivering a completely new OS is a major undertaking – The new ISA, new OS, new environments, etc. will take some time to work out the bugs and build reliability. Applications will require time and assistance from Cray for retuning and recompiling.
* Generating growth in emerging markets – Cray must demonstrate the value of high bandwidth supercomputer solutions to new prospects in higher growth market segments such as the bio/life sciences.

CONCLUSIONS

The immediate future for the Cray X1 looks good, given the announced advance orders and the five early installations. A number of supercomputer buyers have expressed interest in the Cray X1, which means that if Cray can execute, it can regain some of the high-end market that its predecessor firm, Cray Research, lost over the past decade.

MARKET OPPORTUNITY

IDC estimates the 2001 market for both Capability and Enterprise systems at $1,865 million. We expect this market to grow at a 9.9% CAGR from 2001 to 2006 to reach $2,991 million by the end of the forecast period. The Cray X1 product competes in both of these market segments. Cray can grow 10% a year by simply riding this market growth curve. However, we believe that much of the growth in this market will be driven by the U.S.
national security sector, which has traditionally required U.S.-based products and supported innovative supercomputer designs. Thus, to the extent that Cray can capture U.S. government business, it may be able to outperform worldwide market growth rates. In addition, Cray Inc. held a 2.7% share of the combined market in 2001. If the Cray X1 series is successful, the company can see further growth by capturing additional share in the market.

RELATED RESEARCH

IDC technical server research is available at: www.idc.com and at: www.idc.com/hpc

SOURCE: IDC
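The forecast in the Market Opportunity section can be cross-checked with the standard compound annual growth rate (CAGR) formula. The sketch below is a reader's check, not IDC's forecasting method; the function names `project` and `cagr` are hypothetical, and the only inputs are the figures quoted in the text ($1,865 million in 2001, a 9.9% CAGR, and $2,991 million in 2006).

```python
def project(start_value: float, rate: float, years: int) -> float:
    """Compound `start_value` at annual `rate` for `years` years."""
    return start_value * (1.0 + rate) ** years


def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start and end value."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


MARKET_2001_MUSD = 1865.0  # quoted 2001 market size, $ millions
MARKET_2006_MUSD = 2991.0  # quoted 2006 forecast, $ millions
QUOTED_CAGR = 0.099        # quoted growth rate
YEARS = 5                  # 2001 to 2006

projected_2006 = project(MARKET_2001_MUSD, QUOTED_CAGR, YEARS)
implied_rate = cagr(MARKET_2001_MUSD, MARKET_2006_MUSD, YEARS)

print(f"projected 2006 market: ${projected_2006:,.0f} million")
print(f"implied CAGR:          {implied_rate:.2%}")
```

With the quoted inputs the projection comes to roughly $2,990 million, within rounding of the $2,991 million stated in the text.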