Sun's Steve Campbell Speaks On The Sun Fire Link

By Chris O'Neal
At SC2002, Sun Microsystems launched a new high-performance cluster interconnect for the Sun Fire 6800, Sun Fire 12K and Sun Fire 15K. By connecting up to eight servers with its advanced optical communications technology, Sun Fire Link delivers scalable cluster performance with incredibly low latency. To learn more, Supercomputing Online interviewed Steve Campbell, Vice President Marketing, Enterprise System Products.
SCO: At SC2002, Sun is introducing a new high-performance cluster interconnect, called Sun Fire Link. Please share the highlights with us.
Campbell: There are a couple of highlights. One is the actual technology itself and the other is how you apply that technology. Sun Fire Link is an extension of the system’s backplane, which results in high bandwidth with extremely low latency. Our bandwidth is 4.8 gigabytes per second. It probably aggregates and delivers about 2.8 gigabytes per second in a sustained mode. Obviously at times you will get bursts. But sustaining at 2.8 is what we have seen in many applications. That high-speed capability enables us to link up to 8 Sun Fire servers. These can be 6800s, 12Ks, 15Ks or a combination of these. So when you take eight 15Ks, you have a total of 800 processors. This technology enables that kind of scalability on up to eight nodes and a peak performance approaching 2 TFLOPS. So you’re delivering a substantial amount of performance with that kind of configuration. It is breakthrough technology in optical interconnects. A key is to recognize the kind of applications that can utilize this. You have the physical hardware layers of this super cluster. Then you bring in tools like HPC ClusterTools, which provide the MPI capabilities, and now I can start to apply all of this to solving a single problem with substantial compute power. Couple that with storage and SANs and you have very large processing capabilities and large data management applications.
It’s a technology that we've installed at a number of sites, so it’s tried and proven. People are using it today. That’s typical of what we do at Sun. Before we do a general launch like this, we make sure our products go through beta tests and pass all the various quality checks. So we have customers who are using this in their environments. And they are very happy with it. It’s not just an interconnect and it's not just high-performance; it also has levels of redundancy. So if there are failures in the link, the system can continue to operate. For example, at the show we are able to demonstrate unplugging physical links. Unplugging a physical link would normally imply that the system is down and the connectivity is down, but you still have parallel links and you can still operate. So you don’t bring the complex down and stop. You still have application availability. We provide that application availability through triple redundancy levels on the hardware side. That is the key to this technology. It’s not only fast, it’s not only an extension of the backplane, it has not only got software that works with it, but it also has redundancy built into it as well. That matches the whole philosophy of Sun, particularly for the high-end, mission critical systems. It’s all about application availability. When you have a large computing complex solving problems, you have applications that can run for many days. And we can actually provide the availability for that. There’s no single point of failure on the server and no single point of failure in the interconnects. So it’s high availability, a high level of redundancy, a high level of performance and breakthrough technology. It’s a ‘wow’ product.
SCO: What is Galaxy?
Campbell: Galaxy is a customer program beyond that. This kind of interconnect qualifies customers to be in the TOP500.
The Galaxy class is basically the class of customers that use interconnects to solve large problems, like the Cambridge High Performance Computing Facility. They are able to move into the Galaxy class and will become a Sun Center of Excellence. Customers will have the combination of basic hardware, software and interconnects. What can we do with it? We make it a lot easier for people to actually do business and move into the top echelon of supercomputing and super clusters. It’s easy to embrace, easy to understand and it gives users recognition for having one of the largest computing complexes available. It’s all commercial off the shelf technology. It’s not proprietary architectures. It’s all open source focused. A lot of people have built proprietary supercomputers and it’s very expensive, whereas the industry wants to capitalize on commercial off the shelf technology, and this is all commercial off the shelf technology.
SCO: In today's challenging economic environment, companies are taking a hard look at their bottom line and growth strategies. StarKitty and StarCat systems offer an increased ROI. Please describe the main factors that contribute to this.
Campbell: People want to get more for less. They want to get better utilization of the systems they have. And there are fundamental technologies in StarCat and StarKitty that allow them to get better utilization. So let’s take a 72-CPU StarCat for example. How can I use those 72 CPUs? Well, one choice is to use them as a single domain, a single complex. For certain applications that scale well, that’s a very good solution. However, most people actually want to have those systems split into domains or partitions. The average number of partitions on a system is probably about four. So let’s say I take those 72 CPUs and split them into unequal chunks. In each of those I can run Solaris and I can run applications. So I’ll eventually get more utilization of those systems.
Then, I add other resources like Solaris Resource Manager (SRM) that will manage the stacked applications within that domain. So a domain is not running a single application. A domain has multiple applications. SRM manages various resources within that partition for those applications. Solaris Containers is Sun's next advance in server virtualization. Sun plans to roll out components of Solaris Containers in phases. The release of Solaris 9 Resource Manager marks the first step. It provides the framework for Solaris Containers. These provide another level of utilization in the domain. In addition to that, my domains are dynamic, in the sense that I can resize and reconfigure the resources within my domain based upon the application needs and requirements. I can add additional CPUs by doing a dynamic reconfiguration or I can add additional IO resources. Or I can shuffle things around and do what I want to do. There is extreme flexibility within those domains. I also have high availability and security. And my domains are 100% fault isolated. So I can have development in one domain and production in three domains, for example. As I develop code, the chances are I am going to have bugs in the code. Who knows what kind of effect the bugs will have on the systems? Because the domains are fault isolated, the bugs have no impact on my production work at all. So I can run production and development on the same system. I can resize my partitions and I can do a lot of things with those to get that computing complex up. So those are all methods, and there are many other methods, but those are the main ones by which I provide better utilization of the system. I provide virtualization of it through Solaris Containers. I also provide availability because if I want to do maintenance or upgrades, I don’t have to bring the whole compute complex down. Basically, I do a dynamic reconfiguration.
I pull out the board, or whatever needs to be repaired or changed, or add additional resources, and then another operation brings it back online. At the same time, my other domains are all running applications. I can do online maintenance and online upgrades, all geared around availability. So I don’t have to bring it down. I’ve got multiple redundancy everywhere through the system. In the case of hardware failures, I have no single point of failure.
SCO: How can Sun drive down the total cost of ownership?
Campbell: What people are looking for right now is that, as requirements go up, they want to do more with less. That means I need to provide a better quality of service to my users for a lower cost. And that’s what we do with our Sun Fire family of servers, with all the things like the maintenance and availability features we talked about. The other way of doing that is basically providing customers with a flexible system acquisition model. In other words, I’m not sure what my performance requirements are today. It could be 20 CPUs or it might be 40 CPUs or something in between. So what do you do there? Well, we just brought out Capacity On Demand. That’s all geared around lowering the cost of acquisition. For example, I may not be sure how many CPUs I need to handle my workload, or I may be in a marketplace that’s growing. Customers' needs grow and change. So people can acquire a 24 CPU system and license only 12 of those CPUs, which means 12 are unavailable. As my demands grow and my throughput grows, I can do a simple Web transaction, get a license code that enables those CPUs, and turn them on. So that’s lowering the acquisition costs. Built-in expansion is right there in the box. And we did this with a customer with the Enterprise 10,000 some three years ago. About 20% to 25% of our systems were being shipped with Capacity On Demand. So we know it’s successful, we know how to make it work, and we do it in a very non-invasive fashion.
Things like utility computing are ways of trying to do this. But utility computing requires a high degree of monitoring. It also has issues of who owns the asset. So it’s very complex. Talk about software licensing problems: utility computing represents a huge licensing issue for the software vendors. They’re just not geared for that. Capacity On Demand for us is simple, easy to use, fast and responsive. When the demand is there, you crank it up, get a license and turn on the CPU. There’s no cost disadvantage whatsoever. The cost of a license is the same as if you bought it. When you go to the utility computing model, there is a cost disadvantage for the customer. He pays more for the compute power. The reason for that is he doesn’t actually own the compute power. When you do Capacity On Demand with Sun, you are actually buying that asset and so you own the asset.
SCO: Partitioning is the key to the current consolidation trend among corporate customers. Please describe the benefits of Sun’s technology.
Campbell: Partitioning is when you take a CPU complex and split it into partitions so that you can do consolidation. Consolidation means different things to different people, so let me set the framework. To me, server consolidation means you are taking a lot of servers and consolidating them onto a single server. In the case of the Sun Fire servers, I’m able to consolidate a lot of smaller servers into partitions on a single server. So for example, say I acquire a 12K and get 4 domains. I can take 100 Wintel class servers running some application and consolidate those into a single domain, and I can take another 100 running some other application as well. So I can start to consolidate many different servers with many different workloads and many different applications onto my domains with a single footprint. So my footprint is smaller, my power consumption is going to be less, and my software license fee is going to be less.
And I have huge availability all of a sudden and huge performance throughput. So I benefit in every dimension. And the dynamic system domains, the partitioning on our products, make that all very possible. It’s very successful. Now we also actually have mainframe consolidation. We are migrating away from legacy mainframe systems and consolidating onto a Sun Fire. So we go both from the high end down and from the small end up. And we provide very good TCO and price performance for all of those configurations, whether you’re coming down with legacy code or moving across from a Wintel based or some other UNIX based platform. It could even be Sun Solaris systems. We’ve done a number of consolidations taking many Sun Solaris based systems and consolidating them onto a single 12K. So we’re able to go into the marketplace and consolidate Wintel systems, HP or IBM systems or IBM mainframes. Partitioning makes that whole consolidation very easy to understand. So it all works very, very well.
SCO: A key strength Sun has is its strong software base. It’s the server of choice for software companies. Please describe Sun’s services and software plans.
Campbell: That is very true. We’ve got 12,000 to 13,000 ISV software codes running on the Solaris platform. That has come about over a number of years of working with the development community. What the development community likes about Sun is the fact that Sun’s products go from about $1000 to multiple millions of dollars. It all runs the SPARC instruction set and it all runs Solaris. It’s all basically the same kind of architecture, SMP class architecture. Let’s say I’m a software company and I want to be with IBM or HP. In the case of IBM, for the same price band, I have Intel platforms running various forms of Windows, PowerPC platforms at different price points running AIX, the AS/400 platform, and then at the higher end I have the z900. So I have 4 instruction sets at a minimum. Then I also have 4 operating systems, and actually I've got Linux as well.
Then I realize the Linux I run on my iSeries is not the same Linux I run on my pSeries. Now I have 6 operating systems. Before you know it, in the same price point, I have at least 8 to 9 operating systems. You've got at least 4, possibly 5 instruction sets. When you look at HP’s product line, it’s every bit as confusing. I’ve got Intel down in the low space; I’ve got IA64 somewhere else. I've got Alpha somewhere else. I've got PA-RISC somewhere else. I’ve got Tru64. I’ve got HP-UX, Linux, and Windows. So again, you've got 4 or 5 instruction sets. There’s no uniformity in that architecture. Sun’s success with these companies is because we have the uniformity. They can develop the software once, develop for Solaris, develop for SPARC, and it’s going to run well on all the platforms. It’s huge investment protection. It’s a huge advantage to software companies. It’s a single level of effort. Support becomes a piece of cake because you’re just supporting the Sun/SPARC software. Support is where you spend most of your resources as an ISV. Writing code is one thing, and you have to test the code. If you have 4 or 5 different ports, then you have 4 or 5 different versions to actually support as well. That’s why I think Sun/SPARC is a very attractive proposition to ISVs and to large corporations that have deployments at multiple price points. Look at tiered architectures of tier 1, tier 2 and tier 3. At tier 1 you’re on the edge of the network, and it’s SPARC. At tier 2 you’re in the middle of the network, and it’s SPARC. And tier 3 is SPARC. Again you can have tremendous flexibility in those mission critical centers. So that’s a fundamentally strong program to have. It really is. The competition will tell you that Linux will provide the same kind of thing. But Linux is not in the 24/7, mission critical environments. It has a long way to go yet. And guess what: even then I’ve got different Linux sitting on my different boxes.
I’ve got a different Linux on my IBM iSeries than I have on my pSeries. And I have a different acquisition model, because sometimes I don’t actually buy it from IBM. I have to buy it from somewhere else. Who am I buying from? That destroys the aspect of trying to lower the cost of acquisition. So we’ve been very successful with our strategy. Fundamentally, our strategy is sound. Our ISV solution is very, very strong. And we have the i-Force community, which is 40 or so centers around the world where we can bring in partners. They can get their application solutions working and it’s demonstrable. They can look at it. So it’s a proven concept. Our whole approach to ISVs is very successful and a very strong program.
SCO: So, these are exciting times for HPTC at Sun. Sun has dramatically increased its presence on the Top 500 list at the expense of IBM and HP. Sun had 37 systems in the June 2002 list, but wider adoption of the StarCat has given Sun 88 places on the November 2002 list. Please describe the main factors that contribute to this.
Campbell: Yes, we’ve gone from 37 to 88 places in the Top 500, which is an indication of the product acceptance in the marketplace. These are big systems. So we’ve been successful and it’s an indication of our continued investment in this marketplace. It’s very important that we are investing in high performance technical computing applications. We are investing from an engineering point of view (POV), from a marketing POV, from a distribution channel POV and from a partnering POV. Now, proof points of those: from a product POV, the Sun Fire Link is a very good example. When you go to the booth and see the graphics and visualization, that’s another good example. We are demonstrating our next-generation MAJC-based, high-end graphics clustering technology. Our demo will be running throughout the show on our big screen. This will be a full fledged product early next year.
HPC ClusterTools is another good example from an investment POV. So we’re investing in our R&D. We’re investing in distribution through the i-Force centers and through our sales force. We’ve got the HPC ACES program for training and there’s a room here that’s full of people from all over the world. They specialize in the high performance computing market for us. And I’m giving a brief talk to them together with partners later today. So we are sharing training, product training and demonstrations. When you go to our booth, you will see grid software partners and demonstrations on crash analysis. So our partnerships are very strong. We are investing in distribution channels and partnerships. And we’re investing in marketing programs like this show. We are investing our marketing dollars. We have moved the budget around. We are fiscally responsible. And we’re dedicated to this marketplace. This supports our partners and customers in making decisions to buy Sun. So it’s a big investment for us. It’s a big market for us. We have grown from a 5% market share in 1996 to a 21% market share last year. We’ve grown. Of course, HP and Compaq are now one company so they aggregate their market share. But when you see the lack of continued investment in the Alpha chip, you have to question HP’s commitment to this market. So I would question their investment in this market. Their focus right now is outside of this market area. If they are going to be successful, it’s not going to be in this market.
SCO: Let’s spend a moment discussing chip and server designs. Please describe the advantages of your company’s own UltraSPARC processor. Let’s conclude with your vision of the future of supercomputing.
Campbell: There are several things there. People ask this kind of question many times in different ways. How do we continue to support the SPARC architecture? Everybody else seems to be moving to some industry standard type architecture.
We license our architecture, so our architecture is not proprietary in that sense. We will license SPARC technology. Fujitsu is an example of a company doing it. And they are also running Solaris. And there’s a desktop company that has a SPARC based laptop. The fundamental difference between us and Intel is that they have both chip design and chip fabrication in their own foundry. Sun is the world’s 2nd largest chip design company. Of course, Intel is first. However, the difference is that we have not made the investment in a foundry. A chip foundry is a multi-billion dollar investment. So we opted to work with companies like Texas Instruments. TI is the foundry that actually builds our SPARC chip. TI is in the chip business. They are in it for cell phones and that’s a huge market. And you have to stay on the bleeding edge of foundries. So they make the investment in that kind of technology and we are able to utilize those TI efforts. We can still do the chip design, but we don’t have to make the huge cash investment. Intel has to build a factory, keep it modernized and make a huge investment to be able to keep pace with chip foundry requirements. We have a big development team. We have a big investment in Sun systems that do all the process simulations. And we work with TI, who builds the chip for us. So actually we get the best of both worlds. We ride the foundry technology curve based upon TI’s investment, which keeps them competitive in their business. And that’s a really good choice to make. Now SPARC, as a chip, is a very, very good chip. There are some innovative things we’ve done in the past, and there will be innovative things we will bring to market going forward. So we are every bit as good as, if not better than, many chips in the marketplace. Look at HP and Compaq, for example. Compaq decided to discontinue the highly regarded Alpha chip and go with Intel instead. And HP announced they were going to tank the PA architecture and go with Intel’s IA64.
So in those days, you had two companies trying to convince Intel what to build. Now you have one company that is totally dependent upon Intel’s chip designers and Intel’s foundry to produce the base commodity for their product. So if Intel makes another mistake, then the whole thing is going to be weak from a server point of view. If Intel decides it wants to have a little widget put in there, they do it because they are a volume based business. If they want some changes for the volume marketplace, then HP is out in the cold. So all of a sudden, HP has no chip design and no foundry, which means they have zero control over the next-generation chips that go into their server line. All of a sudden, what they thought was an open industry standard is not anymore, because the only people using it are HP. So this so-called open chip has now become more proprietary than what they thought SPARC was. SPARC is not proprietary, it’s open. You and I, if we had enough money, could go license SPARC and Solaris. That’s available and there are examples of that in the marketplace. So all of a sudden, the world has changed dramatically with this acquisition. Our methodology has proven in the long term to be extremely successful and very competitive. We are on the bleeding edge of semiconductor design. At the same time, we haven’t had to make a big investment in a foundry. So I truly believe we have the best of both worlds. We have absolute control over the SPARC architecture, and designing to maintain binary compatibility going forward is absolutely critical. That protects the whole investment in software that companies have made as we go to future generations. So we have the world’s number two team designing chips. And we are able to use the world’s number one foundry.
When you look back 10 plus years, supercomputing to many people was Cray’s big vector machines.
Then people said, 'that market is going out of business because of the MPPs coming along.' But you know what, it didn’t. The reason it didn’t is software. The software to make the MPPs work couldn’t be done. Now, however, there’s only one company with a big vector machine, NEC, because general purpose chips have gotten so fast. So when you look at UltraSPARC III today, the reason we’re successful in the marketplace is that our off-the-shelf technology is as good as or better than what a vector machine could have done a long time ago. And there are only a few applications that can really use a vector machine at the end of the day as well. You can’t just pick your code up, stick it in a vector machine and watch it run. It doesn’t do that. There’s a lot of effort. So the market has changed. Software and technology have gotten better. Chip technology has gotten better. Reliability has gotten better. Reliability in today’s 15K vs. yesterday's supercomputer is just orders of magnitude better. The machines of yesteryear, though they were very fast, would crash a lot. Going forward, I think we will continue to see the use of superclustering techniques like the Sun Fire Link. That concept is sound. I think you’re going to see continued investment in innovation at the software level. More importantly, I think we are going to see more commercial companies become a lot more public about it. A lot of people think HPC is just the academic community. And we all know that’s not the case. Typically commercial companies use this HPC technology to design their next generation of products. So they are driven by time to market and by the application software that’s available to help them, whether it’s a pharmaceutical company, an automotive company or an aircraft design company using this technology to gain a competitive advantage.
If they can get to market with the next generation of widget 6 months sooner than their competition, then there are huge profits to be made. So we are starting to see those kinds of things happen now and we are delivering on that vision. Doing more with less, which applies to companies in general right now, is equally applicable in the HPC market. Lower cost of acquisition, reliability, system performance and throughput, not just a focus on the chip: these are all the more important things going forward.
Supercomputing Online wishes to thank Steve Campbell for his time and insights.