Advancements in clusters improve engineering, electronic cooling simulations

By Chris O'Neal -- In June, Appro International held a seminar on its high performance computing (HPC) cluster solutions for electronic design automation (EDA) applications. This article reports the key areas of discussion from the event. HPC is experiencing fast growth although it still presents many challenges, such as data center requirements. First, this article covers recent HPC market growth and trends, as well as EDA growth and trends. Secondly, it discusses HPC with regard to simulation-based design for the electronics industry. Key sectors and business drivers are described, along with solutions for mechanical, electromagnetic, thermal and computational fluid dynamics (CFD) challenges. Thirdly, it presents scalable and reliable HPC solutions currently available, including Appro’s supercomputing and blade clusters using multi-core technologies. Lastly, it discusses the new quad-core AMD Opteron processors and how HPC customers can increase productivity and performance in the same thermal envelope as previous AMD Opteron processors.

Trends, Customer Directions and Growth Areas in High Performance Computing

The high-performance computing industry has grown faster than any other IT sector in the last four years, and the cluster portion of the market is growing even faster. “High-performance computing” refers to all computers that are used for highly computational or data-intensive tasks. IDC now uses this term to include all technical servers used by scientists, engineers and financial analysts. HPC excludes commercial servers used for business and transaction processing, which account for about 80% of all servers.

HPC Market Growth and Trends

The server market has finally rebounded from the 2001-2002 dot-com crash and has grown to $52 billion a year at a rate of about 4% per year. The technical computing portion of that market has grown to $10 billion with a 21% normal growth rate over the last four years.
The primary driving force of that growth rate has been the demand for clusters and their price performance. Organizations have been buying more processors and filling up their data facilities. Budgets have grown at a very dramatic rate because of the demand for scientific computing. This has created new challenges for the data center. Whether they have a fixed budget or a growing budget, organizations can now buy a dramatically larger number of processors, as well as more equipment and servers. The new challenges are cooling the servers, supplying power to them and finding floor space for them. Software will be the main roadblock and therefore will provide major opportunities at all levels: compilers, applications, middleware, system management and so on. According to IDC, the HPC market now accounts for nearly 20% of all servers, and there is speculation that this share will continue to grow.

Blades Compared to Racks

In 2006 sales of blades grew to $2.6 billion, and this growth is expected to continue due to the efficient use of floor space. The main growth factors include the following:
  • Efficient use of space
  • Lower power consumption
  • Easier system management
  • Straightforward form factor for consolidation and virtualization

However, most users prefer 1U/2U servers when they require more memory or cache per CPU or more performance per CPU. Earlier blades lagged in technology and were too generic. Now blades are more targeted at HPC and commercial computing. Also, some of the early blade offerings from Appro’s competitors had cooling issues. Clustering and standardization have reset price and performance in the industry. A typical data center now has more nodes and CPUs. Challenges in space, power consumption and system management have led to system consolidation and virtualization. Blades optimize for environmental factors, while 1U/2U servers offer richer configurations.

Electronic Design Automation Industry Trends

EDA has shown strong growth over the last five years and may double by 2010. It will be a segment worth over a billion dollars per year in server sales alone. Multi-core processors may cause the next big disruption in the marketplace, specifically in terms of how users can move their applications to multi-core processors and take advantage of them. Computing is entering a new phase in which CPU improvements are driven by the addition of multiple cores on a single chip, rather than by higher frequencies. For HPC applications, this introduces an additional layer of complexity, which will force users to change how they address parallelism or risk being left behind. Changes in the industry will include the following:

  • Commodity-based individual processor speed will be relatively flat.
  • Memory bandwidth to a single processor core will decline at a fast rate.
  • Individual processor costs will drop so quickly that in the near future the world can view processors as a nearly free commodity.

A new way of looking at and working with parallelism will be required to take advantage of the computers that will be mainstream products in the market by 2010 and beyond.

Simulation-Based Design for the Electronics Industry

ANSYS is a world leader in the analysis of engineering problems. This section describes the following issues for simulation-based design for the electronics industry:

  • Key sectors
  • Business drivers
  • ANSYS solutions

Key Sectors

From ANSYS’ standpoint, the key sectors range from the silicon level to electronic products (such as computer servers and laptops) and the data center. Analysis products can address all the mechanical, fluid flow and thermal problems at these levels.

Business Drivers

The most important qualities a product can have are small size, fast performance and the ability to get to market quickly, meaning on the order of months instead of years. The consequences of these drivers include the following:

  • Rapid design cycles
  • Complex 3D packaging (system to IC level)
  • Performance impacted by interference (shielding humans in terms of health regulations, as well as other devices in terms of performance)
  • Reliability impacted by thermal and material considerations (thermal problems limit designs)
  • Better upfront design and analysis as a part of the design flow (saves overall cost, as well as MCAD/EDA layout tool connectivity in the flow)

These consequences create the need to establish the design before the prototype is built, and this need is met by ANSYS software.

ANSYS Solutions

The solutions offered by ANSYS fall into the following categories: mechanical, electromagnetics, thermal, CFD and multiphysics. ANSYS products are applied across various engineering industries (aerospace, biomedical, electronics and so on). The application areas for the electronics industry include three major categories of physics: mechanical, electromagnetics or thermal, and fluid flow. One major family of ANSYS products is mechanical; these products are used to solve stress- and strain-related problems. For example, vibration could be introduced into the electronics during operation or during shipping. Shock can occur during operation when someone accidentally kicks or drops a product. ANSYS mechanical products can be used to conduct drop tests, crash tests and explosion analysis.
A major advantage of simulation is the ability to change a design detail, such as the retention mechanism, and then run another simulation. ANSYS products are well suited to HPC for the following reasons:

  • CAE (computer-aided engineering) is among the most demanding applications for computing.
  • High memory requirements (demand for 64-bit processor technology)
  • Productivity and timely delivery are directly related to solve times (utilization of multiple processors and cluster computing).
  • Large data sizes (storage capacity can eliminate the need to rerun simulations)
  • Graphically intensive applications (3D geometry manipulation and animation needs)

AMD and Appro provide technology to increase performance, storage and memory. ANSYS typically works at the solver level, working with the fundamental physics and improving the way solvers attack the problem.

Scalable and Reliable Solutions for HPC

When approaching a high performance computing challenge, there is no single solution that will solve every problem. Instead, a specific tool is used to resolve a specific problem. Therefore, Appro has developed different tools to handle different problems. One of its solutions is the Xtreme series 3U server, a four-socket product. Appro has developed these products around the AMD Opteron processor, so the newest iteration of this product supports quad-core processors.

Data Center Challenges

Current data center challenges include the following:

  • Power/cooling
  • Space
  • Management

One area in which organizations can save a tremendous amount of power is the cooling system of the server itself. Ways to focus on platform-level power include:

  • Using low-power processors
  • De-popping unnecessary chips on motherboards
  • Using more efficient power supplies
  • Optimizing the platform for the data center environment

Major vendors claim that a server’s operating temperature range runs from about 0°C to 35°C, but the reality is that most data centers invest in very expensive equipment to keep the temperature between 20°C and 25°C. Servers that are designed to run at an operational temperature of 35°C require very high-speed fans. This is especially important for 1U servers and blade servers. Because it is common to use rows of high-speed rotary fans, the cooling system alone can account for about 40% of the system’s overall power consumption. One way to save money for customers is to guarantee a given inlet temperature range, for example 20°C to 25°C, and then optimize the platform to operate within that range. In many cases, it is possible to double the density of a customer’s data center.

Appro Solutions for HPC

In addition to the XtremeServer, Appro offers the Xtreme workstation and a HyperBlade product that addresses density balance. Appro’s management software makes it easy for customers to use these products together. One way an organization can increase the reliability of its systems and also lower its power consumption is by not putting unnecessary chips on the motherboard. Appro’s XtremeServer product is unusual because it is possible to de-pop any chip that is not being used. For example, two-socket XtremeServer products have eight memory slots per CPU in order to maximize the memory density of the platform. Appro currently ships 1U dual-socket AMD Opteron processor-based servers with about 64 gigabytes of memory. When 8-gigabyte memory modules become available and widespread, memory could increase to 128 gigabytes.

The single-chipset configuration uses an nForce Professional 2200 chipset. Customers typically use this configuration for processing capability and are likely to add only a high-speed interconnect such as InfiniBand. In that case, adding a Mellanox card or a QLogic card will turn the platform into a high-performance computing cluster node.
However, if an organization simply needs additional capabilities, one solution is to add another chip to that same platform to provide those capabilities. For example, suppose a customer’s code does not run very well on a cluster. A solution may be to run it in an SMP fashion, which requires more processing, more memory and more expansion slots. This strategy can turn a very efficient 1U dual-socket machine into a full-fledged 3U four-socket enterprise machine. Appro currently ships 3U four-socket AMD Opteron-based servers with 128GB of memory, which can increase to 256GB with the introduction of 8GB memory modules. The benefits of using a single motherboard for all AMD platforms are a shorter validation cycle, fewer support issues, and simpler material management and change control. The AMD-based Xtreme Series 1U servers include the following features and advantages:

  • 2-Socket Server
  • Single/Dual-core/Quad-core AMD Opteron processors
  • 16 DIMM sockets (64GB maximum)
  • SATA or SAS hot-swappable drives
  • Up to 1TB SATA or 600GB SCSI
  • Dual port Gigabit Ethernet ports
  • 1x PCI Express x16 and 1x PCI-X slot (optional)
  • ServerDome Remote Management – IPMI 2.0 compliant
  • Windows or Linux OS
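
The memory capacities quoted for the Xtreme Series servers follow from multiplying the number of DIMM sockets by the module size. The short Python sketch below reproduces that arithmetic. The function name is illustrative, and the 4GB module size for current systems is an assumption inferred from the 16-socket, 64GB figure rather than a value taken from an Appro specification.

```python
def max_memory_gb(dimm_sockets: int, module_gb: int) -> int:
    """Return the maximum memory, in GB, when every DIMM socket is populated."""
    return dimm_sockets * module_gb


# 1U two-socket server: 16 DIMM sockets (8 per CPU).
one_u_today = max_memory_gb(16, 4)     # assumed 4GB modules
one_u_future = max_memory_gb(16, 8)    # 8GB modules, once widely available

# 3U four-socket server: up to 32 DIMM sockets.
three_u_today = max_memory_gb(32, 4)
three_u_future = max_memory_gb(32, 8)

print(one_u_today, one_u_future, three_u_today, three_u_future)
# prints: 64 128 128 256
```

These values match the 64GB/128GB and 128GB/256GB figures given in the text for the 1U and 3U servers.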

Many customers incur unnecessary expense by having to validate different platforms and manage different suites of software and firmware. Appro eliminates that expense, because all of its platforms share the same motherboard. This means customers can maintain a single set of software packages, and it is necessary to validate the platform only once. Material management and change control are also easier. The AMD-based Xtreme Series 3U servers include the following features and advantages:

  • 2- or 4-Socket Server
  • Single/Dual-core/Quad-core AMD Opteron processors
  • Up to 32 DIMM sockets (128GB maximum)
  • SATA or SAS hot-swappable drives
  • 2x PCI-X and 2x PCI Express x16 (full length and double-stacked)
  • Dual-port Gigabit Ethernet ports
  • Redundant power supply (3x 600W in 2+1 configuration)
  • Leading AMD medium and large-scale Fluent 6.3 benchmark (8-way)
    • FL5M1 (3627.2), FL5M2 (6540.5), FL5M3 (1403.3)
    • FL5L1 (1043.7), FL5L2 (784.2), FL5L3 (137.4)

The 1U two-socket server makes sense as a compute server, while the 3U four-socket machine with 128GB of memory can be a powerful enterprise product that will run code intended to be run in an SMP fashion. In addition, Appro provides a hybrid blade server. It is not a good strategy to fit many servers in one rack if the result is a dramatic increase in power consumption. Therefore, Appro’s HyperBlade product has a manageable density. The 80-node cluster draws from 20 to 26 kilowatts, which results in a highly efficient data center. A 50-node cluster fits into Appro’s standard cabinet, which results in a 20% density increase with a very manageable power envelope. Appro also offers a mini-cluster that supports up to 17 nodes. Furthermore, it is possible to use the same blade and interchange it between these three platforms. HyperBlade Clusters include the following features and advantages:

  • Hybrid Blade/1U form factor
  • Eliminates cable clutter
  • Doubles the density of 1U Servers
  • Flexible, modular and scalable design
  • Up to 80 nodes in a custom rack
  • Up to 50 nodes in a standard rack
  • Up to 17 nodes in a self-contained rack
  • Simple and easy-to-use remote server management tool
  • Better scalability adds value to an organization’s Fluent license
    • Average large benchmark scalability (8 cores 5.9/16 cores 9.6)
    • Average medium benchmark scalability (8 cores 7.8/16 cores 12.2)
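
One way to read the Fluent scalability figures above is as parallel efficiency, that is, speedup divided by core count. The Python sketch below assumes that each quoted number is a speedup relative to a single core, which the seminar summary does not state explicitly. The function and variable names are illustrative.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / cores


# Average speedups quoted for the HyperBlade cluster, keyed by core count.
# Assumption: each value is a speedup relative to one core.
benchmarks = {
    "large": {8: 5.9, 16: 9.6},
    "medium": {8: 7.8, 16: 12.2},
}

for name, results in benchmarks.items():
    for cores, speedup in results.items():
        eff = parallel_efficiency(speedup, cores)
        print(f"{name:6s} {cores:2d} cores: speedup {speedup:4.1f}, efficiency {eff:.2f}")
```

Under that assumption, the large benchmarks run at roughly 74% efficiency on 8 cores and 60% on 16 cores, which gives a rough sense of how much extra throughput each additional core contributes.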

Appro’s new product is a 1U one-socket XtremeServer that has four cores in one socket, along with eight memory sockets. That means it is possible to have 32GB in a single-socket server at a lower cost. Instead of using rows of 1U fans, this product uses a single blower to cool the system efficiently, making it an attractive power-performance package. The quad core makes a single-socket server a real option in the high-performance computing space. Although clusters are still very difficult to manage and use, Appro strives to address that problem by presenting a unified solution in its XtremeCluster and cluster management software.

Maximizing Productivity with Multi-core Technologies

The second-generation AMD Opteron processor is competitive with Intel Core 2 technology, depending on the application. AMD Opteron processors are built on a consistent architecture, delivering the following advantages:

  • Continued performance-per-watt leadership, offering high-performing, low-power DDR2 memory and consistent 95W standard power and low-power options
  • Advanced leadership in x86 virtualization with AMD Virtualization hardware-assisted support and industry-leading Direct Connect Architecture
  • Reduced total cost of ownership by providing one transition to a new socket infrastructure and seamless dual-core to quad-core upgradeability in the same thermal envelope

AMD excels in the area of technical computation. Today, AMD launches its Quad-Core AMD Opteron processor for both the 2000 Series (2-way) and 8000 Series (up to 8-way). AMD’s design is a true quad-core processor that does not compromise on performance, power or heat. The benefits of a native quad-core design include the following:

  • Optimum performance
  • Same power and thermal envelopes as dual-core

One of the enhancements being introduced with the quad core is a third-level cache, which is part of a balanced, highly efficient cache structure. Each core has a dedicated L1 (level 1) cache of 64KB/64KB (data/instruction), compared with Xeon’s 32KB/32KB, which allows two loads per cycle. The dedicated L1 handles data quickly and efficiently. L2 (level 2) is also a dedicated cache, designed to eliminate the conflicts of shared cache structures and sized for true working data sets. The dedicated L2 makes it possible to avoid thrashing and to minimize latency. The third-level cache is shared across all four cores. The shared L3 is designed for optimum memory use and allocation for multi-core. It also serves as a communication path between cores if they need to access cache lines on any of the other cores. The shared L3 reduces latency to main memory. The entire cache structure provides efficient memory handling that reduces the need for “brute force” cache sizes. Having extra system bandwidth enables optimum quad-core performance.

Dual Dynamic Power Management

AMD’s new Quad-Core AMD Opteron processors also provide Dual Dynamic Power Management. Today’s standard AMD Opteron processor uses the same voltage for the memory controller and the core. With no infrastructure changes, a split-plane processor can be installed in an existing Socket F (1207) board. In an optimized board, the CPU and memory controller run from different voltage supplies for greater performance and better power management. AMD also improves processor power management with Enhanced AMD PowerNow! Technology. With AMD’s native quad-core processors, frequency can be controlled independently for each core, allowing for greater platform power savings.

The Communication Bus

The communication bus between the various AMD Opteron processors is called HyperTransport Technology, which is an open standard interface.
A related contribution from AMD, which is still proprietary, is called coherent HyperTransport. All communication between AMD Opteron processors is via coherent HyperTransport. Each quad-core device has three HyperTransport buses, two of which are used in a quad-socket server to communicate between processors. The other two HyperTransport links in this quad-socket server are operational in a non-coherent sense. They are used to communicate to the outside world, for example, by using southbridges that then communicate to drives and all other external peripherals. This is the fundamental architecture of an AMD Opteron-based system.

Each AMD Opteron socket supports as many as eight physical DIMMs. At four gigabytes per DIMM, this translates to 32 gigabytes per socket. In a quad-core processor, each core has a dedicated Level 1 and Level 2 cache, and all cores share a Level 3 cache. AMD has implemented advanced cache filtering techniques within the cores, which makes it possible to maintain a consistent memory structure that tracks which cache lines are active, which ones are not, and to which cores they are assigned. A native quad-core design also allows for more efficient communication between cores compared to Intel’s approach of a multi-chip module over a front-side bus. For example, when a core on one of the sockets needs to access data, it uses its normal cache probe techniques to check each cache level. If the data is not found, a request to the memory directory determines which cache line is active and on which core. This is one way to minimize the probe traffic that would otherwise consume most of the bandwidth on the HyperTransport links if every data access had to query every cache on every core.
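
The benefit of a directory lookup over broadcasting probes can be illustrated with a toy message count. The Python sketch below is not AMD’s implementation: the class name, the per-core cache sets and the one-message-per-probe cost model are all simplifying assumptions made for illustration.

```python
class ToyCoherence:
    """Toy model: each core has a private cache; a directory maps lines to owners."""

    def __init__(self, num_cores: int):
        self.caches = [set() for _ in range(num_cores)]  # cache lines held per core
        self.directory = {}                              # cache line -> owning core

    def load(self, core: int, line: int) -> None:
        """Record that `core` now holds `line`."""
        self.caches[core].add(line)
        self.directory[line] = core

    def broadcast_messages(self, requester: int, line: int) -> int:
        """Messages needed if a local miss probes every other core."""
        if line in self.caches[requester]:
            return 0
        return len(self.caches) - 1

    def directory_messages(self, requester: int, line: int) -> int:
        """Messages needed if a local miss consults the directory first."""
        if line in self.caches[requester]:
            return 0
        owner = self.directory.get(line)
        # One directory lookup, plus one targeted request if some core owns the line.
        return 1 if owner is None else 2


system = ToyCoherence(num_cores=16)   # for example, four quad-core sockets
system.load(core=5, line=0x40)
print(system.broadcast_messages(requester=0, line=0x40))  # prints: 15
print(system.directory_messages(requester=0, line=0x40))  # prints: 2
```

In this simplified model the broadcast cost grows with the number of cores, while the directory cost stays constant, which is the effect the text describes as minimizing traffic on the HyperTransport links.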
One of the major changes in the quad cores that responds directly to high-performance computing needs is the doubling of the floating-point logic in the cores themselves. On a per-core basis, it is possible to issue and retire up to four flops per cycle. This is helpful because high-performance computing applications tend to be very floating-point intensive. In some applications that have been benchmarked, there has been up to an 85% improvement in floating-point performance on a per-core basis. On certain floating-point-intensive HPC applications, there is more than a 200% performance increase when moving from two to four cores, a gain that combines the added cores with the per-core improvement. AMD hopes to make an equally significant impact in transaction processing environments as well.

In terms of the overall HPC market, there is very high growth in the EDA market, HPC and clusters. The market drivers for clusters include strong node-level performance, 64-bit processors, Linux, and multi-core technologies. The real issue will be how to program them. Price performance is expected to continue improving at a very aggressive rate. The number of processors a customer can buy for a given amount of money will grow tremendously. There is potential for dramatic changes in the market structure. Power and cooling concerns are expected to drive more system designs and purchasing decisions. The way in which multi-core processors, memory bandwidth and interconnects are delivered will be crucial. Therefore, there are many areas of opportunity. With regard to data center challenges, the focus is on platform-level power.
The most effective solutions to these challenges include using low-power processors, de-popping unnecessary chips on the motherboard, using more efficient power supplies and optimizing the platform for the data center environment. AMD has been a clear choice in many EDA applications. AMD Opteron processors deliver outstanding performance for critical enterprise applications, are designed to reduce overall power consumption in the data center, and can save valuable IT budget dollars through lower acquisition costs and lower long-term management costs.