3 Questions: Asaf Somekh on Voltaire Fabric Collective Accelerator

"3 Questions" is a new series from SC Online that gives members of the community the opportunity to sound off on current events in their field of expertise. In this, the first installment, Asaf Somekh, VP of marketing at Voltaire, discusses his firm's groundbreaking new solution that accelerates distributed applications.

Q. Please tell us about your company's new solution and the implications for customers in the supercomputing space.

A. Voltaire announced a groundbreaking new InfiniBand software and hardware solution that accelerates distributed application group communications in scale-out fabrics. The software component is Voltaire's patent-pending Fabric Collective Accelerator (FCA), which accelerates MPI (Message Passing Interface) collective operations by using Voltaire switches and their on-board processors to offload significant parts of group communication onto the switching fabric. At the same time, Voltaire's Unified Fabric Manager (UFM) software orchestrates an efficient, topology-based collective flow. Working in concert, these products remove bottlenecks at both the server and interconnect levels. The computational acceleration is achieved transparently, without requiring changes to the application. Because group communication is accelerated by a factor of up to ten, overall application run time is reduced. Voltaire FCA accelerates high performance computing applications such as reservoir modeling, fluid dynamics, crash analysis and others.
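To make the target concrete: a collective operation is a step in which every MPI process must participate before any can proceed. The sketch below (an editorial illustration, not Voltaire code) shows the kind of hot spot FCA addresses, an allreduce called on every iteration of a solver loop:

```c
/* Illustrative only -- not Voltaire code. Iterative solvers typically
 * call a collective such as MPI_Allreduce on every iteration, so the
 * collective's latency is paid thousands of times per run. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local_dot, global_dot;
    for (int iter = 0; iter < 1000; iter++) {
        local_dot = 1.0;               /* stand-in for local computation */
        /* Global synchronous step: every rank must contribute before
         * any rank can proceed -- the operation FCA offloads to the fabric. */
        MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```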

This is the industry's first fully integrated solution to offload collective operations across the full fabric topology. Combining the Voltaire FCA software with the intelligence on the switch dramatically accelerates collective processing, so customers can significantly boost application performance. Voltaire FCA works with Voltaire's complete portfolio of 40 Gb/s InfiniBand switches and will be available in early Q2 2010.

Q. What are the key challenges for adding new infrastructure at supercomputing sites? How does the Voltaire Fabric Collective Accelerator (FCA) address the problem end to end?

A. Group communication challenges are common in cloud computing and high performance computing, and addressing them is instrumental in scaling out application performance. With their global, synchronous nature, collective operations can become the bottleneck when scaling MPI-based parallel applications to thousands of computers. They can also have a significant impact on the scalability of some applications in smaller standards-based clusters. Voltaire FCA provides a way to remove the congestion created by collective operations in HPC.

The completeness and simplicity of the solution start with the fact that it is transparent to the application and requires no code changes. The FCA agent plugs into the MPI library at run time and takes over handling of collective operations at the transport layer. From that point on, the group computation is offloaded to the Voltaire switching fabric. The solution is automated end to end through Voltaire UFM, which controls the configuration of all the solution's elements. UFM integrates with the application scheduler; by knowing which servers are involved in each computation, UFM can configure the fabric and feed the FCA algorithms the proper information.
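Voltaire has not published the agent's internals, but the MPI standard's PMPI profiling interface illustrates how a library can transparently intercept a collective at run time without any application changes. A minimal, hypothetical sketch:

```c
/* Hypothetical sketch of transparent collective interception via the
 * standard PMPI profiling interface -- not the actual FCA mechanism,
 * whose internals Voltaire has not published. Linking this object
 * ahead of the MPI library replaces MPI_Barrier for the application. */
#include <mpi.h>

/* Placeholder for an offload path; a real agent would hand the
 * operation to the switching fabric here. */
static int offload_barrier(MPI_Comm comm)
{
    return PMPI_Barrier(comm);  /* fall back to the library's own barrier */
}

int MPI_Barrier(MPI_Comm comm)
{
    /* The application called MPI_Barrier; the agent decides whether
     * to offload or fall through to the standard implementation. */
    return offload_barrier(comm);
}
```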

Q. Capacity clusters are the most common and are used to deliver a certain amount of supercomputing capacity to end users. For example, a capacity cluster may support hundreds of users running any number of programs. A capability cluster is designed to run large, groundbreaking programs that were previously not possible. These systems usually push the limits of cluster technology because large numbers of systems must work together for long periods of time. In contrast, a capacity cluster has the ability to tolerate failure and continue running user programs. Will Voltaire's FCA bring InfiniBand to capability clusters?

A. FCA is an enabler of capability computing on commodity-based clusters.

An InfiniBand interconnect matched with x86-based commodity servers is the leading and growing standard technology for capacity clusters. Many customers use these clusters today to run hundreds of jobs concurrently. InfiniBand natively supports multipathing, so the failure of any core switch reduces the cluster's bandwidth capacity but keeps it running without interrupting the applications. Voltaire's fabric management software extends this high availability further: if an edge switch or the subnet manager fails, its role is immediately handed off to a secondary entity while fabric traffic continues uninterrupted.

Let's first look at the reasons that have so far kept x86 commodity-based clusters from effectively addressing the capability computing category. Above a certain cluster size, each node added yields a diminishing marginal gain in capability computing. In fact, beyond a certain size, each server added actually reduces the capability of the entire cluster for a task that involves all of the servers.
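A rough back-of-envelope model (an editorial illustration, not Voltaire's analysis) makes the effect concrete. Write the time per iteration on $N$ servers as

$$T(N) \approx \frac{W}{N} + c\,\log_2 N + j(N),$$

where $W/N$ is the perfectly parallel compute time, $c\,\log_2 N$ models a tree-based collective, and $j(N)$ is the expected worst-case OS delay across $N$ servers, which grows with $N$ because every collective waits for its slowest participant. The compute term shrinks as nodes are added while the other two terms grow, so past some $N$ the total $T(N)$ rises with every node added.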

The reason is that group communication becomes ineffective. When all the servers need to synchronize and talk to each other, as in collective operations, the weakest link in the chain determines the pace. Several causes of group communication slowdown can be named: OS noise, or jitter; congestion and buffering in the fabric; non-linear group communication patterns; and failure to consider the fabric topology when setting up the group communication pattern. Some of these challenges have been engineered out of capability supercomputers such as Cray's, but they have not been addressed to date by scale-out commodity clusters.
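The weakest-link effect is easy to demonstrate. In the sketch below (an illustrative program, not a Voltaire benchmark), each rank injects a small random delay before a barrier; the barrier completes only when the slowest rank arrives, so a delay on any one server stalls all of them:

```c
/* Illustrative sketch of the "weakest link" effect -- not a Voltaire
 * benchmark. Each rank sleeps a random time before a barrier; the
 * barrier completes only when the slowest rank arrives, so the worst
 * single delay sets the pace for the whole group. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    srand(rank + 1);

    double start = MPI_Wtime();
    usleep(rand() % 10000);          /* simulated OS noise: 0-10 ms */
    MPI_Barrier(MPI_COMM_WORLD);     /* gated by the slowest rank */
    double elapsed = MPI_Wtime() - start;

    printf("rank %d waited %.3f ms\n", rank, elapsed * 1e3);
    MPI_Finalize();
    return 0;
}
```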

Indeed, the message is that FCA opens up the realm of using commodity clusters based on x86 processors in capability computing as defined in the question above ("A capability cluster is designed to run large, groundbreaking programs that were previously not possible. These systems usually push the limits of cluster technology because large numbers of systems must work together for long periods of time").

Let's review the FCA solution components and see how they address the problem end to end. FCA needs to ensure that group communication has no variance and therefore no "weak link" that would extend the latency for the entire group. The most important step is to offload group communication from the servers. This rids group communication of its most non-deterministic factor: the Linux operating system.

As a second measure, we must isolate group communication from the rest of the traffic in the fabric, both logically and physically. FCA does this by making sure that all collective group communication is performed on a separate private network, both physical and logical (a virtual lane, or V-lane). These two factors are critical, but they need to be complemented by the right communication pattern.

Group communication has a logical profile: some servers talk to others, and their results are carried on to other server groups in the form of a logical tree. Today this tree is set up with no knowledge of the physical topology or node proximity, creating inefficiencies in which servers at far ends of the fabric must talk to each other. FCA introduces a proprietary, patent-pending set of algorithms that take the physical layout of the fabric into account and ensure that group communication is optimized first among the cores of the same server and then among neighboring servers (a generic sketch of this hierarchical pattern follows). This ensures that the longest communication path is as short as possible and that there are no inefficiencies in server-to-server communication.
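Voltaire's algorithms are patent-pending and unpublished, but the general hierarchical idea can be sketched with standard MPI calls: reduce within each server over shared memory first, then across one leader per server, then broadcast the result back. This is a generic illustration, not the FCA implementation, and the MPI-3 MPI_Comm_split_type call it uses is simply a convenient way to group ranks by server.

```c
/* Illustrative hierarchical allreduce -- the general idea behind
 * topology-aware collectives, not Voltaire's patent-pending algorithm.
 * Step 1: reduce among the cores of the same server (shared memory).
 * Step 2: reduce among one leader per server.
 * Step 3: broadcast the result back within each server. */
#include <mpi.h>

double hierarchical_allreduce_sum(double local, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank;
    double node_sum = 0.0, global_sum = 0.0;

    /* Group the ranks that share a server into one communicator. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Step 1: intra-server reduction to the node leader (rank 0). */
    MPI_Reduce(&local, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Build a communicator containing only the node leaders. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0,
                   &leader_comm);

    /* Step 2: inter-server reduction among the leaders only. */
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE,
                      MPI_SUM, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* Step 3: broadcast the global result back within each server. */
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);
    MPI_Comm_free(&node_comm);
    return global_sum;
}
```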

FCA is also tied to a pragmatic approach that requires quick setup and configuration of the fabric according to the application, to ensure that all the elements are tuned for the group of servers selected for the capability task. Voltaire Unified Fabric Manager (UFM) integrates with the application scheduler and ensures that, as a capability job starts running, its group communication (MPI) patterns are optimized in the way described above for that specific job.

As described in our recent announcement, initial internal and customer tests already show a 10x improvement in group communication. This empirical data aligns with our theoretical simulations and demonstrates that, with the help of FCA, commodity-based cluster computing can break into the domain of capability computing.

SC Online wishes to thank Asaf Somekh for his time and insights.