3 Questions: David Smith on InfiniBand

David Smith is Senior Product Manager, InfiniBand Products, at QLogic. For the past ten years he has focused on HPC server platforms, routers, storage systems and interconnects. Previously, he held product management positions at 3Par, Silicon Graphics, and 2Wire. He holds an MBA from San Jose State University. In an interview with SC Online, Mr. Smith shares his thoughts about InfiniBand as a cluster interconnect for supercomputing.

SC Online: How has InfiniBand evolved as an HPC interconnect? It is taking more market share away from Gigabit Ethernet and proprietary interconnects. Why?
 
Smith: InfiniBand has evolved significantly over the past several years, and in many cases it has achieved cost/performance parity with proprietary interconnects. InfiniBand has quickly become the interconnect of choice for cluster architectures, and those cluster-based architectures are gaining HPC market share because they scale cost-effectively. Next-generation proprietary interconnects have become increasingly costly to develop. It is not clear whether system vendors will keep investing in their own interconnect development when InfiniBand vendors are expanding the capabilities of the InfiniBand product set at a rapid pace and providing solutions at lower cost.
 
As for Ethernet, it’s a general-purpose medium that hasn’t evolved to meet increasingly demanding HPC application requirements for low latency and high message rates. Even 10-Gigabit Ethernet doesn’t offer the latency or message rates of InfiniBand.
 
SC Online: What are the key challenges in implementing InfiniBand fabrics?
 
Smith: Since the whole point of an HPC interconnect is to deliver the best possible node-to-node performance and bandwidth with the least possible latency, the challenges in implementing InfiniBand come down to optimizing the fabric for its particular environment while keeping management costs in line. There are several factors to consider.
 
Scalability – HPC clusters are growing rapidly as the demand for more compute performance outpaces even the evolution of multi-core processors. This means that as the cluster scales, more and more time is spent on communications between nodes and less time on computation. The fabric should have management tools that make communications as efficient as possible so there are more resources available for computation.
 
Efficiency – As clusters grow, the number of paths between nodes can grow exponentially. The challenge is to ensure that the traffic is always optimized for travel on the least-congested paths.
 
Optimizing the capabilities of the most widely used MPI libraries – There are many Message Passing Interface (MPI) libraries in use, and server vendors customize their MPI libraries to take advantage of specific differentiators in their equipment. The InfiniBand fabric should enable high-speed, efficient communications over any MPI.
 
Support for mainstream and alternate topologies – The Fat Tree topology is the standard in clustered environments, but some facilities use Torus or mesh topologies instead. Failure handling is much more critical in Torus or mesh topologies because of the sheer size of the deployments – they are typically used in clusters containing thousands of nodes. The InfiniBand fabric should support these topologies equally well, so that users of these alternatives don’t pay a performance penalty.

SC Online: What is QLogic doing to address these challenges differently from other InfiniBand vendors?

Smith: QLogic takes a system-level approach to InfiniBand – we design ASICs, adapters, and switches, but we have also invested heavily in system architecture, including host system interfaces, application messaging patterns, fabric and I/O virtualization, fabric routing, and signal integrity and modeling. Our software development spans scalable communication libraries, fabric management, installation services, and element management. With this broad perspective we have looked at the key challenges and addressed them with features that are unique to our products.

Scalability – To improve scalability, QLogic uses a host-based interfacing approach. This is one fundamental difference between QLogic HPC solutions and those from other vendors.

In the early days of HPC clusters, we were dealing with single-core, Pentium-class systems. Back then, other vendors adopted the strategy of offloading everything to the adapters, which was a good idea at the time. But with multi-core processors and larger clusters, the load on those on-board processors becomes too great and they actually become a bottleneck. Some of our tests show that the bottleneck begins with as few as four or five cores in a single node.

QLogic uses a host-based processing approach, so we specifically developed our host software and host silicon to excel at MPI workloads. As a result, the performance of our QDR InfiniBand HCAs actually increases as the number of cores in a cluster scales upward. We outperform competitive HCAs by more than 22% on a 256-node cluster running real-world applications, so we deliver better performance on small clusters and even better performance on large clusters. (This is based on SPEC MPI2007 results submitted in July 2009.)

Offloading work to adapters requires more power and space and is ultimately a flawed design that doesn’t take advantage of all the advancements on the server side. The competition is fighting against Moore’s Law. QLogic is enabling businesses to capitalize on it.

Efficiency – Treated as a single pipe, the HPC fabric can suffer significant losses in performance as multiple applications contend for resources. For example, if a message-intensive application is running with one microsecond latency, adding a storage application can increase the overall latency by up to 400 percent.

We use a feature called virtual fabrics, with classes of service, to eliminate resource contention and optimize performance for every application. Virtual fabrics is the ability to segregate traffic into priority classes. A user may have jobs that require different priorities, or may want to separate traffic types into their own classes, such as compute traffic, storage traffic, and management traffic. QLogic lets users partition traffic flows to make sure that storage traffic doesn’t interfere with critical compute traffic.

QLogic’s virtual fabrics capability supports up to 16 service classes simultaneously. If a network administrator understands which applications must be supported and how the workloads occur on the fabric, he or she can use virtual fabrics to automatically optimize the fabric’s resources to ensure maximum performance for every job.
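
To make the idea concrete, here is a minimal sketch of traffic segregation by service class. It is purely illustrative and written in Python rather than any fabric configuration language; the class names and priority values are assumptions, not QLogic’s actual syntax. The point is simply that higher-priority classes are drained ahead of lower-priority ones.

```python
# Illustrative sketch only: a toy model of traffic segregation by service
# class, not QLogic's configuration syntax. Class names and priority values
# are assumptions; a real fabric supports up to 16 classes.
from dataclasses import dataclass, field
import heapq

SERVICE_CLASSES = {"compute": 0, "storage": 1, "management": 2}  # 0 = highest priority

@dataclass(order=True)
class Message:
    priority: int
    payload: str = field(compare=False)  # ordering considers priority only

class VirtualFabricQueue:
    """Models a single outbound port that drains higher-priority classes first."""
    def __init__(self):
        self._heap = []

    def submit(self, traffic_type, payload):
        heapq.heappush(self._heap, Message(SERVICE_CLASSES[traffic_type], payload))

    def drain(self):
        while self._heap:
            yield heapq.heappop(self._heap).payload

port = VirtualFabricQueue()
port.submit("storage", "checkpoint block 17")
port.submit("compute", "MPI halo exchange")
port.submit("management", "fabric health query")
print(list(port.drain()))  # compute traffic is serviced ahead of storage and management
```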

Another related feature is adaptive routing, which also makes the fabric much more efficient. Most HPC fabrics are designed to enable multiple paths between switches, but standard InfiniBand switches don’t necessarily take advantage of these paths to reduce congestion. As implemented by QLogic, adaptive routing is a capability that shifts network traffic from over-utilized links to less utilized links. Adaptive routing leverages intelligence in the switches themselves to maintain awareness of the performance of every available path, and to automatically choose the least congested path for each traffic flow.
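
As a rough illustration of per-flow adaptive routing, the sketch below pins each new flow to the least-utilized output link and keeps existing flows on their chosen link so packets stay in order. It is an assumed model, not QLogic’s switch firmware, and the link names and utilization numbers are made up.

```python
# Illustrative sketch only: an assumed model of per-flow adaptive routing,
# not QLogic's switch firmware. New flows take the least-congested link;
# existing flows stay on their link so their packets remain in order.

class AdaptiveRouter:
    def __init__(self, links):
        # links: mapping of link name -> current utilization (0.0 .. 1.0)
        self.links = dict(links)
        self.flow_table = {}  # flow id -> chosen link

    def route(self, flow_id):
        # Keep an existing flow on its link so packets arrive in order.
        if flow_id not in self.flow_table:
            self.flow_table[flow_id] = min(self.links, key=self.links.get)
        return self.flow_table[flow_id]

    def update_utilization(self, link, utilization):
        # In a real switch this would come from hardware counters.
        self.links[link] = utilization

router = AdaptiveRouter({"link_a": 0.80, "link_b": 0.20, "link_c": 0.55})
print(router.route("rank3->rank17"))   # new flow takes the least-loaded link_b
router.update_utilization("link_b", 0.95)
print(router.route("rank5->rank42"))   # next flow avoids the now-congested link_b
print(router.route("rank3->rank17"))   # existing flow stays put, preserving order
```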

QLogic’s implementation of adaptive routing is a great example of how we think through the entire problem before delivering a feature. Although other vendors also have adaptive routing, there are several key differences:

  • Fabric intelligence vs. subnet manager intelligence – QLogic’s Adaptive Routing is built into the switch chips themselves, so the decisions about the ideal path are made in the switch. Our competitors’ implementation relies on the subnet manager, which recalculates routes and then sends commands to the switches for execution. It is far faster and more efficient for the switches themselves to make the decision, and this eliminates the chance that the subnet manager itself can become a bottleneck.
  • Scalable fabric intelligence – by incorporating adaptive routing directly into its switch chips, QLogic actually allows the path selection intelligence to scale as the fabric grows: as switches are added, we add to the overall pool of knowledge about path characteristics as well. This is not the case with competitive products because they always rely on the subnet manager.
  • Managing flows, not packets – QLogic’s approach is more in tune with the realities of fabric operations. Switches work on a per-flow basis, not a per-packet basis. Competitive products will continue to send packets down a congested path, or will send packets on an alternate path only to have them arrive at the destination out of order; at that point they must be reordered, which causes delays. By working on a per-flow basis, QLogic’s fabric software optimizes transport of the flow, not individual packets. It can automatically re-route part of a flow down a different path when there is congestion, and it reassembles packets from divergent paths in the proper order for optimized processing at the destination.

Another important differentiator is QLogic’s distributed adaptive routing, which adds intelligence as the cluster scales. With distributed adaptive routing, every TrueScale ASIC that we put into the network has its own integrated RISC microprocessor, with its own ability to evaluate local conditions and make determinations in conjunction with the other ASICs. So every switch you add also adds a microprocessor, and with it more routing intelligence in your HPC network. Host adapters are aware of multiple paths to any destination and use that routing intelligence to decide how to route packets at any given time.

We also use a feature called dispersive routing to automatically load-balance traffic across the network, effectively leveraging the entire fabric to optimize application performance. Working through the host adapters, QLogic’s dispersive routing distributes traffic over multiple paths to a destination to load-balance the network, while ensuring that packets sent over disparate routes are delivered in the proper order for processing at their destination.
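
A minimal sketch of the dispersive-routing idea, under the assumption that sequence numbers are used to restore order: the sender sprays a flow’s packets round-robin across several paths, and the destination reorders them before handing them to the application. This is illustrative only, not QLogic’s implementation.

```python
# Illustrative sketch only: an assumed model of dispersive routing, not
# QLogic's implementation. A flow's packets are sprayed round-robin across
# several paths; sequence numbers let the destination restore their order.
import itertools

def disperse(packets, paths):
    """Tag each packet with a sequence number and assign paths round-robin."""
    path_cycle = itertools.cycle(paths)
    return [(seq, next(path_cycle), pkt) for seq, pkt in enumerate(packets)]

def reassemble(received):
    """Restore the original packet order at the destination."""
    return [pkt for _, _, pkt in sorted(received, key=lambda entry: entry[0])]

packets = [f"chunk-{i}" for i in range(6)]
in_flight = disperse(packets, paths=["path-1", "path-2", "path-3"])

delivered = list(reversed(in_flight))    # simulate out-of-order arrival
assert reassemble(delivered) == packets  # the application still sees the original order
```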

MPI Support – While the Message Passing Interface (MPI) is a common denominator in InfiniBand software support, hardware companies usually supply vendor-specific MPI libraries that optimize performance for their products. MPI libraries are developed with different emphases – performance or scalability, for example – and open-source MPIs are not as feature-rich as some of the commercial ones. For example, not all MPIs have native QoS support.
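
For readers less familiar with MPI, the snippet below shows the kind of point-to-point message passing these libraries accelerate. It uses the generic mpi4py bindings (an assumption; any MPI library exposes equivalent calls) rather than any vendor-specific library, and the script name in the run command is hypothetical.

```python
# A generic MPI point-to-point example using the mpi4py bindings (an
# assumption; any MPI library exposes equivalent calls). This is not a
# vendor-specific MPI. Run with, e.g.:  mpirun -n 2 python ping.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"payload": "halo cells"}, dest=1, tag=11)   # rank 0 sends a message
elif rank == 1:
    data = comm.recv(source=0, tag=11)                     # rank 1 receives it
    print("rank 1 received:", data)
```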

Competitors’ fabric software typically supports only Open MPI libraries, so it does not take advantage of custom features in vendor-specific products. For example, another software company will support Open MPI in the OFED stack, but it adds no value to the vendor-specific libraries that are out there. With our software, QLogic enables all features for every vendor-specific MPI, so the advanced routing, path selection, and class-of-service features work in any customer environment without compromising performance.

Topology Flexibility – Fat Tree is the most common topology used in HPC cluster environments, but it’s not the only one. QLogic is the only company to fully support alternative topologies.

Many HPC centers move to Torus or mesh topologies when the cluster grows to 2,000 nodes or more, because these topologies require fewer switches and are thus less expensive than Fat Tree.

Torus and mesh topologies enable full bandwidth to the nearest neighboring nodes, but they don’t provide as much bandwidth to nodes beyond that. This takes about 50 percent of the cost out of the network, but it also means that applications must be very intelligent in terms of how they distribute the workload across nodes.
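
The sketch below illustrates why placement matters on a torus: each node has a fixed set of one-hop, full-bandwidth neighbors, including wraparound links at the edges, so an application that keeps its heaviest communication among those neighbors avoids the lower-bandwidth longer hops. The dimensions and coordinates are arbitrary examples.

```python
# Illustrative sketch only: computing a node's one-hop neighbors on a 3D
# torus. The dimensions and coordinates are arbitrary examples.
def torus_neighbors(coord, dims):
    """Return the six wraparound neighbors of `coord` on a 3D torus of size `dims`."""
    x, y, z = coord
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),
    ]

# Even a node on the "edge" of a 10x10x10 torus has six one-hop neighbors,
# because the links wrap around; keeping heavy traffic among these neighbors
# is what preserves full nearest-neighbor bandwidth.
print(torus_neighbors((9, 0, 5), dims=(10, 10, 10)))
```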

QLogic supports these alternative topologies with built-in browsing: applications can leverage it to quickly build accurate maps of where to distribute workloads for optimum performance. As a result, QLogic’s implementation delivers much better performance in Torus or mesh topologies than the competition.

SC Online wishes to thank David Smith for sharing his time and insights.