STREAMlined Performance

Detailed understanding of microprocessors improves HPC efficiency at TACC

Story Highlights:

  • Microprocessors achieve only a fraction of their performance potential for many applications, limiting the effective science achieved through computational analysis.

  • Much of this performance deficit is caused by the complexity of modern memory systems, which serve as a bottleneck for many codes.

  • To remedy this problem, TACC research scientist John McCalpin is developing analysis and optimization tools that allow users to interpret the performance of their code and make significant improvements.

 

John McCalpin knows a thing or two about microprocessors.

Having worked as a performance modeler and analyst on SGI’s early NUMA architecture, on IBM’s POWER4, POWER5, and POWER6 processors, and on AMD’s microprocessor development team, he’s something of an expert on the subject of chip performance.

In fact, before he even began his career in industry, McCalpin developed the STREAM benchmark—an industry-standard performance measure that has been used since 1991 to determine the "real world" bandwidth sustainable from user programs.
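
As a rough illustration of what STREAM measures, the sketch below times a triad-style kernel in plain C and reports the sustained memory bandwidth. It is a minimal stand-in, not the official benchmark, which adds repetition rules, verification, and careful reporting; the array size here is an arbitrary choice meant only to be much larger than the caches.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 20000000L   /* elements: large enough to overflow the caches */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        const double scalar = 3.0;

        for (long j = 0; j < N; j++) { b[j] = 1.0; c[j] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long j = 0; j < N; j++)
            a[j] = b[j] + scalar * c[j];          /* the "triad" kernel */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs  = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        double bytes = 3.0 * sizeof(double) * N;  /* read b, read c, write a */
        printf("sustained bandwidth: %.1f MB/s\n", 1e-6 * bytes / secs);

        fprintf(stderr, "%f\n", a[N / 2]);        /* keep the result live */
        free(a); free(b); free(c);
        return 0;
    }

Compiled with optimization (for example, gcc -O2) and run on a single core, the reported figure is the kind of sustainable bandwidth STREAM is designed to expose, which is typically well below the peak number on a datasheet.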

As a research scientist at the Texas Advanced Computing Center (TACC), these experiences afford McCalpin unique insights into the performance potential and limits of processors, which, in turn, generate ideas for new architectures and implementations in the future.

TACC research scientist John McCalpin is on a mission to help users squeeze more performance from TACC’s high-performance computing systems.

Things weren’t always so complicated. “It’s difficult to even describe how much more complex computers are now than they were 15, 10, or even 5 years ago,” McCalpin says.

In early microprocessor-based systems, memory was fairly simple, and improvements in computing performance came primarily from increasing the speed of calculations. Today’s systems, however, have many more levels in their hierarchical memory structures, with each generation adding new complexities that create performance bottlenecks.

“There are a lot more features to be aware of when you’re designing your code for modern systems,” said McCalpin. “Every aspect of the systems is more complex, but much of the trouble comes from the memory. The memory subsystem is composed of many layers of address translation and mapping, and it’s very easy to run into conflicts or not be able to exploit the performance potential of the system.”

McCalpin is on a mission to help users squeeze more performance from TACC’s high-performance computing systems. The first task is to identify whether a code is running well or not. This requires understanding both the performance capability of the system and the performance achieved by the code. To help users measure the performance capabilities of the system, McCalpin is developing a comprehensive suite of performance metrics for Ranger based on common programming kernels, such as reading data from cache or storing data to memory attached to a different processor chip. The goal is to create both a set of equations describing the “best case” performance and a short (though perhaps not simple) example program demonstrating how to get performance close to these limits.
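
The kernels themselves are not spelled out in this story. As one hedged illustration of the second kind, the sketch below stores a large array first into memory attached to the local chip and then into memory attached to another chip, and reports the two rates. It assumes the libnuma library (compile with -lnuma), shows a difference only on a multi-socket node, and for a clean comparison the process would also be pinned to one chip, for example with numactl.

    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <numa.h>

    #define N (64L * 1024 * 1024)   /* bytes per test array */

    static double store_rate(char *buf)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memset(buf, 1, N);                        /* stream stores into the array */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        return 1e-6 * N / s;                      /* MB/s */
    }

    int main(void)
    {
        if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

        /* on a single-socket node both arrays end up on the same chip */
        char *local  = numa_alloc_local(N);                    /* memory on this chip    */
        char *remote = numa_alloc_onnode(N, numa_max_node());  /* memory on another chip */

        memset(local, 0, N);  memset(remote, 0, N);  /* touch pages before timing */

        printf("local  store rate: %8.1f MB/s\n", store_rate(local));
        printf("remote store rate: %8.1f MB/s\n", store_rate(remote));

        numa_free(local, N);  numa_free(remote, N);
        return 0;
    }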

Determining how well a user’s code is running requires access to the hardware performance monitors included in Ranger’s Opteron microprocessors, along with a detailed understanding of the inner workings of the processors. Some of these performance monitors have been accessible since early in Ranger’s lifetime, while access to the memory system and chip-to-chip interconnect counters will be provided as soon as a Linux kernel patch developed by McCalpin is rolled out to Ranger’s 3,936 compute nodes.
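
The Ranger-specific counter interface is not described here, and the sketch below is not McCalpin’s kernel patch; it is a generic illustration of reading one hardware counter on a reasonably recent Linux system through the perf_event_open interface, counting last-level cache misses around a region of code.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* glibc provides no wrapper for this system call */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type           = PERF_TYPE_HARDWARE;
        attr.size           = sizeof(attr);
        attr.config         = PERF_COUNT_HW_CACHE_MISSES;  /* last-level cache misses */
        attr.disabled       = 1;
        attr.exclude_kernel = 1;

        int fd = perf_event_open(&attr, 0, -1, -1, 0);      /* this process, any CPU */
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile double x = 0.0;                 /* stand-in for the code being measured */
        for (long i = 0; i < 10000000; i++) x += (double)i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long count = 0;
        read(fd, &count, sizeof(count));
        printf("cache misses in measured region: %lld\n", count);
        close(fd);
        return 0;
    }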

According to McCalpin, the application performance limiters in a system are often related to hardware implementation details that most users don’t know about. For that reason, McCalpin is reviewing information from industry and working with TACC’s kernel and application performance teams to make the center’s supercomputers as efficient as possible.

"At the petascale and exascale level,
the fundamental metrics are related to 
communication performance, computational 
performance per watt, and recovery from 
transient errors."

John McCalpin, TACC research scientist

To maximize the usefulness of this detailed understanding of the hardware, McCalpin is also working with Professors Jim Browne, Keshav Pingali, and Stephen Keckler in the Computer Science department at The University of Texas at Austin to develop “smart” analysis tools. These tools take in dozens (or hundreds) of performance measurements, combine them with equations describing the system’s performance characteristics, filter out information that is unimportant or unhelpful, and give the user the simplest possible suggestions for changes that may improve performance.
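
Those tools are still being developed and are not shown here. As a toy illustration of the general idea only, the sketch below combines two hypothetical measurements for a code region with a simple model of the machine and reports the most likely limiter; every number in it is an assumed placeholder, not a Ranger measurement.

    #include <stdio.h>

    int main(void)
    {
        /* hypothetical measurements for one code region (e.g. from counters) */
        double flops   = 4.0e9;    /* floating-point operations executed */
        double bytes   = 8.0e9;    /* bytes moved to and from memory     */
        double seconds = 2.5;      /* measured run time                  */

        /* assumed machine parameters (placeholders, not Ranger's numbers) */
        double peak_flop_rate = 8.0e9;   /* flop/s per core                 */
        double peak_bandwidth = 4.0e9;   /* sustainable bytes/s from memory */

        double t_compute = flops / peak_flop_rate;   /* lower bound if compute limits */
        double t_memory  = bytes / peak_bandwidth;   /* lower bound if memory limits  */
        double t_best    = (t_compute > t_memory) ? t_compute : t_memory;

        printf("best case %.2f s, measured %.2f s (%.0f%% of best case)\n",
               t_best, seconds, 100.0 * t_best / seconds);
        if (t_memory > t_compute)
            printf("suggestion: region looks memory-bound; focus on data layout and reuse\n");
        else
            printf("suggestion: region looks compute-bound; focus on vectorization\n");
        return 0;
    }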


The STREAM benchmark has been used since 1991 to determine the "real world" bandwidth sustainable from user programs.

“People know that caches and memory banks exist, but they don’t know the details of how data is assigned to memory banks or the details of how data moves between memory and the various caches,” said McCalpin. “Very few people know how to manage the memory at this level of detail, but failing to get the details right can easily cost a factor of two or more in performance due to memory stalls and other conflicts.”
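
One hedged example of the kind of detail he means: the sketch below reads the same array twice, once with unit stride and once with a large power-of-two stride that keeps landing on the same cache sets and memory banks. The sizes are illustrative and the slowdown depends entirely on the machine, but the contiguous pass is normally much faster.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (32L * 1024 * 1024)   /* doubles: about 256 MB, far larger than the caches */

    static void sum_with_stride(const double *a, long stride)
    {
        struct timespec t0, t1;
        double s = 0.0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long start = 0; start < stride; start++)
            for (long i = start; i < N; i += stride)
                s += a[i];                        /* every element touched exactly once */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        printf("stride %6ld: %.3f s (sum %.1f)\n", stride, secs, s);
    }

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        for (long i = 0; i < N; i++) a[i] = 1.0;

        sum_with_stride(a, 1);      /* contiguous: cache lines and prefetch are used well  */
        sum_with_stride(a, 4096);   /* power-of-two stride: poor cache and bank reuse      */

        free(a);
        return 0;
    }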

Another approach to understanding performance is to measure the sensitivity of an application’s performance to changes in CPU frequency, memory bandwidth, memory latency, and interconnect performance. On Ranger and other TACC systems, McCalpin has been testing and analyzing application performance with various hardware configurations, creating models that can be used to estimate performance on different configurations of similar hardware.
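
In its simplest form, that kind of model splits run time into a part that scales with CPU frequency and a part, dominated by memory time, that does not. The sketch below is a hedged illustration with made-up numbers: two measurements at different clock frequencies determine the two terms, which then predict the run time of a configuration that was never measured.

    #include <stdio.h>

    int main(void)
    {
        /* measured (frequency in GHz, run time in seconds); illustrative values */
        double f1 = 2.3, t1 = 100.0;
        double f2 = 1.8, t2 = 115.0;

        /* model: t(f) = a/f + m, where m is the frequency-insensitive part */
        double a = (t1 - t2) / (1.0 / f1 - 1.0 / f2);
        double m = t1 - a / f1;

        double f3 = 2.6;   /* a configuration not measured */
        printf("frequency-scaled term a = %.1f, frequency-insensitive term m = %.1f s\n", a, m);
        printf("predicted time at %.1f GHz: %.1f s\n", f3, a / f3 + m);
        return 0;
    }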

Eventually, McCalpin intends to combine his understanding of chip performance with the lessons he is learning from TACC’s user applications to design HPC systems at the peta- and exascale.

“As system scale increases, the characteristics that are fundamentally important to performance change as well,” McCalpin explained. “For an application running on a single server, CPU performance and memory bandwidth are of paramount importance. At the petascale and exascale level, the fundamental metrics are related to communication performance, computational performance per watt, and recovery from transient errors.

“I’m trying to help people understand how the hardware works so they can get the best performance from TACC’s systems for their applications, now and in the future.”