Technology Overview

An Introduction to Memory Optimization

Today’s complex computer architectures and their deep memory hierarchies are a poor match for most applications. Because of the wide gap between processor speed and memory speed, processors often spend more than half of their time waiting for data to arrive. This problem is expected to become worse with the introduction of multicore processors, for two reasons: less cache area per thread, and more concurrent threads contending for bandwidth. These effects are generally expected to remain the major bottlenecks for many years to come.

By removing these bottlenecks you could improve the value of an application in many different ways:

  • Increase throughput: calculations finish faster, or reach better precision in the same time
  • Use hardware investments better
  • Extend system lifetime by freeing cycles for new features
  • Select cheaper hardware for a given task
  • Reduce power consumption

Until now, however, performance analysis tools have forced developers to wade through a mass of data before they have any idea where the performance problems are. Even then, identifying the specific nature of a problem seems to require more magic than engineering skill. Worst of all, developers may then spend a vast amount of time trying to fix problems without any guarantee that the application will perform significantly better when they have finished.

Enter: ThreadSpotter

ThreadSpotter is a new generation of performance analysis tool. Where other performance tools dump a haystack of data in front of you, ThreadSpotter lifts the haystack to point out the needles, classifies them and explains ways to remove the problems.

ThreadSpotter makes performance experts more productive, and it teaches programmers which programming techniques work well with the hardware.

ThreadSpotter offers a solid, detailed understanding of cache performance problems so that they can be resolved quickly. It also lets software architects work at a higher level, finding the modules that make others suffer in a representative execution environment.

What is in your cache?

It is a well-known fact that memory accesses take a disproportionately long time compared to arithmetic operations. During the time it takes to access memory, the CPU can easily finish many hundreds of other instructions. Caching techniques can often hide latency by storing recently used data in much smaller, yet much faster cache memories. However, this is only effective if the “right” data happens to be in the cache at the “right” time.

The two largest enemies of good cache utilization and low bandwidth consumption are wasted space and lack of data reuse (poor locality). Wasted space means that data the application does not need occupies precious cache space. Lack of locality means that data is not reused enough while it resides in the cache.
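To make wasted space concrete, here is a minimal sketch in Python. All sizes are assumptions chosen for illustration (64-byte cache lines, a hypothetical 64-byte record of which a loop reads a single 8-byte field); none of them come from ThreadSpotter. The sketch counts which cache lines a loop touches and compares the bytes it actually uses with the bytes that had to be fetched.

```python
# Illustrative arithmetic only: line size, record size and field size are
# assumptions, not values reported by any tool.
LINE_SIZE = 64  # bytes per cache line (assumed)


def line_utilization(record_size, used_bytes, n_records=1024, line_size=LINE_SIZE):
    """Fraction of fetched bytes that a loop really uses when it walks a
    contiguous array of records and reads the first `used_bytes` of each."""
    touched_lines = set()
    for i in range(n_records):
        start = i * record_size
        for addr in range(start, start + used_bytes):
            touched_lines.add(addr // line_size)
    useful = n_records * used_bytes
    fetched = len(touched_lines) * line_size
    return useful / fetched


# One 8-byte field read from each 64-byte record: most of every line is wasted.
array_of_records = line_utilization(record_size=64, used_bytes=8)

# The same field packed into its own dense array: every fetched byte is used.
packed_field = line_utilization(record_size=8, used_bytes=8)

print(f"array of records: {array_of_records:.3f}")
print(f"packed field:     {packed_field:.3f}")
```

Under these assumptions the record layout uses one eighth of each fetched line, so the same loop moves eight times more data through the cache than the packed layout does.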

While the importance of these two concepts may be simple to understand at this level, understanding their impact on an application may be less obvious.

It comes down to reasoning about memory layout, access patterns and data sizes in relation to hardware properties, which is far from the abstractions programmers normally work with.

Memory system statistics

ThreadSpotter will gently monitor the execution of unmodified application binaries running with representative, full data sets, and capture sparse memory fingerprints representing the essence of the application’s locality properties. No restart or recompilation of the application is necessary.

ThreadSpotter will analyze the memory fingerprint and calculate the program performance in terms of high-level metrics, such as:

  • Cache miss probability
  • Fetch probability
  • Cache line utilization (percentage of useful vs. wasted cache space)

In fact, the memory fingerprint carries all the information ThreadSpotter needs to extrapolate these metrics to any cache, such as each of the cache levels of a memory system. It can also accurately predict these metrics for cache configurations different from the architecture where the fingerprint was acquired.
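The text does not describe how the fingerprint is built or evaluated, so the following is not ThreadSpotter's method. It is a sketch of one textbook idea, LRU stack distance, that shows why a single profile can answer questions about many cache sizes. It assumes a fully associative LRU cache and a complete (not sparse) trace of cache line numbers; the trace itself is made up.

```python
def stack_distances(trace):
    """For each access, return the number of distinct lines touched since the
    previous access to the same line, or None for a first-time access."""
    stack = []  # least recently used first, most recently used last
    distances = []
    for line in trace:
        if line in stack:
            idx = stack.index(line)
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)
        stack.append(line)
    return distances


def miss_ratio(distances, cache_lines):
    """Miss ratio of a fully associative LRU cache holding `cache_lines` lines.
    An access hits only if fewer than `cache_lines` distinct lines intervened."""
    misses = sum(1 for d in distances if d is None or d >= cache_lines)
    return misses / len(distances)


# Hypothetical trace: a loop sweeping over four cache lines, three times.
trace = [0, 1, 2, 3] * 3
distances = stack_distances(trace)

# The same distances are reused for every cache size; no re-profiling needed.
ratios = {size: miss_ratio(distances, size) for size in (2, 4, 8)}
for size, ratio in ratios.items():
    print(f"{size} lines: miss ratio {ratio:.2f}")
```

The distances are computed once from the trace and then evaluated against any number of cache sizes, which is the property the paragraph above attributes to the fingerprint.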

Cache line utilization is a key metric, as it relates to the amount of useful space and the amount of wasted space in caches. Until now, it has not been possible to measure this efficiently. Fetch probability is mostly related to bandwidth problems and requires hardware prefetch probability to be taken into account – yet another important property not given enough attention in the first generation of performance tools.

Based on the primary metrics above, ThreadSpotter will derive secondary statistical metrics:

  • Fetch rate
  • Miss rate
  • Estimated fetch rate when cache line utilization issues are fixed

Fetch rate is the average number of fetches performed per data request. It relates to bandwidth and is directly responsible for performance degradation due to contention on the memory bus.

Miss rate is the number of stalls per data request due to the cache not containing the requested data. This takes lack of prefetching into account. High miss rates limit the performance for some programs, while others are limited by bandwidth.

Estimated fetch rate shows how much the bandwidth demand imposed on the system would be reduced if the program were altered to achieve perfect cache line utilization. This is important for gauging the potential of each piece of advice.
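The three secondary metrics can be illustrated with a small Python sketch. The counts are invented, and the estimate assumes that fetch traffic shrinks in proportion to the fraction of each line that is actually used; that scaling is a simplifying assumption made here, not a formula given in the text.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical counters for one program, loop or instruction."""
    accesses: int       # data requests issued
    fetches: int        # cache lines brought in, including prefetched lines
    misses: int         # requests that stalled because the data was absent
    utilization: float  # fraction of each fetched line that is used (0..1)


def fetch_rate(c):
    """Average number of fetches per data request."""
    return c.fetches / c.accesses


def miss_rate(c):
    """Average number of stalls per data request."""
    return c.misses / c.accesses


def estimated_fetch_rate(c):
    """Fetch rate if every fetched byte were useful.
    Assumes traffic scales linearly with utilization (an assumption)."""
    return fetch_rate(c) * c.utilization


# Invented numbers: prefetching hides most fetches, so misses << fetches.
loop = Counts(accesses=1_000_000, fetches=125_000, misses=25_000, utilization=0.25)

print(f"fetch rate:           {fetch_rate(loop):.4f}")
print(f"miss rate:            {miss_rate(loop):.4f}")
print(f"estimated fetch rate: {estimated_fetch_rate(loop):.4f}")
```

In this made-up case the loop is limited by bandwidth rather than by stalls: few requests miss, but fixing the utilization problem would cut the fetch traffic to a quarter.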

Per program, loop or instruction

All these statistical metrics can be reported for the entire program. This gives an overall view of how cache-friendly the program is and indicates the potential gain from performance improvements. The user can then drill down into the program and get metrics per loop, per source code line or even per machine code instruction, giving unprecedented insight into memory- and performance-related issues.
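The drill-down idea can be sketched as a simple aggregation. The record format below (file, source line, whether the access missed) and the file names are hypothetical and are not ThreadSpotter's report format; the point is only that the same samples yield a program-wide figure and a ranked per-line view.

```python
from collections import defaultdict

# Hypothetical samples: (source file, source line, access missed the cache?)
samples = (
    [("solver.c", 120, True)] * 3 + [("solver.c", 120, False)]
    + [("solver.c", 87, True)] + [("solver.c", 87, False)] * 3
    + [("io.c", 15, False)] * 2
)


def miss_ratio_by_line(samples):
    """Group samples by (file, line) and return the miss ratio for each group."""
    totals = defaultdict(int)
    misses = defaultdict(int)
    for file, line, missed in samples:
        key = (file, line)
        totals[key] += 1
        misses[key] += int(missed)
    return {key: misses[key] / totals[key] for key in totals}


# Whole-program view: one number for all samples.
overall = sum(int(missed) for _, _, missed in samples) / len(samples)

# Drill-down view: the same samples, ranked by miss ratio per source line.
report = miss_ratio_by_line(samples)
ranked = sorted(report.items(), key=lambda item: item[1], reverse=True)

print(f"program: {overall:.2f}")
for (file, line), ratio in ranked:
    print(f"{file}:{line}: {ratio:.2f}")
```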