TotalView Achieves Massive Milestone Towards Exascale Debugging
At the SC12 conference, Rogue Wave Software, the largest independent provider of cross-platform software development tools and embedded components for the next generation of HPC applications, announced that TotalView® has achieved a significant debugging milestone during testing conducted as part of its strategic scalability initiative. During the testing, TotalView demonstrated its capability to debug a parallel job running on 786,432 processor cores. The tests were conducted on Lawrence Livermore National Laboratory’s (LLNL) Sequoia, its IBM® Blue Gene/Q® supercomputer. These scalability tests are key to advancing Rogue Wave’s strategic business goal of providing leading tools that scale with its customers’ applications on today’s petascale computers and to ensure that TotalView is well positioned for the industry’s move towards exascale computing. Sequoia serves the National Nuclear Security Administration’s Advanced Simulation and Computing (ASC) program, a cornerstone of the effort to ensure the safety, security, and reliability of the nation’s nuclear deterrent without underground testing.
“We are actively working to increase the capabilities of our scientific codes to scale and take advantage of the phenomenal power of Sequoia. As part of this effort, we are looking for ways to get more on-node parallelism from existing codes and architecting our new codes to support the even more massive degrees of parallelism that we know will be needed in the future,” stated Scott Futral, LLNL group leader for Development Environment. “Rogue Wave’s dedication to pushing for ever-increasing scales with its TotalView debugger and the recent tests give us reason to be confident that TotalView will continue to be a critical development tool as we reach higher and higher scales with our own codes.”
Rogue Wave’s scalability initiative, a partnership with LLNL and LLNL’s Tri-Lab partners (Los Alamos National Laboratory and Sandia National Laboratories), takes a multi-architecture approach, targeting the Blue Gene/Q platform along with x86-based architectures such as the Cray® XE™. Extreme-scale testing allows TotalView engineers to identify bottlenecks and prioritize efforts in optimizing and tuning the debugging engine for scalability. During the most recent testing session, TotalView successfully scaled across 786,432 cores, with no indication of the debugger hitting any barriers.
Rogue Wave conducted this test using a hybrid MPI + OpenMP code that implements a method for solving a system of linear equations. This application, which uses MPI for distributed-memory, multi-process parallelism and OpenMP for shared-memory, thread-based parallelism, was selected because it shares important characteristics with many applications used on extreme-scale systems such as Sequoia. Testing against workloads that are representative of large-scale systems is another key aspect of meeting scalability requirements.
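The two-level structure of such a hybrid code can be illustrated with a small sketch. The announcement does not name the solver or publish the test code, so everything here is an assumption made for illustration: the Jacobi method is used as a stand-in solver, contiguous row blocks stand in for MPI processes, and a Python thread pool stands in for an OpenMP parallel loop. The function name `jacobi_hybrid_sketch` and its parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def jacobi_hybrid_sketch(A, b, num_ranks=2, threads_per_rank=2, iterations=50):
    """Jacobi iteration with rows split across simulated ranks and threads.

    Illustration only: the real test used MPI and OpenMP, and the
    release does not say which solver it implemented.
    """
    n = len(b)
    x = [0.0] * n

    def update_row(i, x_old):
        # One Jacobi update: x_i = (b_i - sum_{j != i} A_ij * x_j) / A_ii
        s = sum(A[i][j] * x_old[j] for j in range(n) if j != i)
        return (b[i] - s) / A[i][i]

    # Outer level: contiguous row blocks, standing in for MPI processes.
    block = (n + num_ranks - 1) // num_ranks
    rank_rows = [range(r * block, min((r + 1) * block, n))
                 for r in range(num_ranks)]

    for _ in range(iterations):
        x_old = x[:]
        new_x = [0.0] * n
        for rows in rank_rows:
            # Inner level: threads share one rank's rows, standing in
            # for an OpenMP parallel loop inside a single process.
            with ThreadPoolExecutor(max_workers=threads_per_rank) as pool:
                values = pool.map(lambda i: update_row(i, x_old), rows)
                for i, value in zip(rows, values):
                    new_x[i] = value
        # A real MPI code would exchange each rank's piece of the
        # solution vector here before the next iteration.
        x = new_x
    return x


# Small diagonally dominant system whose exact solution is [1, 2, 3].
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]
solution = jacobi_hybrid_sketch(A, b)
```

Python threads do not run this arithmetic in parallel, so the sketch shows only the decomposition: a debugger attached to the real job has to track both levels at once, every process and every thread inside it.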
Since there was no indication of any barrier being hit at the 786,432 core mark, the testing suggests that TotalView could have leveraged more of Sequoia’s 1.5 million cores if additional compute nodes had been available. In order to further push TotalView’s scalability, additional tests oversubscribed the machine by spinning up more than one thread per core. Rogue Wave will announce the result of this second set of tests, which demonstrate successful debugging of an even higher number of threads, on Thursday November 15th at 12:00 PM MST. Rogue Wave invites SC12 attendees to visit its booth, #3418, to participate in a competition to correctly guess the number of threads TotalView debugged. [November 15, 2012 note: TotalView debugged 1,048,576 threads.]
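For a sense of scale, the figures above can be put side by side with a few lines of arithmetic. The exact core count used below for Sequoia (1,572,864) is not stated in this announcement, which says only "1.5 million", so it is an assumption taken from the machine's published configuration. The threads-per-core ratio additionally assumes that the thread test ran on the same 786,432 cores, which the announcement does not confirm.

```python
# Figures stated in the announcement.
CORES_DEBUGGED = 786_432
THREADS_DEBUGGED = 1_048_576

# Assumption: Sequoia's full core count, which the text rounds to
# "1.5 million" without giving the exact number.
SEQUOIA_TOTAL_CORES = 1_572_864

# Share of the machine covered by the core-count test.
fraction_of_machine = CORES_DEBUGGED / SEQUOIA_TOTAL_CORES

# The thread count is an exact power of two.
is_two_to_the_20 = THREADS_DEBUGGED == 2 ** 20

# Hypothetical ratio, valid only if the thread test used the same
# 786,432 cores as the first test.
threads_per_core = THREADS_DEBUGGED / CORES_DEBUGGED

print(f"Fraction of machine: {fraction_of_machine:.0%}")
print(f"Threads equal 2**20: {is_two_to_the_20}")
print(f"Threads per core (if same cores): {threads_per_core:.2f}")
```

Under those assumptions, the first test covered half of the machine and the second averaged about one and a third threads per core.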
TotalView® is a highly scalable debugger that provides troubleshooting for a wide variety of applications, including serial, parallel, multi-threaded, multiprocess, and remote applications. Designed for developer productivity, TotalView simplifies and shortens the process of developing, debugging, and optimizing complex code. It provides a unique combination of capabilities for pinpointing and fixing hard-to-reproduce bugs, memory leaks, and performance issues. TotalView raises the bar for debugging by providing several additional features at no extra cost, including debugging for CUDA and OpenACC, and deterministic reverse debugging, which allows users to pause, rewind, and play back sessions to accurately identify and correct errors.