CUDA introduces developers to a number of new concepts (such as kernels, streams, warps, and an explicit multilevel memory hierarchy) that are not encountered in serial or other parallel programming paradigms. Visibility into these elements is critical for troubleshooting and tuning applications that use CUDA. This paper highlights concepts implemented in CUDA 3.0 through 4.0, their impact on troubleshooting CUDA applications, and how TotalView helps users deal with these CUDA-specific constructs. CUDA is frequently used alongside MPI parallelism and host-side multicore and multithreaded parallelism. The TotalView parallel debugger gives developers an integrated view of all three levels of parallelism within a single debugging session.