Performance Improvements from CUDA BLAS Integration
As GPU hardware becomes more prevalent in both research and commercial institutions, demand is growing for software that takes advantage of this specialized hardware. In many cases, it is infeasible or impossible to rewrite an existing program to run entirely on the GPU, so the goal is often to offload as much work as possible. The IMSL Fortran Library offloads CPU work to NVIDIA GPU hardware wherever it can use the CUDA BLAS library. Users with supported hardware can link the IMSL Fortran Library with CUDA BLAS version 3.1 to gain significant performance improvements for many linear algebra functions. The calling sequences for IMSL functions are unchanged, so there is no learning curve and users can be immediately productive.
This graph shows the speedup of double-precision matrix multiplication (DGEMM) across several problem sizes (500 by 500 to 8000 by 8000) using the NVIDIA CUBLAS implementation. Compared with pure Fortran code, the computation runs over 100 times faster on the GPU. Compared with hardware-optimized BLAS using 4 CPU threads, it runs up to 4 times faster.
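To make the benchmarked operation concrete, the sketch below shows what a DGEMM call computes, C = alpha*A*B + beta*C, together with the approximate operation count for the square problem sizes mentioned above. It is an illustrative reference in plain Python, not the IMSL or CUBLAS implementation; the function names `dgemm_reference`, `dgemm_flops`, and `speedup` are hypothetical and chosen only for this example.

```python
def dgemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows.

    A plain triple loop, comparable in spirit to an unoptimized baseline.
    This is an illustrative sketch, not the IMSL or CUBLAS code.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape (rows of a) x (columns of b)")
    result = []
    for i in range(m):
        out_row = []
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out_row.append(alpha * acc + beta * c[i][j])
        result.append(out_row)
    return result


def dgemm_flops(n):
    """Approximate floating-point operations for an n-by-n DGEMM.

    The product needs about n**3 multiplications and n**3 additions;
    the lower-order scaling by alpha and beta is ignored here.
    """
    return 2 * n ** 3


def speedup(baseline_seconds, accelerated_seconds):
    """Speedup factor as plotted in such a graph: baseline time / new time."""
    return baseline_seconds / accelerated_seconds


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[0.0, 0.0], [0.0, 0.0]]
    print(dgemm_reference(1.0, a, b, 0.0, c))
    for size in (500, 8000):
        print(size, dgemm_flops(size))
```

Because the work grows with the cube of the matrix dimension, the 8000 by 8000 case involves roughly 4096 times as many operations as the 500 by 500 case, which is why large problems are the natural candidates for offloading.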
The many new and updated algorithms in the IMSL Fortran Library version 7.0 provide unique numerical analysis techniques to customers in major corporations, academic institutions, and research laboratories worldwide. There are several enhancements and new functions including:
- New Feynman-Kac algorithm that solves Black-Scholes problems
- New time series modeling algorithms including Automatic ARIMA, Regression ARIMA, and AUTO-PARM to detect structural breaks in time series
- Integration of ARPACK functions including Partial Singular Value Decomposition (SVD) and symmetric, non-symmetric and complex eigenvalue problems
- New statistical functions including Partial Least Squares (PLS) Regression and Maximum Likelihood Estimation (MLE)
- Additional statistical tools including random copula routines and several noncentral distributions