Performance Improvements from CUDA BLAS Integration
As GPU hardware becomes more prevalent in both research and
commercial institutions, software that takes advantage of this specialized
hardware is growing in demand. In many cases, it is infeasible or impossible to
rewrite an existing program to run entirely on the GPU, so the goal is often to
offload as much work as possible. The IMSL C Library offloads CPU work to
NVIDIA GPU hardware through the CUDA BLAS library. Users with supported
hardware can link the IMSL C Library with CUDA BLAS version 4.0 to gain
significant performance improvements for many linear algebra functions. The
calling sequences for IMSL functions are unchanged, so there is no learning
curve and users can be immediately productive.
This graph shows the speedup of double precision matrix multiply (DGEMM) across
several problem sizes (500 square to 8000 square) using NVIDIA CUBLAS.
Compared with pure C code, the computation runs over 100 times faster on the
GPU. Compared with hardware-optimized BLAS using 4 CPU threads, it runs up to
4 times faster.
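The pure C code used as the baseline is not reproduced here. As a rough sketch of what such a baseline may look like, the function below computes the DGEMM operation C = alpha * A * B + beta * C with a plain triple loop. The name dgemm_naive, the restriction to square matrices, and the row-major storage are assumptions made for illustration; they are not part of the IMSL C Library or of CUBLAS.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical baseline, not an IMSL or CUBLAS routine.
 * Computes C = alpha * A * B + beta * C for n-by-n matrices
 * stored in row-major order (an assumption of this sketch).
 */
void dgemm_naive(size_t n, double alpha, const double *a,
                 const double *b, double beta, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            /* Dot product of row i of A with column j of B. */
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            /* Scale the product and accumulate into the existing C. */
            c[i * n + j] = alpha * sum + beta * c[i * n + j];
        }
    }
}
```

A loop of this form performs on the order of n^3 multiply-add operations with no blocking or vectorization, which is one reason an optimized BLAS, on the CPU or the GPU, can outperform it by a wide margin at the larger problem sizes.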

The many new and updated algorithms in the IMSL C Library
version 8.0 provide unique numerical analysis techniques to customers in major
corporations, academic institutions, and research laboratories worldwide.
Enhancements and new functions include:
New Optimization Functions
- Sparse Linear Programming
- Sparse Quadratic Programming
- Non-negative Matrix Factorization
Additional Regression Routines
- Partial Least Squares Regression
- Logistic Regression
Time Series Enhancements
- Auto-PARM to Estimate Structural Breaks
- Regression ARIMA
- Holt-Winters Exponential Smoothing
Statistical Functions
- False Discovery Rates
- Maximum Likelihood Estimates
- Anderson-Darling Test
- Cramer-von Mises Test
Many other new functions, including:
- SuperLU
- Non-negative Least Squares
- Non-central Beta Distribution
- N-Dimensional Spline Interpolation