CUDA Python, here we come: Nvidia offers Python devs the gift of GPU acceleration

Nvidia’s CUDA team has released version 11.5 of its parallel computing platform and pushed a Python-flavoured subproject forward in the process.

With the update, CUDA Python leaves its preview phase behind and should now be fit for general use. The collection of Cython bindings, Python API wrappers, and libraries is meant to help data scientists and developers working in HPC and ML accelerate their projects on Nvidia GPUs by giving them a simpler way to access the CUDA host APIs.
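
To give a flavour of what reaching the host APIs from Python can look like, here is a minimal sketch that asks the driver how many GPUs it can see. It assumes the cuda-python package is installed and importable via `from cuda import cuda`, and that each driver call returns a tuple whose first element is a status code. The helper name `count_gpus` is made up for illustration, and the function simply returns None on machines without the package or a working driver.

```python
def count_gpus():
    """Return the number of CUDA devices, or None if CUDA Python is unusable."""
    try:
        # Assumption: the cuda-python package exposes the driver API as cuda.cuda.
        from cuda import cuda
    except ImportError:
        return None

    try:
        # Assumption: driver calls return a tuple of (status, *results).
        (status,) = cuda.cuInit(0)
        if status != cuda.CUresult.CUDA_SUCCESS:
            return None
        status, count = cuda.cuDeviceGetCount()
    except Exception:
        # Broad on purpose: covers a missing or unloadable driver library.
        return None

    return count if status == cuda.CUresult.CUDA_SUCCESS else None


if __name__ == "__main__":
    print("CUDA devices visible:", count_gpus())
```

Checking the status code after every call mirrors how the C driver API is used, which is roughly the price of working this close to the metal.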

When used correctly, CUDA Python code should be about as performant as its C++ equivalent, Nvidia promises. The project is meant to work on every platform that supports CUDA, and needs Cython 0.29.24, pytest 6.2.4, pytest-benchmark 3.4.1, and numpy 1.21.1 (or newer versions) to be fully functional.
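
For anyone wanting to check an existing environment against those minimums, the following is a rough sketch using only the standard library. The helper names are invented for this example, the package names are assumed to match what pip reports, and the version parsing is deliberately simple (it only looks at leading digits, so pre-release suffixes are ignored).

```python
import re
from importlib import metadata

# Minimum versions as quoted above.
MINIMUMS = {
    "Cython": "0.29.24",
    "pytest": "6.2.4",
    "pytest-benchmark": "3.4.1",
    "numpy": "1.21.1",
}


def version_tuple(text):
    """Turn '1.21.1' (or '3.0.0a11') into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_requirements(minimums=MINIMUMS):
    """Map each package to True, False, or None when it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'missing' if ok is None else 'ok' if ok else 'too old'}")
```

Comparing integer tuples rather than raw strings matters here: as text, "0.29.3" would sort after "0.29.24".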

As for CUDA itself, version 11.5 has learned to work with signed and unsigned normalised 8- and 16-bit types, and has been fitted with a first version of a new __int128 data type. Developers shouldn’t expect too much from the latter yet, however, as the new type still lacks broad support in math operations, libraries, and dev tools. Starting with the current release, floating-point division should also run a bit faster when the divisor is known at compile time, though users have to enable the corresponding optimisation via nvcc -Xcicc -opt-fdiv=1 first.
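
To make the term "normalised" a little more concrete: such types store small integers that stand for values in a fixed floating-point range. The sketch below follows the convention common in graphics APIs, where unsigned codes map onto 0.0 to 1.0 and signed codes onto -1.0 to 1.0. Whether CUDA 11.5 uses exactly this mapping is an assumption to verify against the documentation, and both function names are made up.

```python
def unorm_to_float(value, bits):
    """Map an unsigned integer in [0, 2**bits - 1] onto [0.0, 1.0]."""
    max_value = (1 << bits) - 1
    if not 0 <= value <= max_value:
        raise ValueError(f"{value} does not fit in {bits} unsigned bits")
    return value / max_value


def snorm_to_float(value, bits):
    """Map a signed integer in [-2**(bits-1), 2**(bits-1) - 1] onto [-1.0, 1.0]."""
    max_value = (1 << (bits - 1)) - 1
    min_value = -(1 << (bits - 1))
    if not min_value <= value <= max_value:
        raise ValueError(f"{value} does not fit in {bits} signed bits")
    # The most negative code has no positive counterpart, so clamp it to -1.0.
    return max(value / max_value, -1.0)


if __name__ == "__main__":
    print(unorm_to_float(128, 8), snorm_to_float(-64, 8))
```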

Users looking for specific caching behaviour on the device side can now try configuring it using annotated pointers. CUDA 11.5 also includes the option of setting up per-process memory access policies for more control when multiple processes share GPUs, and adds new functions for inclusive and exclusive scans to cooperative groups.
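
For readers unfamiliar with the two scan flavours, the difference can be sketched on a plain Python list. This only illustrates the semantics: the CUDA functions work across the threads of a group rather than over a list, and seeding the exclusive variant with an identity value is an assumption of this example. The function names are chosen for readability, not taken from the CUDA API.

```python
from itertools import accumulate
from operator import add


def inclusive_scan(values, op=add):
    """Element i combines values[0] through values[i]."""
    return list(accumulate(values, op))


def exclusive_scan(values, op=add, identity=0):
    """Element i combines values[0] through values[i - 1]; element 0 is the identity."""
    values = list(values)
    if not values:
        return []
    return list(accumulate(values[:-1], op, initial=identity))


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(inclusive_scan(data))  # [1, 3, 6, 10]
    print(exclusive_scan(data))  # [0, 1, 3, 6]
```

An exclusive sum scan is the usual way to turn per-thread element counts into output offsets, which is why both variants tend to ship together.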

Other than that, the CUDA compiler team has reworked the compiler so that it can now link cubins larger than 2GB, and supports numerous pragmas for more control over diagnostic messages. Developers can now use __builtin_assume to specify an address space to allow for efficient loads and stores, and set the -arch=all or -arch=all-major options to generate code for multiple architectures at once.
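
In a Python-driven build, the new options could be wired up along the following lines. This is a sketch that only assembles an argument list and never invokes the compiler; the helper name and file names are hypothetical, and the flags are spelled exactly as quoted in this article.

```python
import shlex


def nvcc_command(source, output, arch="all-major", fast_fdiv=False):
    """Assemble an nvcc argument list; nothing is executed here."""
    if arch not in ("all", "all-major"):
        raise ValueError("arch must be 'all' or 'all-major'")
    args = ["nvcc", f"-arch={arch}"]
    if fast_fdiv:
        # Opt-in switch for the constant-divisor optimisation, as quoted in the article.
        args += ["-Xcicc", "-opt-fdiv=1"]
    args += ["-o", output, source]
    return args


if __name__ == "__main__":
    # kernel.cu is a placeholder file name.
    print(shlex.join(nvcc_command("kernel.cu", "kernel", fast_fdiv=True)))
```

Returning a list rather than a single string means the result can be handed to subprocess.run without any shell quoting worries.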

The cuBLAS library comes with the new auxiliary functions cublasGetStatusName() and cublasGetStatusString(), additional epilogue options to support fusion in DL training, and vector alpha support for per-row scaling in TN int32 math Matmul with int8 output as part of the release.

The CUDA team hasn’t deprecated any features with this update, though Nvidia driver support for Kepler is removed beginning with R495. Things look slightly different for the Math library, however, so checking the project’s documentation is advised to learn about deprecated APIs.