Better, faster, stronger: CUDA 11.6 puts finishing touches to 128-bit int support

The toolkit for Nvidia’s parallel computing platform CUDA recently got updated and is now on version 11.6. With performance and programming model enhancements in tow, the new version is meant to support a wider array of HPC and data science applications.

Highlights of the release include the GSP driver architecture becoming the default driver mode for Nvidia’s more recent Turing and Ampere GPUs, as well as a new application programming interface that allows developers to disable kernel nodes of an instantiated graph. A disabled node behaves like an empty one, though the modification only affects future launches of the graph. Node parameters are promised to stay the same while the node is disabled.
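A minimal sketch of how the new node-toggling API might be used, assuming a graph captured from a stream with a single kernel node (the kernel `step` and the error handling are illustrative placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void step() { /* some per-launch work */ }

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a one-node graph from the stream.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step<<<1, 32, 0, stream>>>();
    cudaStreamEndCapture(stream, &graph);

    // Retrieve the single captured kernel node.
    cudaGraphNode_t node;
    size_t numNodes = 1;
    cudaGraphGetNodes(graph, &node, &numNodes);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Disable the node in the instantiated graph: future launches
    // treat it like an empty node, but its parameters are preserved.
    cudaGraphNodeSetEnabled(exec, node, 0);
    cudaGraphLaunch(exec, stream);           // runs with the node skipped

    cudaGraphNodeSetEnabled(exec, node, 1);  // re-enable for later launches
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}
```

Because the toggle applies to the instantiated (executable) graph rather than the graph template, no re-instantiation is needed between launches.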

Another interesting enhancement comes in the form of new functions for the cooperative groups namespace. The additions include ways to query the dimensions and number of threads and blocks within a thread block or grid group respectively, and are meant to “improve consistency in naming, function scope, and unit dimension and size”.
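Inside a kernel, the new queries might look something like the sketch below (the exact return types shown are assumptions based on the existing cooperative groups interfaces):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void group_info_kernel(unsigned int* out) {
    cg::thread_block block = cg::this_thread_block();
    cg::grid_group grid = cg::this_grid();

    // Consistently named queries added in 11.6:
    dim3 block_dims = block.dim_threads();      // 3D shape of the block
    unsigned int threads = block.num_threads(); // thread count in the block
    dim3 grid_dims = grid.dim_blocks();         // 3D shape of the grid
    unsigned long long blocks = grid.num_blocks(); // block count in the grid

    if (block.thread_rank() == 0) {
        out[0] = threads;
        out[1] = block_dims.x * block_dims.y * block_dims.z;
    }
    (void)grid_dims; (void)blocks;
}
```

These mirror information previously available through the built-in `blockDim` and `gridDim` variables, but expose it through the group objects themselves, so code written against a group abstraction does not have to reach for globals.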

While support for 128-bit integers in CUDA C++ was already part of the last release, v11.6 takes the implementation a step further, extending support for the data type to compilers and developer tools as well.
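As a quick illustration, `__int128` can be used in device code much like any other integer type, for example to hold a full-width product of two 64-bit values (a hypothetical kernel, assuming a 64-bit host platform where nvcc exposes the type):

```cuda
// Multiply two 64-bit integers per element without overflow by
// widening to 128 bits; compile with a 64-bit nvcc toolchain.
__global__ void widening_mul(const long long* a, const long long* b,
                             __int128* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = static_cast<__int128>(a[i]) * static_cast<__int128>(b[i]);
    }
}
```

The tooling side of the change matters for exactly this kind of code: being able to inspect 128-bit values in the developer tools makes such kernels practical to debug.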

CUDA 11.6 should be able to use the latest Visual Studio 2022 as a host compiler, and will automatically prune unused kernels to improve performance. The parallel thread execution (PTX) ISA gains new instructions for creating bit masks and performing sign extension with this release.

Developers can also configure the device linker, nvlink, to generate PTX, which should be helpful in scenarios that use optimisation at device link time but require forward compatibility across GPU architectures.

As usual, the CUDA platform update includes some library enhancements, though most seem to be about performance this time around. Notable new features not belonging to that realm are a new API for computing absolute Manhattan distance transforms in NPP and options to realise fusion in deep learning training in cuBLAS.

Updating installations to version 11.6 should be relatively straightforward, though users should be aware that support for CentOS Linux 8 has been deprecated, as have device-side calls to the cudaDeviceSynchronize() function. According to Nvidia, a better-performing replacement programming model for the latter is planned for an upcoming release.
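For readers unsure which usage is affected: the deprecation concerns synchronisation from within device code under CUDA dynamic parallelism, as in the sketch below (kernels are illustrative; such code requires relocatable device code, i.e. `nvcc -rdc=true`). Host-side calls to cudaDeviceSynchronize() are not affected.

```cuda
__global__ void child_kernel(float* data) { /* some child work */ }

// A parent kernel that launches a child grid and waits for it to
// finish — the on-device wait is the pattern deprecated in 11.6.
__global__ void parent_kernel(float* data) {
    child_kernel<<<1, 64>>>(data);
    cudaDeviceSynchronize();  // deprecated when called from device code
    // ... continue using results produced by child_kernel ...
}
```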