Developers who were excited by Nvidia’s May announcement of an upcoming CUDA release can finally hop over to the company’s dev portal to download version 11 of the parallel computing platform and programming model for GPUs.
The number of those actually able to make the most of CUDA 11 seems to be comparatively small, given that its most notable features centre on support for the newest generation of Nvidia GPUs, such as the A100. Built on the new Ampere architecture, which should now work well with CUDA, the A100 was developed to help compute some of the more complex tasks found in the realms of AI, data analytics, and high-performance computing, and is also central to the company’s data centre platform.
Amongst other things, it is fitted with more streaming multiprocessors, faster memory, and special hardware like Tensor Cores, video decoder units, a JPEG decoder, and optical flow accelerators. Nvidia also claims that the A100 is able to “efficiently scale to thousands of GPUs or [..] be partitioned into seven GPU instances to accelerate workloads of all sizes”.
To make efficient use of all these improvements, CUDA needed a number of additions, which are now part of v11. Programming examples can be found in the outfit’s initial blog post from May.
Teams interested in seeing whether the new GPU could help them get their data-heavy tasks done, without committing to buying one, can try to get into an alpha programme for a new Google offering. Just yesterday the search giant cum cloud provider introduced an Ampere-based new generation of VMs for the Google Compute Engine. The Accelerator-Optimised VM (A2) family comprises five configurations ranging from 1 to 16 GPUs with 85 to 1360GB of RAM. Pricing information isn’t available at this point, though interested parties are asked to get in touch for alpha access.
But back to CUDA itself. The new release comes with added support for Ubuntu 20.04 LTS on x86_64 and Arm server platforms. It now works with the input data formats Bfloat16, TF32, and FP64, which can, for example, help to achieve higher data throughput. It is also the first version to ship the software component library CUB in the CUDA Toolkit, as it is now part of the platform’s C++ core libraries.
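To get a feel for what the reduced-precision formats trade away, here is a small Python sketch (not part of CUDA) that emulates Bfloat16 and TF32 by truncating the mantissa of a float32 value. Real hardware rounds rather than truncates, so the exact outputs are illustrative only:

```python
import struct

def truncate_float32(value, mantissa_bits):
    # Reinterpret the value as its 32-bit IEEE-754 pattern,
    # zero the low-order mantissa bits, and convert back.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits &= 0xFFFFFFFF << (23 - mantissa_bits)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

# Bfloat16 keeps 7 mantissa bits and TF32 keeps 10; both retain
# float32's 8-bit exponent, so range is preserved at lower precision.
e = 2.718281828
print(truncate_float32(e, 7))   # Bfloat16-like precision
print(truncate_float32(e, 10))  # TF32-like precision
```

Because the exponent width is unchanged, code that fits in float32’s dynamic range generally keeps working; only the number of significant digits drops.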
CUDA users looking for ways to optimise their code can give link-time optimisations, such as inlining code across files, a go. The appropriate options are part of the nvcc compiler, which should now also be able to work with C++17 and a slew of new compiler versions. It’s also worth noting that components in the CUDA toolkit are now versioned independently, giving the internal teams more control over their release plans.
In terms of CUDA tools, CUDA 11 comes with a new Compute Sanitizer. The software is meant to help developers spot out-of-bounds memory accesses and race conditions – much like cuda-memcheck, which the new tool is intended to replace.
The standalone kernel profiling helper Nsight Compute has meanwhile been fitted with a capability to generate so-called roofline models for an application. A roofline model combines floating-point performance, arithmetic intensity, and memory bandwidth into a two-dimensional plot. This can be used for optimisation purposes, and also lets users check whether a kernel’s performance is bound by the GPU’s compute throughput or by the memory bandwidth its computations need.
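The arithmetic behind a roofline plot is simple enough to sketch: attainable throughput is the minimum of peak compute and memory bandwidth multiplied by arithmetic intensity. The Python below uses Nvidia’s published A100 figures (roughly 9.7 FP64 TFLOP/s and 1,555GB/s of memory bandwidth) purely as illustrative inputs, not as a substitute for a measured profile:

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    # Roofline model: performance is capped either by the flat compute
    # "roof" or by the sloped memory-bandwidth line, whichever is lower.
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

PEAK, BW = 9700.0, 1555.0  # illustrative FP64 peak (GFLOP/s) and bandwidth (GB/s)

# A kernel doing 2 FLOPs per byte is memory-bound on these numbers...
low = attainable_gflops(PEAK, BW, 2.0)    # capped at BW * 2.0
# ...while one doing 10 FLOPs per byte hits the compute roof.
high = attainable_gflops(PEAK, BW, 10.0)  # capped at PEAK
```

Kernels landing left of the ridge point (peak divided by bandwidth) benefit most from reducing memory traffic; kernels right of it from raising compute efficiency.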
Other recent enhancements, which can also be found in the CUDA toolkit documentation, include new generic APIs in cuSPARSE for Axpby (cusparseAxpby), Scatter (cusparseScatter), Gather (cusparseGather), and Givens rotation (cusparseRot); single-process multi-GPU Cholesky factorisation capabilities (POTRF, POTRS, and POTRI) in the cusolverMG library; and the option to allocate separate memory pools for each chroma subsampling format in nvJPEG.
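Of those, Axpby is the easiest to picture: it computes y = α·x + β·y, where x is a sparse vector and y is dense. The plain-Python sketch below shows only the semantics; the real cusparseAxpby operates on device memory through vector descriptors, not Python lists:

```python
def sparse_axpby(alpha, x_indices, x_values, beta, y):
    # y := alpha * x + beta * y, with x stored in sparse
    # (index, value) form and y as a dense vector.
    y = [beta * v for v in y]
    for i, v in zip(x_indices, x_values):
        y[i] += alpha * v
    return y

# x = [1, 0, 0, 2] stored sparsely; y starts as all ones.
result = sparse_axpby(2.0, [0, 3], [1.0, 2.0], 3.0, [1.0, 1.0, 1.0, 1.0])
# result is [5.0, 3.0, 3.0, 7.0]
```

Storing x as (index, value) pairs is what makes the operation worthwhile on a GPU: work scales with the number of non-zeros rather than the full vector length.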