Parallel computing platform CUDA has been released in v10.1, bringing improvements that might be of interest to those looking to speed up their machine learning programs through GPU usage.
One of the new additions in CUDA 10.1 is cuBLASLt. Part of the cuBLAS library, which offers GPU-accelerated implementations of standard basic linear algebra subroutines, it is meant as a lightweight tool for general matrix-matrix multiply (GEMM) operations. It is packaged as a separate binary and header file, and lets users set parameters programmatically for more flexibility when choosing implementations and heuristics.
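For readers unfamiliar with the term, a GEMM operation computes alpha * A * B + beta * C for matrices A, B, C and scalars alpha and beta. The following pure-Python sketch only illustrates that definition on the CPU; it is not the cuBLASLt API, and the function name `gemm` and the sample values are made up for this example.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices given as lists of rows.

    Reference illustration of the GEMM definition only; hypothetical helper,
    not a cuBLAS or cuBLASLt call.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape (rows of a) x (columns of b)")
    result = []
    for i in range(m):
        out_row = []
        for j in range(n):
            # Dot product of row i of a with column j of b.
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            out_row.append(alpha * acc + beta * c[i][j])
        result.append(out_row)
    return result


# Small made-up example: 2x2 matrices, alpha = 2, beta = 0.5.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
d = gemm(2.0, a, b, 0.5, c)
```

Every output element can be computed independently of the others, which is why the operation maps well onto the many cores of a GPU.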
CUDA libraries of interest to machine learning enthusiasts that have been updated include cuSOLVER, cuSPARSE, cuFFT, and the Nvidia Performance Primitives (NPP) library for image, video, and signal processing. While NPP now supports the FP16 (__half) data type on Volta and newer GPU architectures as well as application-managed stream contexts, the main changes to cuFFT are improved speed and scalability.
cuSOLVER now includes a new selective eigensolver functionality for standard and generalised eigenvalue problems, a new API for computing the inverse of a symmetric positive definite matrix using Cholesky factorisation, and a faster way to compute an approximate singular value decomposition for tall, skinny matrices.
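To give an idea of what inverting a symmetric positive definite matrix via Cholesky factorisation involves, here is a minimal CPU-side sketch: factor A into L * L^T, then solve two triangular systems per column of the identity matrix. The function names `cholesky` and `spd_inverse` are hypothetical and do not correspond to cuSOLVER calls; the 2x2 input is made up.

```python
import math


def cholesky(a):
    """Return lower-triangular l with l * l^T == a (a must be SPD)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][p] * l[j][p] for p in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(d)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def spd_inverse(a):
    """Invert an SPD matrix by solving l * l^T * x = e for each unit vector e."""
    n = len(a)
    l = cholesky(a)
    inv = [[0.0] * n for _ in range(n)]
    for col in range(n):
        # Forward substitution: solve l * y = e_col.
        y = [0.0] * n
        for i in range(n):
            rhs = 1.0 if i == col else 0.0
            y[i] = (rhs - sum(l[i][p] * y[p] for p in range(i))) / l[i][i]
        # Back substitution: solve l^T * x = y.
        x = [0.0] * n
        for i in reversed(range(n)):
            x[i] = (y[i] - sum(l[p][i] * x[p] for p in range(i + 1, n))) / l[i][i]
        for i in range(n):
            inv[i][col] = x[i]
    return inv


# Made-up SPD example; the exact inverse is [[0.375, -0.25], [-0.25, 0.5]].
a_spd = [[4.0, 2.0], [2.0, 3.0]]
a_inv = spd_inverse(a_spd)
```

The factorisation fails with an error if the input is not positive definite, which doubles as a cheap validity check.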
cuSPARSE was fitted with a new COO matrix-matrix multiplication implementation, new generic sparse x dense matrix multiply APIs that encapsulate the functionality of some legacy APIs, and additional algorithms for format conversions that are supposed to be quicker while needing less memory.
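COO (coordinate) format stores a sparse matrix as three parallel lists: row indices, column indices, and values of the non-zero entries. The sketch below shows, on the CPU and in plain Python, how a COO matrix can be multiplied with a dense one. It is an illustration of the storage format under made-up data, not the cuSPARSE interface; `coo_spmm` is a hypothetical name.

```python
def coo_spmm(shape, rows, cols, vals, dense):
    """Multiply a sparse COO matrix of the given shape with a dense matrix.

    rows, cols, vals are parallel lists describing the non-zero entries.
    Hypothetical helper for illustration only.
    """
    m, k = shape
    if len(dense) != k:
        raise ValueError("dense matrix must have as many rows as the sparse one has columns")
    if not (len(rows) == len(cols) == len(vals)):
        raise ValueError("rows, cols and vals must have equal length")
    n = len(dense[0])
    out = [[0.0] * n for _ in range(m)]
    for r, c, v in zip(rows, cols, vals):
        # Each non-zero entry (r, c) adds v times row c of the dense matrix to output row r.
        for j in range(n):
            out[r][j] += v * dense[c][j]
    return out


# Made-up 2x3 sparse matrix [[1, 0, 2], [0, 3, 0]] times a dense 3x2 matrix.
product = coo_spmm(
    (2, 3),
    rows=[0, 0, 1],
    cols=[0, 2, 1],
    vals=[1.0, 2.0, 3.0],
    dense=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
)
```

Only the non-zero entries are visited, so the work scales with the number of stored values rather than with the full matrix size.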
CUDA 10.1 supports a few more operating systems than its predecessors, with Ubuntu 18.10, RHEL 7.6, Fedora 29, SUSE SLES 12.4, Windows Server 2019, and Windows 10 (October 2018 Update) added to the list.
Other than that, Nvidia Nsight Systems, a performance analysis tool with algorithm visualisation capabilities, has been updated with a reworked command line interface, amongst other things. It now supports command files and includes a new status command, and users can set a collection threshold for tracing operating system runtime events.
Windows target process sampling as well as stutter analysis reports for DX12 have also been introduced. Users should note, though, that profiling sessions of more than 5 minutes aren’t officially supported yet.
Nsight Compute, a kernel profiler for CUDA applications, was bumped to 2019.1 with an option to collect child process data, section file descriptions, support for the latest Turing GPUs and Win10 RS5, new profiling options, as well as improved performance.
CUDA is a freeware project of GPU manufacturer Nvidia and is meant to help speed up mathematical calculations by making use of a graphics processing unit’s computing cores, an approach that is most effective with parallelizable code.